Things that happen on Christmas morning

Hello friends of Oracle database administration and troubleshooting,

here’s another story for you. Once upon a time, on Christmas Day (that would be December the 25th), I got a call from work because I was the Oracle DBA on call that day. Yay.

Usually on-call duty is pretty calm, with no calls on most days, but this Christmas we had the unfortunate coincidence that our company’s Unix patch days fell right on the holidays. So I got a nice little Christmas present in the form of a call: on one database server the Oracle Grid/ASM software didn’t start correctly.

So my Christmas present was to investigate why this happened. And boy did I have fun, as it was a mixture of something quite unusual and, later on, a small human error which led to a second problem that surfaced after I had found and corrected the first one.

So after checking that processes like the HAS daemon were running, I took a look at the alert log of the ASM instance, and I found some quite interesting error messages in between the status messages.

2022-12-25T06:38:21.239192+01:00
Errors in file /u01/oracle/diag/asm/+asm/+ASM/trace/+ASM_rbal_11371.trc:
ORA-15183: ASMLIB initialization error [driver/agent not installed]
2022-12-25T06:38:21.239280+01:00
WARNING: FAILED to load library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs // {0:0:208} */
2022-12-25T06:38:21.326369+01:00
NOTE: Diskgroups listed in ASM_DISKGROUP are
ASMFR1
ASMDG1
2022-12-25T06:38:21.327622+01:00
NOTE: ASM instance +ASM is discoverable by local clients on node dbhost01
2022-12-25T06:38:21.336612+01:00
NOTE: cache registered group ASMDG1 1/0x41816392
NOTE: cache began mount (first) of group ASMDG1 1/0x41816392
NOTE: cache registered group ASMFR1 2/0x41816393
NOTE: cache began mount (first) of group ASMFR1 2/0x41816393
2022-12-25T06:38:21.371829+01:00
ERROR: no read quorum in group: required 2, found 0 disks
2022-12-25T06:38:21.372163+01:00
NOTE: cache dismounting (clean) group 1/0x41816392 (ASMDG1)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11383, image: oracle@dbhost01 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: LGWR not being messaged to dismount
NOTE: cache dismounted group 1/0x41816392 (ASMDG1)
NOTE: cache ending mount (fail) of group ASMDG1 number=1 incarn=0x41816392
NOTE: cache deleting context for group ASMDG1 1/0x41816392
2022-12-25T06:38:21.408229+01:00
GMON dismounting group 1 at 2 for pid 24, osid 11383
2022-12-25T06:38:21.409859+01:00
ERROR: diskgroup ASMDG1 was not mounted

So, okay. Failed to load library. Let’s check if the library exists and how the permissions are set:

[oracle@dbhost01:+ASM]/home/oracle: ls -la /opt/oracle/extapi/64/asm/orcl/1/libasm.so
-rwxr-xr-x 1 root root 32992 Feb 25 2020 /opt/oracle/extapi/64/asm/orcl/1/libasm.so
[oracle@dbhost01:+ASM]/home/oracle: ldd /opt/oracle/extapi/64/asm/orcl/1/libasm.so
linux-vdso.so.1 (0x00007fff144c5000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3a0c3fe000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3a0c9cc000)
[oracle@dbhost01:+ASM]/home/oracle:

Well, this looks quite good. So, what else could it be?
I had no idea, so I asked the well-known Senior DBA (okay… I typed the error into a well-known search engine).
The results weren’t great. Yes, obviously I had already checked the file permissions, but then there was a hint to use the oracleasm tool to show the actual configuration. So I did:

[oracle@dbhost01:+ASM]/home/oracle: /usr/sbin/oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=
ORACLEASM_GID=
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
ORACLEASM_SCAN_DIRECTORIES=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"
[oracle@dbhost01:+ASM]/home/oracle:

So, let’s compare this with another ASM installation.

[oracle@dbhost02:+ASM]/home/oracle: /usr/sbin/oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=5000
ORACLEASM_GID=300
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
ORACLEASM_SCAN_DIRECTORIES=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

Well, there is a slight difference between these two: on the second server, ORACLEASM_UID and ORACLEASM_GID are set to the appropriate values for the oracle user and the dba group.
So, by calling „/usr/sbin/oracleasm configure -i“ I was able to set the missing values.
ASM was now able to start and find all necessary devices, and a quick reboot of the server, to check whether the fix survives a restart, confirmed that ASM still came up fine.
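
For the record, „oracleasm configure -i“ has to be run as root and just walks you through a few prompts. It looked roughly like this with the values from the working host (the exact prompt wording may differ between oracleasm-support versions):

# /usr/sbin/oracleasm configure -i
Configuring the Oracle ASM library driver.

Default user to own the driver interface []: oracle
Default group to own the driver interface []: dba
Start Oracle ASM library driver on boot (y/n) [n]: y
Scan for Oracle ASM disks on boot (y/n) [y]: y
Writing Oracle ASM library driver configuration: done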

But now we had another problem: the database did not want to start. So, let’s take a look at what was going on there.

The Alert Log showed the following errors:

Sys-V shared memory will be used for creating SGA


2022-12-25T06:57:35.084474+01:00
Error: ENOSPC while creating shared memory segment of size 23555211264 bytes
Check the system shared memory parameters
2022-12-25T06:57:35.084533+01:00
Error: ENOSPC while creating shared memory segment of size 23555211264 bytes
Check the system shared memory parameters
.
.
.
2022-12-25T06:57:35.084803+01:00
 PAGESIZE  AVAILABLE_PAGES  EXPECTED_PAGES  ALLOCATED_PAGES  ERROR(s)
2022-12-25T06:57:35.084830+01:00
       4K       Configured              14                0  ORA-27125
2022-12-25T06:57:35.084874+01:00
    2048K            11265           11265                5  ORA-27125
2022-12-25T06:57:35.084916+01:00

2022-12-25T06:57:35.084958+01:00
SGA: Realm creation failed

Okay. What is going on here? Aren’t there enough hugepages?

[oracle@dbhost01:ORCL]/u01/oracle/ORCL/logs: cat /proc/meminfo | grep Huge
AnonHugePages: 784384 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 11275
HugePages_Free: 31
HugePages_Rsvd: 21
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 23091200 kB

Well… something is using the database’s hugepages.
Let’s check the parameters.
The database’s alert log says it wants about 22G:

Starting ORACLE instance (normal) (OS id: 17486)
2022-12-25T07:09:20.609947+01:00


Instance SGA_TARGET = 22528 MB and SGA_MAX_SIZE = 22528 MB


That should be fine: one hugepage is 2M, so a 22528 MB SGA needs 22528 / 2 = 11264 hugepages, and HugePages_Total is 11275. But HugePages_Free is only 31, so something has already grabbed nearly all of them.
There is no other database running on the machine, so where have they gone?

Well… not a database exactly, but we do have an ASM instance.
Let’s check the SGA parameters there:

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_max_size                         big integer 22G
sga_target                           big integer 22G

Wow…22G for an ASM Instance, that’s a lot. Let’s scale that down to 2G and restart the ASM.

alter system set sga_max_size=2G scope=spfile;
alter system set sga_target=2G scope=spfile;
shutdown immediate;
startup;
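
For completeness: these statements have to be run in the ASM instance, not in the database. A minimal way to get there, assuming the grid environment is already set and the SID is the default +ASM you can see in the prompts above:

export ORACLE_SID=+ASM
sqlplus / as sysasm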

After a successful restart of the ASM instance, we were now able to start the database again.
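
A quick way to confirm that ASM really gave the pages back before (or right after) starting the database is to look at /proc/meminfo again. With a 2G ASM SGA, HugePages_Free should be back up to roughly 10,000 of the 11,275 configured pages:

grep -E 'HugePages_(Total|Free)' /proc/meminfo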

But how did this happen?
Well, as is so often the case: human error. The longer you work in IT, the more often you will see things like this happen.
The plan was to change the parameter in the database, but the DBA accidentally switched into the ASM environment without really registering it and changed the parameters there. After restarting the database and seeing that seemingly nothing had changed, they changed the parameters again, this time in the correct environment, and thus had success after another database restart.

So everything was fine until the server rebooted and the ASM instance came back up with the much bigger SGA.

I hope you like these kinds of stories, as they might give you hints on how to investigate similar problems.

Things that happen on a Friday afternoon

This is a story of wrongfully removed ASMLIB devices, a second listener network that won’t network, and crashed RAC instances that perhaps could have been prevented (you tell me).

So, I’ll try to retell this story as best as I can and try to get the order of events right. This is to show you how I investigate problems and try to solve them. Are all of my solutions perfect? Absolutely not. Could we have done something differently? I’m quite sure of it. Ideas are welcome, so I may learn something.

How it began

I got a call from a colleague who told me: „Markus, I firetrucked up, I removed the wrong block devices and I already deleted the headers.“
Well, it turned out the removed and wiped devices belonged to an ASM diskgroup called REDO2 with EXTERNAL REDUNDANCY.

This was not one of my primary customers, but I was on troubleshooting duty that afternoon, so I took a look at it.
After connecting to the server I found ASM running on both nodes, and the database instances were still up as well.

The ASM alert log showed me the following (and this quite often, for three disks):

2022-10-21T15:12:29.252796+02:00
Errors in file /oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_26929.trc:
ORA-15025: could not open disk "/dev/oracleasm/disks/ASMLIB_REDO_5"
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
2022-10-21T15:12:29.252893+02:00
WARNING: Read Failed. group:5 disk:4 AU:12 offset:0 size:4096
path:Unknown disk
         incarnation:0xf0f07718 synchronous result:'I/O error'
         subsys:Unknown library krq:0x7fce78793a20 bufp:0x7fce72e8e000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: cache failed reading from group=5(REDO2) fn=8 blk=0 count=1 from disk=4 (ASMLIB_REDO_5) mirror=0 kfkist=0x20 status=0x02 osderr=0x434c5344 file=kfc.c line=13461
ERROR: cache failed to read group=5(REDO2) fn=8 blk=0 from disk(s): 4(ASMLIB_REDO_5)
ORA-15080: synchronous I/O operation failed to read block 0 of disk 4 in disk group REDO2

So, ASM was aware that something strange was going on.
I saw that the diskgroup REDO2 was missing pretty much all of its disks.
Next, check the database instances, which were still running:

2022-10-21T15:15:40.789864+02:00
WARNING: Read Failed. group:5 disk:4 AU:364 offset:16384 size:16384
path:Unknown disk
         incarnation:0xf0f06811 synchronous result:'I/O error'
         subsys:Unknown library krq:0x7fc6dc3186d0 bufp:0x7fc6d6ce3000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 265 in group [5.574653800] from disk ASMLIB_REDO_5  allocation unit 364 reason error; if possible, will try another mirror side

So yeah, the databases also knew something was going on, but they were still running.
Now, what data was on this diskgroup? The diskgroup name led me to believe it should be redo logs.
But guessing is not knowing, so I checked:

select * from (
  select name from v$controlfile
  union select member from v$logfile
  union select name from v$datafile
  union select name from v$tempfile
) where name like '+REDO%';

+REDO1/ORCL/CONTROLFILE/current.265.970226155
+REDO1/ORCL/ONLINELOG/group_1.264.1118685065
+REDO1/ORCL/ONLINELOG/group_2.258.1118685067
+REDO1/ORCL/ONLINELOG/group_3.257.1118685127
+REDO1/ORCL/ONLINELOG/group_4.263.1118685069
+REDO1/ORCL/ONLINELOG/group_5.260.1118685123
+REDO1/ORCL/ONLINELOG/group_6.259.1118685137
+REDO1/ORCL/ONLINELOG/group_7.262.1118685071
+REDO1/ORCL/ONLINELOG/group_8.261.1118685071
+REDO2/ORCL/CONTROLFILE/current.256.1118683905
+REDO2/ORCL/ONLINELOG/group_1.257.1118685067
+REDO2/ORCL/ONLINELOG/group_2.258.1118685067
+REDO2/ORCL/ONLINELOG/group_3.263.1118685129
+REDO2/ORCL/ONLINELOG/group_4.259.1118685069
+REDO2/ORCL/ONLINELOG/group_5.262.1118685123
+REDO2/ORCL/ONLINELOG/group_6.264.1118685139
+REDO2/ORCL/ONLINELOG/group_7.260.1118685071
+REDO2/ORCL/ONLINELOG/group_8.261.1118685073

Okay, redo logs and controlfiles, and the same number of files in both the REDO1 and REDO2 diskgroups. Let’s check that out.


NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string     +REDO1/ORCL/CONTROLFILE/cur
                                                 rent.265.970226155, +REDO2/ORC
                                                 L/CONTROLFILE/current.256.1118
                                                 683905

Great, the controlfile really is mirrored across both diskgroups. We can work with that. Now check the redo logs.
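
The output below comes from v$log; the query was presumably something like this:

select group#, thread#, members from v$log order by group#;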

    GROUP#    THREAD#    MEMBERS
---------- ---------- ----------
         1          1          2
         2          1          2
         3          1          2
         4          1          2
         5          2          2
         6          2          2
         7          2          2
         8          2          2

Okay. Eight groups, two threads, and two members each. So we are at 16 redo log members, 8 per diskgroup.

Next step?
Get Diskgroup REDO2 running again.

I wasn’t able to see the diskgroup correctly in our internal tooling, so I tried to mount it.
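
The mount attempt itself was nothing fancy; from the ASM instance it was presumably just a plain

alter diskgroup REDO2 mount;

which led to the following: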

2022-10-21T16:06:59.292016+02:00
ERROR: diskgroup REDO2 was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "REDO2" cannot be mounted
ORA-15040: diskgroup is incomplete

Okay, that was to be expected.

Then dismount it, drop it, whatever. I did so (roughly with the statements below), and the database instances all crashed at once.
Remember, at this point the diskgroup had not been accessible to the databases anyway; the files in it couldn’t be read.
But now, of all times, the instances crash. Okay, Boomer!
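
I no longer have the exact statements in my notes, but it was something along these lines, run from the ASM instance (the FORCE options are needed because the diskgroup can no longer be mounted cleanly):

alter diskgroup REDO2 dismount force;
drop diskgroup REDO2 force including contents;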

Next, recreate the diskgroup:

create diskgroup REDO2 external redundancy
  disk '/dev/oracleasm/disks/ASMLIB_REDO_4',
       '/dev/oracleasm/disks/ASMLIB_REDO_5',
       '/dev/oracleasm/disks/ASMLIB_REDO_6'
  attribute 'COMPATIBLE.ASM'   = '12.2.0.1.0',
            'COMPATIBLE.RDBMS' = '10.1.0.0.0',
            'AU_SIZE'          = '4M';

The diskgroup was successfully created.

Next step: we know the databases will be missing a controlfile copy, so we need to start an instance in NOMOUNT state, which led to the following messages:

Errors in file /oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_ora_20996.trc (incident=624164):
ORA-00600: internal error code, arguments: [ipc_initialize_3], [1], [16], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/orcl/ORCL1/incident/incdir_624164/ORCL1_ora_20996_i624164.trc

FIRETRUCKING FANTASTIC

A search for the error suggested that perhaps I had used the wrong OS user to start the database.
Nope, I hadn’t.
Then I read something about a network configuration not being correct?
Okay, what is going on with our networks? Let’s check the clusterware:

[oracle@node1:+ASM1]/DBA/nest/oracle/ORCL1/logs: crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       node1                    STABLE
               ONLINE  ONLINE       node2                    STABLE
ora.chad
               ONLINE  ONLINE       node1                    STABLE
               ONLINE  ONLINE       node2                    STABLE
ora.net1.network
               ONLINE  ONLINE       node1                    STABLE
               ONLINE  ONLINE       node2                    STABLE
ora.net2.network
               ONLINE  OFFLINE      node1
               ONLINE  OFFLINE      node2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.scan1.vip
      1        ONLINE  ONLINE       node2                    STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       node1                    STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       node1                    STABLE
ora.node1.vip
      1        ONLINE  ONLINE       node1                    STABLE
ora.node1_2.vip
      1        ONLINE  OFFLINE      node1
ora.node2.vip
      1        ONLINE  ONLINE       node2                    STABLE
ora.node2_2.vip
      1        ONLINE  OFFLINE      node2
--------------------------------------------------------------------------------

Okay, I cropped the output a bit and left the interesting parts for you to see.
So, we have a second network configured in our clusterware, and the VIPs belonging to this network are offline.
Let’s try to start them up.
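
I don’t recall the exact command, but it was something along the lines of starting the VIP resources directly via the clusterware, for example:

crsctl start resource ora.node1_2.vip

which only returned: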

CRS-5006: Unable to automatically select a network interface which has subnet mask 255.254.0.0 and subnet number 101.18.0.0

Say what now?
Okay, what interfaces are currently running on my machine?
I see various interfaces, but the interesting one is this:


2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 101.19.46.51/24 brd 101.19.46.255 scope global eth1
       valid_lft forever preferred_lft forever

You’re asking why? Let me tell you.
Because if we take a look at the network mentioned in the error message, this IP address falls right into its range.
Just to spell it out: 255.254.0.0 is a /15 mask, so the network with subnet number 101.18.0.0 covers the following range:

Host IPs from 101.18.0.1 to 101.19.255.254

and 101.19.46.51 on eth1 sits inside it.

Okay… was there some kind of reconfiguration in the past? I don’t know, but let’s correct this one.
So, per our documentation, the VIPs should have the following IP addresses: 101.19.23.115 and 101.19.23.116.

Let’s kick them out and create them anew:

srvctl remove vip -vip node1_2
srvctl remove vip -vip node2_2
srvctl add vip -node node1 -address 101.19.23.115/255.255.255.0 -netnum 2
srvctl add vip -node node2 -address 101.19.23.116/255.255.255.0 -netnum 2
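
What my notes don’t show is the start afterwards: the recreated VIPs still have to be brought online, presumably with something like:

srvctl start vip -vip node1_2
srvctl start vip -vip node2_2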

How is the current configuration in the clusterware?

srvctl config vip -vip node2_2
VIP exists: network number 2, hosting node node2
VIP IPv4 Address: 101.19.23.116
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
srvctl config vip -vip node1_2
VIP exists: network number 2, hosting node node1
VIP IPv4 Address: 101.19.23.115
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:

Okay. Now it’s time to try and start one instance.

srvctl start instance -db ORCL -instance ORCL1 -startoption nomount

That worked. Next, let’s start RMAN and restore the REDO2 controlfile copy from the surviving one on REDO1 (restoring from a controlfile copy writes it to every location listed in CONTROL_FILES):

rman target /
restore controlfile from '+REDO1/ORCL/CONTROLFILE/current.265.970226155';
alter database mount;
alter database open;
exit

The last thing to do is to clear the redo logs, since their members on the recreated REDO2 diskgroup no longer exist. Just for fun’s sake.

Start up sqlplus and hit it:

ALTER DATABASE CLEAR LOGFILE GROUP 1;
ALTER DATABASE CLEAR LOGFILE GROUP 2;
ALTER DATABASE CLEAR LOGFILE GROUP 3;
ALTER DATABASE CLEAR LOGFILE GROUP 4;
ALTER DATABASE CLEAR LOGFILE GROUP 5;
ALTER DATABASE CLEAR LOGFILE GROUP 6;
ALTER DATABASE CLEAR LOGFILE GROUP 7;
ALTER DATABASE CLEAR LOGFILE GROUP 8;

If a redo log group is still in use, set a checkpoint, switch the logfile, and try again:

alter system checkpoint;
alter system switch logfile;
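
To see which groups are actually safe to clear (anything that is not CURRENT or ACTIVE), a quick look at v$log helps, for example:

select group#, thread#, status from v$log order by group#;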

And now you are back online after a wonderful journey through multiple errors.

Could I have done something better? Let me know.