Oktober 2022 – Life and daily business of a Database Administrator

This is a story of wrongfully removed ASMLIB devices, a second listener network that won’t network and crashed RAC Instances that perhaps could have been prevented (you tell me)

So, I try to retell this story as best as I can and try to get the order of what happened right. This is to show you how I investigage problems and try to solve them. Are all of my solutions perfect? Absolutely not. Could we have done something different? I’m quiet sure of it. Ideas are welcome so I may learn something.

How it began

I got a call from a colleague who told me: „Markus, I firetrucked up, I removed the wrong block devices and I already delete the headers.“
Well, it turned out the removed and cleared devices belonging to an ASM Diskgroup called REDO2 with EXTERNAL REDUNDANCY.

This was none of my primary customers, but I was on troubleshooting duty this afternoon and so I looked at it.
After connecting to the server I found a running ASM on both nodes and running database instances.

The ASM Alertlog showed me the following (and this quiet often for three disks):


2022-10-21T15:12:29.252796+02:00
Errors in file /oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_26929.trc:
ORA-15025: could not open disk "/dev/oracleasm/disks/ASMLIB_REDO_5"
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
2022-10-21T15:12:29.252893+02:00
WARNING: Read Failed. group:5 disk:4 AU:12 offset:0 size:4096
path:Unknown disk
         incarnation:0xf0f07718 synchronous result:'I/O error'
         subsys:Unknown library krq:0x7fce78793a20 bufp:0x7fce72e8e000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: cache failed reading from group=5(REDO2) fn=8 blk=0 count=1 from disk=4 (ASMLIB_REDO_5) mirror=0 kfkist=0x20 status=0x02 osderr=0x434c5344 file=kfc.c line=13461
ERROR: cache failed to read group=5(REDO2) fn=8 blk=0 from disk(s): 4(ASMLIB_REDO_5)
ORA-15080: synchronous I/O operation failed to read block 0 of disk 4 in disk group REDO2

So, the ASM was aware, that something strange is going on.
I saw that the Diskgroup REDO2 was kind of missing all of it’s disks.
Next, check the database instances which where still running


2022-10-21T15:15:40.789864+02:00
WARNING: Read Failed. group:5 disk:4 AU:364 offset:16384 size:16384
path:Unknown disk
         incarnation:0xf0f06811 synchronous result:'I/O error'
         subsys:Unknown library krq:0x7fc6dc3186d0 bufp:0x7fc6d6ce3000 osderr1:0x434c5344 osderr2:0x0
         IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: failed to read mirror side 1 of virtual extent 0 logical extent 0 of file 265 in group [5.574653800] from disk ASMLIB_REDO_5  allocation unit 364 reason error; if possible, will try another mirror side

So yeah, the databases also knew there was something going on, but have still been running.
So, what data was on this diskgroup? The Diskgroup Name lead me to believe, that it should have been Redologs.
But, guessing is not knowing, so I checked:


select * from (
select name from v$controlfile
union
select member from v$logfile
union
select name from v$datafile
union
select name from v$tempfile
) where name like '+REDO%';
+REDO1/ORCL/CONTROLFILE/current.265.970226155
+REDO1/ORCL/ONLINELOG/group_1.264.1118685065
+REDO1/ORCL/ONLINELOG/group_2.258.1118685067
+REDO1/ORCL/ONLINELOG/group_3.257.1118685127
+REDO1/ORCL/ONLINELOG/group_4.263.1118685069
+REDO1/ORCL/ONLINELOG/group_5.260.1118685123
+REDO1/ORCL/ONLINELOG/group_6.259.1118685137
+REDO1/ORCL/ONLINELOG/group_7.262.1118685071
+REDO1/ORCL/ONLINELOG/group_8.261.1118685071
+REDO2/ORCL/CONTROLFILE/current.256.1118683905
+REDO2/ORCL/ONLINELOG/group_1.257.1118685067
+REDO2/ORCL/ONLINELOG/group_2.258.1118685067
+REDO2/ORCL/ONLINELOG/group_3.263.1118685129
+REDO2/ORCL/ONLINELOG/group_4.259.1118685069
+REDO2/ORCL/ONLINELOG/group_5.262.1118685123
+REDO2/ORCL/ONLINELOG/group_6.264.1118685139
+REDO2/ORCL/ONLINELOG/group_7.260.1118685071
+REDO2/ORCL/ONLINELOG/group_8.261.1118685073

Okay, Redologs and Controlfiles. The same amount on both REDO1 and REDO2 Diskgroup. Let’s check that out.


                                TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string     +REDO1/ORCL/CONTROLFILE/cur
                                                 rent.265.970226155, +REDO2/ORC
                                                 L/CONTROLFILE/current.256.1118
                                                 683905

Great, the controlfile really is a mirror. We can work with that. Now check the Redologs


    GROUP#    THREAD#    MEMBERS
---------- ---------- ----------
         1          1          2
         2          1          2
         3          1          2
         4          1          2
         5          2          2
         6          2          2
         7          2          2
         8          2          2

Okay. Eight Groups, two threads and two members. So we are at 16 Redologs, 8 per Diskgroup.

Next step?
Get Diskgroup REDO2 running again.

I wasn’t able to see the diskgroup correctly in our internal tooling, so I tried to mount it, which lead to the following:


2022-10-21T16:06:59.292016+02:00
ERROR: diskgroup REDO2 was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "REDO2" cannot be mounted
ORA-15040: diskgroup is incomplete

Okay, that was to be expected.

Then dismount. Drop, whatever. I did so and the database instances all crashed at once.
Remember, at this point the diskgroup was not accessible for the databases, files there couldn’t be accessed.
But now the instances are crashing. Okay Boomer!

Next, recreate the diskgroup:


create diskgroup REDO2
external redundancy
disk '/dev/oracleasm/disks/ASMLIB_REDO_4'
, '/dev/oracleasm/disks/ASMLIB_REDO_5', '/dev/oracleasm/disks/ASMLIB_REDO_6'
attribute 'COMPATIBLE.ASM' ='12.2.0.1.0',
'COMPATIBLE.RDBMS'='10.1.0.0.0',
'AU_SIZE'='4M'

The diskgroup was succesfully created.

Next step, we know, the databases will be missing a controlfile, so we need to restart in NOMOUNT state, which lead to the following messages:


Errors in file /oracle/diag/rdbms/orcl/ORCL1/trace/ORCL1_ora_20996.trc  (incident=624164):
ORA-00600: internal error code, arguments: [ipc_initialize_3], [1], [16], [], [], [], [], [], [], [], [], []
Incident details in: /oracle/diag/rdbms/orcl/ORCL1/incident/incdir_624164/ORCL1_ora_20996_i624164.trc

FIRETRUCKING FANTASTIC

So, a search for the error showed, that perhaps I used the wrong OS User to start the database.
Nope, didn’t.
Then I read something of a Network not being correct?
Okay, what is going on with our networks? Let’s check the clusterware


[oracle@node1:+ASM1]/DBA/nest/oracle/ORCL1/logs: crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       node1                  STABLE
               ONLINE  ONLINE       node2                  STABLE
ora.chad
               ONLINE  ONLINE       node1                  STABLE
               ONLINE  ONLINE       node2                  STABLE
ora.net1.network
               ONLINE  ONLINE       node1                  STABLE
               ONLINE  ONLINE       node2                  STABLE
ora.net2.network
               ONLINE  OFFLINE      node1                  
               ONLINE  OFFLINE      node2                  
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.scan1.vip
      1        ONLINE  ONLINE       node2                  STABLE
ora.scan2.vip
      1        ONLINE  ONLINE       node1                  STABLE
ora.scan3.vip
      1        ONLINE  ONLINE       node1                  STABLE
ora.node1.vip
      1        ONLINE  ONLINE       node1                  STABLE
ora.node1_2.vip
      1        ONLINE  OFFLINE      node1                  
ora.node2.vip
      1        ONLINE  ONLINE       node2                  STABLE
ora.node2_2.vip
      1        ONLINE  OFFLINE      node2                  
--------------------------------------------------------------------------------

Okay, I cropped the output a bit and left the interesing things there for you to see.
So, we have a second Network going on in our clusterware and the VIPs from this network are offline.
Let’s try to start them up.


CRS-5006: Unable to automatically select a network interface which has subnet mask 255.254.0.0 and subnet number 101.18.0.0

Say what now?
Okay, what interfaces are currently running on my machine?
I see various interfaces, but the interesting one is this:


2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
    inet 101.19.46.51/24 brd 101.19.46.255 scope global eth1
       valid_lft forever preferred_lft forever

You’Re asking why? Let me tell you.
Because this if we take a look at the network mentioned in our error message, this IP Adress falls into the range.
Just to let you know, the network with subnet number 101.18.0.0 and subnet mask 255.254.0.0 has the following range:

Host-IPs from: 101.18.0.1 to: 101.19.255.254

Okay…was there some kind of reconfiguration in the past? I don’t know, but let’s corect this one.
So, per documentation, the VIPs should have the following IP Adresses: 101.19.23.115 and 101.19.23.116

Let’s kick them out and create them anew

srvctl remove vip -vip node1_2
srvctl remove vip -vip node2_2
srvctl add vip -node node1 -address 101.19.23.115/255.255.255.0 -netnum 2
srvctl add vip -node node2 -address 101.19.23.116/255.255.255.0 -netnum 2

How is the current configuration in the clusterware?

srvctl config vip -vip node2_2
VIP exists: network number 2, hosting node node2
VIP IPv4 Address: 101.19.23.116
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:
srvctl config vip -vip node1_2
VIP exists: network number 2, hosting node node1
VIP IPv4 Address: 101.19.23.115
VIP IPv6 Address:
VIP is enabled.
VIP is individually enabled on nodes:
VIP is individually disabled on nodes:

Okay. Now it’s time to try and start one instance.

srvctl start instance -db ORCL -instance ORCL1 -startoption nomount

That worked and next let’s start rman and restore the REDO2 controlfile from REDO1

rman target /
restore controlfile from '+REDO1/ORCL/CONTROLFILE/current.265.970226155';
alter database mount;
alter database open;
exit

The last thing to do is clear the redo logs, just for funs sake.

Startup sqlplus and hit it:

ALTER DATABASE CLEAR LOGFILE GROUP 1;
ALTER DATABASE CLEAR LOGFILE GROUP 2;
ALTER DATABASE CLEAR LOGFILE GROUP 3;
ALTER DATABASE CLEAR LOGFILE GROUP 4;
ALTER DATABASE CLEAR LOGFILE GROUP 5;
ALTER DATABASE CLEAR LOGFILE GROUP 6;
ALTER DATABASE CLEAR LOGFILE GROUP 7;
ALTER DATABASE CLEAR LOGFILE GROUP 8;

If a redo log is still in use, set a checkpoint, switch the logfile an try again

alter system checkpoint;
alter system switch logfile;

And now you are back online from a wonderful journey through multiple errors.

Could I have done something better? Let me know.

Monat: Oktober 2022

Things that happen on a friday afternoon.

Shrink your Tablespace