Things that happen on Christmas morning

Hello friends of Oracle database administration and troubleshooting,

here’s another story for you. Once upon a time, on Christmas Day (that would be December 25th), I got a call from work because I was the Oracle DBA on call that day. Yay.

Being on call is usually pretty calm, with no calls on most days. But this Christmas we had an unfortunate coincidence: the Unix patch days in our company fell right on Christmas. So I got a nice little Christmas present in the form of a call, because on one database server the Oracle Grid/ASM software didn’t start correctly.

So my Christmas present was to investigate why this happened. And boy, did I have fun: it was a mixture of something quite unusual and, later on, a small human error that led to a second problem which only came up after I had found and corrected the first one.

So, after checking that some processes like the HAS daemon were running (a quick sanity check for that is sketched below), I took a look into the alert log of the ASM instance and found some quite interesting error messages in between the status messages.
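Such a check can look roughly like this; it is only a minimal sketch, the hostname and prompt are specific to this environment, and crsctl lives in the Grid Home’s bin directory:

[oracle@dbhost01:+ASM]/home/oracle: ps -ef | grep -E 'ohasd|asm_pmon' | grep -v grep
[oracle@dbhost01:+ASM]/home/oracle: crsctl check has
[oracle@dbhost01:+ASM]/home/oracle: crsctl stat res -t

crsctl check has verifies that the High Availability Services daemon is up, and crsctl stat res -t lists the state of the registered resources (ASM, listener, databases). Here the stack itself was up, so the alert log was the next stop: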

2022-12-25T06:38:21.239192+01:00
Errors in file /u01/oracle/diag/asm/+asm/+ASM/trace/+ASM_rbal_11371.trc:
ORA-15183: ASMLIB initialization error [driver/agent not installed]
2022-12-25T06:38:21.239280+01:00
WARNING: FAILED to load library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs // {0:0:208} */
2022-12-25T06:38:21.326369+01:00
NOTE: Diskgroups listed in ASM_DISKGROUP are
ASMFR1
ASMDG1
2022-12-25T06:38:21.327622+01:00
NOTE: ASM instance +ASM is discoverable by local clients on node dbhost01
2022-12-25T06:38:21.336612+01:00
NOTE: cache registered group ASMDG1 1/0x41816392
NOTE: cache began mount (first) of group ASMDG1 1/0x41816392
NOTE: cache registered group ASMFR1 2/0x41816393
NOTE: cache began mount (first) of group ASMFR1 2/0x41816393
2022-12-25T06:38:21.371829+01:00
ERROR: no read quorum in group: required 2, found 0 disks
2022-12-25T06:38:21.372163+01:00
NOTE: cache dismounting (clean) group 1/0x41816392 (ASMDG1)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11383, image: oracle@dbhost01 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: LGWR not being messaged to dismount
NOTE: cache dismounted group 1/0x41816392 (ASMDG1)
NOTE: cache ending mount (fail) of group ASMDG1 number=1 incarn=0x41816392
NOTE: cache deleting context for group ASMDG1 1/0x41816392
2022-12-25T06:38:21.408229+01:00
GMON dismounting group 1 at 2 for pid 24, osid 11383
2022-12-25T06:38:21.409859+01:00
ERROR: diskgroup ASMDG1 was not mounted

So, okay. Failed to load library. Let’s check if the library exists and how the permissions are set:

[oracle@dbhost01:+ASM]/home/oracle: ls -la /opt/oracle/extapi/64/asm/orcl/1/libasm.so
-rwxr-xr-x 1 root root 32992 Feb 25 2020 /opt/oracle/extapi/64/asm/orcl/1/libasm.so
[oracle@dbhost01:+ASM]/home/oracle: ldd /opt/oracle/extapi/64/asm/orcl/1/libasm.so
linux-vdso.so.1 (0x00007fff144c5000)
libc.so.6 => /lib64/libc.so.6 (0x00007f3a0c3fe000)
/lib64/ld-linux-x86-64.so.2 (0x00007f3a0c9cc000)
[oracle@dbhost01:+ASM]/home/oracle:

Well, this looks quite good. So, what else could it be?
I had no idea, so I asked the well-known senior DBA (okay… I typed the error into a well-known search engine).
The results weren’t great at first. Yes, of course, I had already checked the file permissions. But then there was a hint to use the oracleasm tool to show the actual configuration. So I did:

[oracle@dbhost01:+ASM]/home/oracle: /usr/sbin/oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=
ORACLEASM_GID=
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
ORACLEASM_SCAN_DIRECTORIES=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"
[oracle@dbhost01:+ASM]/home/oracle:

So, let’s compare this with another ASM installation.

[oracle@dbhost02:+ASM]/home/oracle: /usr/sbin/oracleasm configure
ORACLEASM_ENABLED=true
ORACLEASM_UID=5000
ORACLEASM_GID=300
ORACLEASM_SCANBOOT=true
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
ORACLEASM_SCAN_DIRECTORIES=""
ORACLEASM_USE_LOGICAL_BLOCK_SIZE="false"

Well, there is a slight difference between these two: on the second server, ORACLEASM_UID and ORACLEASM_GID are set to the appropriate values for the oracle user and the dba group.
So, by calling the command "/usr/sbin/oracleasm configure -i" I was able to set the missing values.
ASM was now able to start and find all necessary devices, and a quick reboot of the server to check whether this is reboot-proof showed that ASM still looked good.
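If you want to verify such a fix yourself, a minimal sketch can look roughly like this (the disk group names ASMDG1 and ASMFR1 are this environment’s, scandisks needs root, and asmcmd comes from the Grid Home):

[oracle@dbhost01:+ASM]/home/oracle: /usr/sbin/oracleasm configure
[root@dbhost01 ~]# /usr/sbin/oracleasm scandisks
[oracle@dbhost01:+ASM]/home/oracle: /usr/sbin/oracleasm listdisks
[oracle@dbhost01:+ASM]/home/oracle: asmcmd lsdg

oracleasm configure should now show the UID/GID filled in, scandisks rescans the devices, listdisks should list the ASMLIB disks again, and asmcmd lsdg should report the disk groups as mounted.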

But now we had another problem, because the database did not want to start. So, let’s take a look at what’s going on there.

The Alert Log showed the following errors:

Sys-V shared memory will be used for creating SGA


2022-12-25T06:57:35.084474+01:00
Error: ENOSPC while creating shared memory segment of size 23555211264 bytes
Check the system shared memory parameters
2022-12-25T06:57:35.084533+01:00
Error: ENOSPC while creating shared memory segment of size 23555211264 bytes
Check the system shared memory parameters
.
.
.
2022-12-25T06:57:35.084803+01:00
PAGESIZE AVAILABLE_PAGES EXPECTED_PAGES ALLOCATED_PAGES ERROR(s)
2022-12-25T06:57:35.084830+01:00
4K Configured 14 0 ORA-27125
2022-12-25T06:57:35.084874+01:00
2048K 11265 11265 5 ORA-27125
2022-12-25T06:57:35.084916+01:00

2022-12-25T06:57:35.084958+01:00
SGA: Realm creation failed

Okay, what is going on here? Aren’t there enough hugepages?

[oracle@dbhost01:ORCL]/u01/oracle/ORCL/logs: cat /proc/meminfo | grep Huge
AnonHugePages: 784384 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 11275
HugePages_Free: 31
HugePages_Rsvd: 21
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 23091200 kB

Well… something is using the database’s hugepages.
Let’s check the parameters.
The database’s alert log says it wants about 22G:

Starting ORACLE instance (normal) (OS id: 17486)
2022-12-25T07:09:20.609947+01:00


Instance SGA_TARGET = 22528 MB and SGA_MAX_SIZE = 22528 MB


That’s fine. One hugepage is 2 MB, so an SGA of 22528 MB needs 22528 / 2 = 11264 hugepages, and 11275 are configured. We should have enough if we look at the values above.
There is no other database running on the machine, so where have they gone?
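One way to find out is to look at the SysV shared memory segments on the box. This is just a sketch; ipcs is a standard Linux tool, and sysresv is an Oracle utility in $ORACLE_HOME/bin that maps the current ORACLE_SID to its IPC resources:

[oracle@dbhost01:ORCL]/home/oracle: ipcs -m
[oracle@dbhost01:+ASM]/home/oracle: sysresv

ipcs -m lists every shared memory segment with its owner and size, and sysresv (run with the environment of the instance you are interested in) tells you which of those segments belong to that instance.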

Well… not a database exactly, but we do have an ASM instance.
Let’s check the SGA parameters there:

SQL> show parameter sga

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sga_max_size                         big integer 22G
sga_target                           big integer 22G

Wow… 22G for an ASM instance, that’s a lot. Let’s scale that down to 2G and restart ASM (connected to the ASM instance, e.g. as SYSASM):

alter system set sga_max_size=2G scope=spfile;
alter system set sga_target=2G scope=spfile;
shutdown immediate;
startup;

After a successful restart of the ASM instance, we were now able to start the database again.
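If you want to double-check before starting the database, the hugepage usage should drop once the ASM instance runs with the smaller SGA. A quick sketch (ORCL is just the database name used in this environment):

[oracle@dbhost01:ORCL]/home/oracle: grep Huge /proc/meminfo
[oracle@dbhost01:ORCL]/home/oracle: srvctl start database -d ORCL
[oracle@dbhost01:ORCL]/home/oracle: srvctl status database -d ORCL

HugePages_Free should now be back in the thousands, and srvctl starts and checks the database through the Oracle Restart stack.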

But how did this happen?
Well, as is often the case: human error. The longer you work in IT, the more often you will see things like this.
The plan was to change the parameter in the database, but the DBA accidentally switched into the ASM environment without really noticing and changed the parameters there. After restarting the database and seeing that seemingly nothing had changed, they changed the parameters again, this time in the correct environment, and thus had success after a database restart.

So this was fine until the server rebooted and the ASM instance came back up with the much bigger SGA.
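A small habit that helps against exactly this kind of mix-up: before touching the spfile, check which instance you are really connected to. A minimal sketch in SQL*Plus:

SQL> select instance_name, status from v$instance;
SQL> show parameter instance_type

An ASM instance reports an instance name of +ASM and instance_type = asm, so a quick look here makes it obvious whether you are in the database or in ASM.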

I hope you like this kind of story, as it might give you some hints on how to investigate similar problems.
