Monday, August 17, 2015

IPC Send timeout detected - The Story of a Database

Some errors make us panic and leave us in disbelief, even as we do our best to resolve the situation. This is especially true of Clusterware: when it tries to resolve a conflict by removing dead resources from the cluster but lacks the additional resources to complete the task, it makes the whole situation even worse, and everything in contact with the key resource suffers too. The key resource we are talking about here is a RAC database, and the Clusterware in a panic is Oracle Clusterware.

Here's the environment
  1. Oracle RDBMS 11.2.0.4.4
  2. Oracle GI 11.2.0.4.4
  3. 3 Node RAC Database with vault function enabled
  4. NetApp NFS Filer based Storage


What was the issue?


On Instance 3
I noticed "IPC Send timeout" messages.
Fri Aug 14 04:46:05 2015
IPC Send timeout detected. Receiver ospid 20545 [
Fri Aug 14 04:49:55 2015
Then Oracle started suspending MMON actions one by one, as the messages below show
Fri Aug 14 10:49:07 2015
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Fri Aug 14 07:22:11 2015
Suspending MMON action 'tablespace alert monitor' for 82800 seconds

On Instance 1
Suspending MMON slave action kttetslv_ for 82800 seconds
Fri Aug 14 07:57:53 2015
IPC Send timeout detected. Sender: ospid 4812 [oracle@phyeimrac01-tst (J000)]
Receiver: inst 3 binc 4 ospid 20545
Fri Aug 14 08:03:11 2015

This usually occurs when there is high resource contention (locks/CPU/memory/storage/network): process queues grow as the demand for resources increases, and in a RAC environment the global enqueue and global cache services stop responding to each other. In the end, the instance reaches a state where it can do nothing but hang and keep trying to resolve itself. Hence we see the "Inter Process Communication" (IPC) Send timeout detected message.
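When this strikes, the quickest way to get oriented is to line up the timeout and MMON messages across all nodes' alert logs. A minimal sketch, assuming this cluster's 11g diag layout and the teim instance names seen in the trace paths (adjust both for your site):

```shell
# Scan each instance's alert log for the two symptoms quoted above.
# Diag paths and instance names (teim1..teim3) are illustrative.
for inst in teim1 teim2 teim3; do
  log="/u01/app/oracle/diag/rdbms/teim/${inst}/trace/alert_${inst}.log"
  echo "=== ${inst} ==="
  # -B1 also prints the timestamp line that precedes each message
  grep -B1 -E 'IPC Send timeout|Suspending MMON' "$log" 2>/dev/null || true
done
```

Seeing the sender ospid on one node match the receiver ospid on another confirms which instance is the one that stopped answering.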

I came to know from the AWR report that it was due to a massive load on the CPUs. Here's the screenshot from the AWR report.



How was it resolved?

Well, during the issue, I was completely helpless to see what was going on. Nothing worked, period. srvctl, crsctl - nothing worked. Fortunately, a connection to the SQL*Plus prompt worked on the unaffected instances in the cluster, i.e., instance 1 and instance 2, but then again, the GV$ views did not work. And, strangely enough, instance eviction was not happening at all while everything just hung. Meanwhile, the real culprits, the application people, kept complaining: "Man, performance is terrible, please do something, it's not moving at all" :). How many times do we hear this? Quite often during the benchmark phase of a development project. They keep changing the code left, right, and center and assume whatever they do will work. Eventually, it creates a nightmare for the DBA.

As we have GoldenGate configured on node 1 (its VIP runs on node 1), I first brought down all the replicats, and then brought down instance 1 and instance 2 from the SQL*Plus prompt with SHUTDOWN ABORT, because srvctl did not respond at all. That makes sense: srvctl in a Clusterware environment is driven by Clusterware itself, for the mutual benefit of Clusterware resources. At this stage: two down, one to go. Well, the only way to bring down instance 3 was to reboot node 3, so I asked the server team to reboot it.
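For completeness, the out-of-band abort can be sketched like this (hedged: the SID names are examples, OS authentication as the oracle user is assumed, and the actual sqlplus call is left commented out so the sketch is safe to run anywhere):

```shell
# Abort one instance from its local node when srvctl/crsctl hang.
# Run once per surviving node; SIDs are examples for this cluster.
abort_instance() {
  export ORACLE_SID="$1"
  # OS authentication assumed; uncomment on a real node:
  # sqlplus -S / as sysdba <<'SQL'
  # SHUTDOWN ABORT
  # SQL
  echo "would run SHUTDOWN ABORT against ${ORACLE_SID}"
}

abort_instance teim1   # on node 1
abort_instance teim2   # on node 2
```

The point is that SQL*Plus talks to the instance directly and does not depend on the (hung) Clusterware stack the way srvctl does.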

Here comes the real devil - "file 1 needs to be either taken out of backup mode or media recovered"

Node 3 was rebooted. You'd think we were all set now, eh? Well, here came the devil: the database was not coming up, and every instance was failing to open it with the error below.

Abort recovery for domain 0
Errors in file /u01/app/oracle/diag/rdbms/teim/teim1/trace/teim1_ora_11744.trc:
ORA-10873: file 1 needs to be either taken out of backup mode or media recovered
ORA-01110: data file 1: '/oracle/oradata/teim/data1/system01.dbf'
ORA-10873 signalled during: ALTER DATABASE OPEN /* db agent *//* {2:30594:28372} */... 

At this point I still did not know what else had happened. For a moment I thought: what the heck did I do? Technically, I had just aborted all the instances and expected SMON to perform instance crash recovery by replaying what's in the redo (rolling forward committed changes, rolling back uncommitted ones, and making everything consistent), and then we should have been all set for operations again. So why on earth was it asking me to perform media recovery? Not even in my dreams could I have imagined this.

Ok, this had to be troubleshot now. Panic on top of panic, eh? Don't panic. Come on, Oracle!!

While digging into the issue
We use NetApp snapshots for database backup; it is basically a traditional Oracle hot backup. Yes, the one that makes the redo logs' life miserable :). The script we have in place works like this:

  1. First put the Database in BEGIN BACKUP mode - Done 
  2. Make a log switch to archive - Done 
  3. "grep" for "oradata" from "df -h" and get the list of "Database file system" volumes - Hanging
  4. Create snapshots of each volume with the date - Forget it
  5. Make a log switch again to archive - Forget it
  6. End backup with "ALTER DATABASE END BACKUP" - Forget it

Well, step 3 was hanging because there were some stale NFS mounts. This was the devil. The script never reached the END BACKUP step and left all the database files in ACTIVE backup mode.
We were not aware of this while resolving the other issue, "IPC Send timeout detected". Of course, why would we have thought of a backup issue? We didn't.
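In shell, the flow above with the missing guard added looks roughly like this. This is a hedged sketch, not our actual script: the sqlplus calls assume OS authentication, `snap create` stands in for the real NetApp snapshot command, and the 30-second limit is arbitrary.

```shell
#!/bin/sh
# Hedged sketch of the snapshot backup flow, with the guard the original
# script lacked: a stale NFS mount now makes the df scan fail fast instead
# of leaving the database stuck in ACTIVE backup mode.

run_sql() {
  # OS-authenticated sqlplus, assumed available on the DB node.
  sqlplus -S / as sysdba
}

snapshot_backup() {
  echo "ALTER DATABASE BEGIN BACKUP;" | run_sql
  echo "ALTER SYSTEM ARCHIVE LOG CURRENT;" | run_sql

  # Step 3, guarded: give df 30 seconds, then bail out instead of hanging.
  df_out=$(timeout 30 df -P) || {
    echo "df did not return within 30s - check for stale NFS mounts" >&2
    echo "ALTER DATABASE END BACKUP;" | run_sql   # never leave backup mode on
    return 1
  }
  vols=$(printf '%s\n' "$df_out" | awk '/oradata/ {print $6}')

  for v in $vols; do
    echo "snap create ${v}"   # placeholder for the real NetApp snapshot call
  done

  echo "ALTER SYSTEM ARCHIVE LOG CURRENT;" | run_sql
  echo "ALTER DATABASE END BACKUP;" | run_sql
}
```

With a guard like this, a hung filesystem scan ends backup mode explicitly before the script dies, instead of leaving the datafiles in backup mode for the next crash recovery to trip over.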

Alright, no need to panic. Follow the steps below to bring the database up again.

Bring the Database to mount state
$srvctl start database -d <db_unique_name> -o mount
End the backup
SQL>ALTER DATABASE END BACKUP;
Make sure no files remain in backup mode (every file should show STATUS = NOT ACTIVE)
SQL>select * from gv$backup;
Bring the Database down
$srvctl stop database -d <db_unique_name>
Bring the Database up
$srvctl start database -d <db_unique_name>

Database is up, take a deep breath :)

I hope you have enjoyed reading this post as much as I have enjoyed resolving the issue. Please feel free to comment and share.

Thanks,
Naga