查看原文
其他

震惊:Oracle 11gR2 RAC ADG 并没有高可用

JiekeXu JiekeXu DBA之路 2024-03-03

作者 | JiekeXu

来源 | JiekeXu之路(ID: JiekeXu_IT)

转载请联系授权 | (微信ID:xxq1426321293)

大家好,我是 JiekeXu,很高兴又和大家见面了,今天分享一篇 Oracle 11gR2 RAC ADG 并没有高可用的案例本文首发于微信公众号【JiekeXu之路】,欢迎点击上方蓝字关注我,标星或置顶,更多干货第一时间到达!

摘  要:上个月中的一次主机超过运行天数重启主机时发现 Oracle 11gR2 RAC ADG 并没有高可用,Linux 下 11204 的物理备库由于节点一主机关闭导致节点二异常关闭。


因 ADG 灾备环境在上个月时主机超过运行天数需要重启主机 RAC1 即节点一,便直接关闭了 GI 集群,理所当然本节点的数据库将正常关闭,但是却发现另一个节点数据库状态是 Mounted 的,这就有点奇怪了,下面来一起看看。

--检查集群状态全为 ONLINE,数据库为 Open,Readonly JiekeXu-rac1:/home/grid$crsctl stat res -t--------------------------------------------------------------------------------NAME TARGET STATE SERVER STATE_DETAILS --------------------------------------------------------------------------------Local Resources--------------------------------------------------------------------------------ora.ARCH.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.DATA.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.LISTENER.lsnr ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.OCR.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.asm ONLINE ONLINE JiekeXu-rac1 Started ONLINE ONLINE JiekeXu-rac2 Started ora.gsd OFFLINE OFFLINE JiekeXu-rac1 OFFLINE OFFLINE JiekeXu-rac2 ora.net1.network ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.ons ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.registry.acfs ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 --------------------------------------------------------------------------------Cluster Resources--------------------------------------------------------------------------------ora.LISTENER_SCAN1.lsnr 1 ONLINE ONLINE JiekeXu-rac1 ora.cvu 1 ONLINE ONLINE JiekeXu-rac1 ora.JiekeXuadg.db 1 ONLINE ONLINE JiekeXu-rac1 Open,Readonly 2 ONLINE ONLINE JiekeXu-rac2 Open,Readonly ora.JiekeXu-rac1.vip 1 ONLINE ONLINE JiekeXu-rac1 ora.JiekeXu-rac2.vip 1 ONLINE ONLINE JiekeXu-rac2 ora.oc4j 1 ONLINE ONLINE JiekeXu-rac1 ora.scan1.vip 1 ONLINE ONLINE JiekeXu-rac1 JiekeXu-rac1:/home/grid$JiekeXu-rac1:/home/grid$exitJiekeXu-rac1:/home/oracle$sqlplus / as sysdbaSQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:39:51 2020Copyright (c) 1982, 2013, Oracle. All rights reserved.Connected to:Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit ProductionWith the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,Data Mining and Real Application Testing optionsSQL> set linesize 600; col NAME for a20; col PCT_FREE for a30; select GROUP_NUMBER,NAME,TYPE,STATE,total_mb/1024 total_gb,free_mb/1024 free_gb,round((free_mb/total_mb)*100,2)||'%' pct_free from v$asm_diskgroup;SQL> SQL> SQL> GROUP_NUMBER NAME TYPE STATE TOTAL_GB FREE_GB PCT_FREE------------ -------------------- ------ ----------- ---------- ---------- ------------------------------ 1 OCR NORMAL MOUNTED 3 2.09570313 69.86% 2 ARCH EXTERN CONNECTED 100 21.8701172 21.87% 3 DATA EXTERN CONNECTED 200 12.4785156 6.24%--检查数据库磁盘使用率,主备同步情况SQL> set linesize 150; set pagesize 20; column name format a13; column value format a20; column unit format a30; column TIME_COMPUTED format a30; select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag'); SQL> SQL> SQL> SQL> SQL> SQL> NAME VALUE UNIT TIME_COMPUTED------------- -------------------- ------------------------------ ------------------------------transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 21:39:58SQL> set line 345 col instance_name for a15 col host_name for a30 select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance; SQL> SQL> SQL> INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME---------- --------------- ------------------------------ ------------ ----------------- ------------------- 1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-08-16 03:36:59 2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57
--检查数据库主机名,状态,启动时间,打开模式等等SQL> select name,DATABASE_ROLE,OPEN_MODE,DB_UNIQUE_NAME,GUARD_STATUS,CREATED from gv$database;
NAME DATABASE_ROLE OPEN_MODE DB_UNIQUE_NAME GUARD_S CREATED------------- ---------------- -------------------- ------------------------------ ------- -------------------JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19JiekeXu PHYSICAL STANDBY READ ONLY WITH APPLY JiekeXuadg NONE 2018-10-09 09:48:19--检查 MRP0 进程应用日志情况SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;
PROCESS STATUS SEQUENCE# THREAD#--------- ------------ ---------- ----------ARCH CLOSING 4450 2ARCH CLOSING 4471 2ARCH CONNECTED 0 0ARCH CLOSING 4477 2ARCH CLOSING 4508 2ARCH CLOSING 4477 1ARCH CLOSING 4475 1ARCH CLOSING 4446 2ARCH CLOSING 4476 1ARCH CLOSING 4476 2MRP0 APPLYING_LOG 4513 2RFS IDLE 0 0RFS       IDLE                  0          0
13 rows selected.SQL> exit

查看应用连接,直接停止集群,然后 reboot

JiekeXu-rac1:/home/oracle$exitJiekeXu-rac1:~ # cd /app/product/11.2.0/grid/binJiekeXu-rac1:/app/product/11.2.0/grid/bin # ps -ef | grep LOCAL=NO | grep -v grep | awk '{print $2}' | wc -l 2JiekeXu-rac1:/app/product/11.2.0/grid/bin # ./crsctl stop crs CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.crsd' on 'JiekeXu-rac1'CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.ARCH.dg' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.OCR.dg' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.registry.acfs' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.JiekeXuadg.db' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.oc4j' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.cvu' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.cvu' on 'JiekeXu-rac1' succeededCRS-2672: Attempting to start 'ora.cvu' on 'JiekeXu-rac2'CRS-2677: Stop of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.scan1.vip' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1'CRS-2676: Start of 'ora.cvu' on 'JiekeXu-rac2' succeededCRS-2677: Stop of 'ora.ARCH.dg' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.registry.acfs' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.scan1.vip' on 'JiekeXu-rac1' succeededCRS-2672: Attempting to start 'ora.scan1.vip' on 'JiekeXu-rac2'CRS-2677: Stop of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac1' succeededCRS-2672: Attempting to start 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2'CRS-2676: Start of 'ora.scan1.vip' on 'JiekeXu-rac2' succeededCRS-2672: Attempting to start 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2'CRS-2676: Start of 'ora.JiekeXu-rac1.vip' on 'JiekeXu-rac2' succeededCRS-2677: Stop of 'ora.oc4j' on 'JiekeXu-rac1' succeededCRS-2672: Attempting to start 'ora.oc4j' on 'JiekeXu-rac2'CRS-2676: Start of 'ora.LISTENER_SCAN1.lsnr' on 'JiekeXu-rac2' succeededCRS-2677: Stop of 'ora.JiekeXuadg.db' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.DATA.dg' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.DATA.dg' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.OCR.dg' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeededCRS-2676: Start of 'ora.oc4j' on 'JiekeXu-rac2' succeededCRS-2673: Attempting to stop 'ora.ons' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.ons' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.net1.network' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.net1.network' on 'JiekeXu-rac1' succeededCRS-2792: Shutdown of Cluster Ready Services-managed resources on 'JiekeXu-rac1' has completedCRS-2677: Stop of 'ora.crsd' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.mdnsd' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.ctssd' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.evmd' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.asm' on 'JiekeXu-rac1'CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.ctssd' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.evmd' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.mdnsd' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.drivers.acfs' on 'JiekeXu-rac1' succeededCRS-2677: Stop of 'ora.asm' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.cssd' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.cssd' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.crf' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.crf' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.gipcd' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.gipcd' on 'JiekeXu-rac1' succeededCRS-2673: Attempting to stop 'ora.gpnpd' on 'JiekeXu-rac1'CRS-2677: Stop of 'ora.gpnpd' on 'JiekeXu-rac1' succeededCRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'JiekeXu-rac1' has completedCRS-4133: Oracle High Availability Services has been stopped.JiekeXu-rac1:/app/product/11.2.0/grid/bin # Broadcast message from root (pts/0) (Fri Oct 16 21:45:46 2020):
The system is going down for reboot NOW!Connection to 192.16.11X.1XX closed by remote host.Connection to 192.16.11X.1XX closed.bye

重启主机后检查集群状态

JiekeXu-rac1:/home/grid$crsctl stat res -t --------------------------------------------------------------------------------NAME TARGET STATE SERVER STATE_DETAILS --------------------------------------------------------------------------------Local Resources--------------------------------------------------------------------------------ora.ARCH.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.DATA.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.LISTENER.lsnr ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.OCR.dg ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.asm ONLINE ONLINE JiekeXu-rac1 Started ONLINE ONLINE JiekeXu-rac2 Started ora.gsd OFFLINE OFFLINE JiekeXu-rac1 OFFLINE OFFLINE JiekeXu-rac2 ora.net1.network ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.ons ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 ora.registry.acfs ONLINE ONLINE JiekeXu-rac1 ONLINE ONLINE JiekeXu-rac2 --------------------------------------------------------------------------------Cluster Resources--------------------------------------------------------------------------------ora.LISTENER_SCAN1.lsnr 1 ONLINE ONLINE JiekeXu-rac2 ora.cvu 1 ONLINE ONLINE JiekeXu-rac2 ora.JiekeXuadg.db      1        ONLINE  OFFLINE                               Instance Shutdown                    2 ONLINE INTERMEDIATE JiekeXu-rac2 Mounted (Closed) ora.JiekeXu-rac1.vip 1 ONLINE ONLINE JiekeXu-rac1 ora.JiekeXu-rac2.vip 1 ONLINE ONLINE JiekeXu-rac2 ora.oc4j 1 ONLINE ONLINE JiekeXu-rac2 ora.scan1.vip 1 ONLINE ONLINE JiekeXu-rac2

启动节点一数据库,应用 MRP0 进程检查同步情况

JiekeXu-rac1:/home/grid$exitlogoutJiekeXu-rac1:~ # su - oracle JiekeXu-rac1:/home/oracle$JiekeXu-rac1:/home/oracle$sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 22:13:25 2020
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startupORACLE instance started.
Total System Global Area 7516033024 bytesFixed Size 2267872 bytesVariable Size 1996490016 bytesDatabase Buffers 5502926848 bytesRedo Buffers 14348288 bytesDatabase mounted.Database opened.SQL> set linesize 150; set pagesize 20; column name format a13; column value format a20; column unit format a30; column TIME_COMPUTED format a30; select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag'); SQL> SQL> SQL> SQL> SQL> SQL> NAME VALUE UNIT TIME_COMPUTED------------- -------------------- ------------------------------ ------------------------------transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:13:58apply lag day(2) to second(0) interval 10/16/2020 22:13:58
SQL> alter database recover managed standby database using current logfile disconnect from session;
Database altered.
SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
NAME VALUE UNIT TIME_COMPUTED------------- -------------------- ------------------------------ ------------------------------transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:31apply lag day(2) to second(0) interval 10/16/2020 22:14:31
SQL> SELECT PROCESS, STATUS,SEQUENCE#,thread# FROM V$MANAGED_STANDBY;
PROCESS STATUS SEQUENCE# THREAD#--------- ------------ ---------- ----------ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0ARCH CONNECTED 0 0RFS IDLE 0 0RFS IDLE 4516 1RFS IDLE 4516 2RFS IDLE 0 0MRP0 APPLYING_LOG 4516 2
15 rows selected.
SQL> select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag');
NAME VALUE UNIT TIME_COMPUTED------------- -------------------- ------------------------------ ------------------------------transport lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39apply lag +00 00:00:00 day(2) to second(0) interval 10/16/2020 22:14:39
SQL> SQL> exitDisconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit ProductionWith the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,Data Mining and Real Application Testing options

登陆节点二查看数据库状态也是变成了 MOUNTED 这里并没有做任何操作,实在是奇怪吧。

JiekeXu-rac1:/home/oracle$ssh JiekeXu-rac2Last login: Sun Aug 16 02:34:33 2020 from JiekeXu-rac1JiekeXu-rac2:/home/oracle$sqlplus / as sysdba SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 16 21:56:04 2020
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit ProductionWith the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,Data Mining and Real Application Testing options
SQL> set line 345 col instance_name for a15 col host_name for a30 select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance; SQL> SQL> SQL> INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME---------- --------------- ------------------------------ ------------ ----------------- ------------------- 2 JiekeXuadg2 JiekeXu-rac2 MOUNTED 11.2.0.4.0 2020-08-16 02:01:57         1 JiekeXuadg1      JiekeXu-rac1         OPEN         11.2.0.4.0        2020-10-16 22:13:29         --这里数据库是 MOUNTED 由之前的 open 直接变成了 MOUNTED。--故在这里说一句,众所周知数据库启动过程为started、mounted、open 三阶段,--那么关闭过程是对应有三阶段的过程 close 到 MOUNTED,dismount 到 STARTED 然后 shutdown 关闭实例。--如下所示,单点数据库实例关闭过程 SYS@JiekeXu>col HOST_NAME for a18SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS---------------- ------------------ ----------------- ------------JiekeXu JiekeXu 11.2.0.4.0 OPEN
SYS@JiekeXu>alter database close;
Database altered.
SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS---------------- ------------------ ----------------- ------------JiekeXu JiekeXu 11.2.0.4.0 MOUNTED
SYS@JiekeXu>alter database dismount;
Database altered.
SYS@JiekeXu>select INSTANCE_NAME,HOST_NAME,VERSION,STATUS from v$instance;
INSTANCE_NAME HOST_NAME VERSION STATUS---------------- ------------------ ----------------- ------------JiekeXu JiekeXu 11.2.0.4.0 STARTED
SYS@JiekeXu>shutdown;ORA-01507: database not mounted
ORACLE instance shut down.SYS@JiekeXu>               

这里节点二直接 open 数据库检查同步情况

SQL> alter database open;
Database altered.SQL> set line 345 col instance_name for a15 col host_name for a30 select inst_id,INSTANCE_NAME,HOST_name,status,version,STARTUP_TIME from gv$instance; SQL> SQL> SQL> INST_ID INSTANCE_NAME HOST_NAME STATUS VERSION STARTUP_TIME---------- --------------- ------------------------------ ------------ ----------------- ------------------- 2 JiekeXuadg2 JiekeXu-rac2 OPEN 11.2.0.4.0 2020-08-16 02:01:57 1 JiekeXuadg1 JiekeXu-rac1 OPEN 11.2.0.4.0 2020-10-16 22:13:29
SQL> set linesize 150; set pagesize 20; column name format a13; column value format a20; column unit format a30; column TIME_COMPUTED format a30; select name,value,unit,time_computed from v$dataguard_stats where name in ('transport lag','apply lag'); SQL> SQL> SQL> SQL> SQL> SQL> NAME VALUE UNIT TIME_COMPUTED------------- -------------------- ------------------------------ ------------------------------transport lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52apply lag +00 00:00:00 day(2) to second(0) interval 10/17/2020 00:52:52
--查看数据库补丁 PSU 11.2.0.4.180717SQL> set line 150SQL> col ACTION_TIME for a30SQL> col ACTION for a8col NAMESPACE for a8col VERSION for a10col BUNDLE_SERIES for a5col COMMENTS for a20select * from dba_registry_history;
ACTION_TIME ACTION NAMESPAC VERSION ID BUNDL COMMENTS------------------------------ -------- -------- ---------- ---------- ----- --------------------24-AUG-13 12.03.45.119862 PM APPLY SERVER 11.2.0.4 0 PSU Patchset 11.2.0.2.009-OCT-18 09.49.44.442749 AM APPLY SERVER 11.2.0.4 180717 PSU PSU 11.2.0.4.180717

下面则通过 alert 日志查看问题

--节点 2 的 alert 日志
Fri Oct 16 21:41:17 2020Reconfiguration started (old inc 12, new inc 14)List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Oct 16 21:41:17 2020 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resourcesReconfiguration complete
--集群重配完成后实例直接 crashFri Oct 16 21:41:18 2020Recovery session aborted due to instance crashClose the database due to aborted recovery sessionSMON: disabling tx recoveryAll dispatchers and shared servers shutdownCLOSE: killing server sessions.CLOSE: all sessions shutdown successfully.SMON: disabling tx recoverySMON: disabling cache recoveryFri Oct 16 21:41:29 2020RFS[4]: Possible network disconnect with primary databaseFri Oct 16 21:41:32 2020idle dispatcher 'D000' terminated, pid = (27, 1)Fri Oct 16 21:41:50 2020RFS[5]: Possible network disconnect with primary databaseFri Oct 16 21:42:17 2020Decreasing number of real time LMS from 2 to 0Fri Oct 16 21:47:00 2020Primary database is in MAXIMUM PERFORMANCE modeRFS[8]: Assigned to RFS process 27227RFS[8]: Selected log 5 for thread 1 sequence 4514 dbid 1813902435 branch 989056101Fri Oct 16 21:47:01 2020Primary database is in MAXIMUM PERFORMANCE modeRFS[9]: Assigned to RFS process 27236RFS[9]: Selected log 9 for thread 2 sequence 4514 dbid 1813902435 branch 989056101Fri Oct 16 21:47:01 2020RFS[10]: Assigned to RFS process 27264RFS[10]: Selected log 8 for thread 2 sequence 4513 dbid 1813902435 branch 989056101Fri Oct 16 21:47:02 2020Archived Log entry 5708 added for thread 2 sequence 4513 ID 0x6fa9692f dest 1:Fri Oct 16 21:47:54 2020Archived Log entry 5709 added for thread 1 sequence 4514 ID 0x6fa9692f dest 1:Fri Oct 16 21:47:54 2020RFS[8]: Selected log 5 for thread 1 sequence 4515 dbid 1813902435 branch 989056101Fri Oct 16 21:47:55 2020RFS[9]: Selected log 8 for thread 2 sequence 4515 dbid 1813902435 branch 989056101Fri Oct 16 21:47:55 2020Archived Log entry 5710 added for thread 2 sequence 4514 ID 0x6fa9692f dest 1:Fri Oct 16 21:52:25 2020Reconfiguration started (old inc 14, new inc 16)List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablishedFri Oct 16 21:52:25 2020 * domain 0 valid = 0 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Oct 16 21:52:25 2020 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survivedFri Oct 16 21:52:25 2020 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resourcesReconfiguration completeFri Oct 16 21:52:32 2020Managed Standby Recovery starting Real Time ApplyFri Oct 16 21:53:25 2020Increasing number of real time LMS from 0 to 2Reconfiguration started (old inc 16, new inc 18)List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Oct 16 21:53:34 2020Fri Oct 16 21:53:34 2020 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resourcesReconfiguration completeFri Oct 16 21:53:51 2020RFS[11]: Assigned to RFS process 30085RFS[11]: Selected log 6 for thread 1 sequence 4513 dbid 1813902435 branch 989056101Fri Oct 16 21:53:53 2020Archived Log entry 5711 added for thread 1 sequence 4513 ID 0x6fa9692f dest 1:Fri Oct 16 21:54:35 2020Decreasing number of real time LMS from 2 to 0Reconfiguration started (old inc 18, new inc 20)List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablishedFri Oct 16 21:54:44 2020 * domain 0 valid = 0 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Oct 16 21:54:44 2020 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survivedFri Oct 16 21:54:44 2020 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resourcesReconfiguration completeFri Oct 16 21:54:51 2020Managed Standby Recovery starting Real Time ApplyFri Oct 16 21:54:54 2020Archived Log entry 5712 added for thread 1 sequence 4515 ID 0x6fa9692f dest 1:Fri Oct 16 21:54:55 2020Archived Log entry 5713 added for thread 2 sequence 4515 ID 0x6fa9692f dest 1:Fri Oct 16 21:55:33 2020Managed Standby Recovery starting Real Time ApplyFri Oct 16 21:55:33 2020Reconfiguration started (old inc 20, new inc 22)List of instances: 1 2 (myinst: 2) Global Resource Directory frozen Communication channels reestablishedFri Oct 16 21:55:33 2020 * domain 0 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned outFri Oct 16 21:55:33 2020 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survivedFri Oct 16 21:55:33 2020 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resourcesReconfiguration completeFri Oct 16 21:56:23 2020alter database openPicked Lamport scheme to generate SCNsFri Oct 16 21:56:23 2020SMON: enabling cache recoveryNo Resource Manager plan activePhysical standby database opened for read only access.Completed: alter database openFri Oct 16 21:56:34 2020Increasing number of real time LMS from 0 to 2Fri Oct 16 23:58:42 2020

从日志中发现 "Close the database due to aborted recovery session"  节点二数据库 close 是因为节点一 recover session 终止了,下面查看节点一的日志直接 abort 了,没有相信的日志信息。

节点 1 的 alert 日志--21:41:16 节点一直接 abort 没有相信日志信息,相当于异常断电 Fri Oct 16 21:41:16 2020Shutting down instance (abort)License high water mark = 13USER (ospid: 26959): terminating the instanceInstance terminated by USER, pid = 26959Fri Oct 16 21:41:18 2020Instance shutdown complete
-- 22:11:08 数据库再一次启动,大页相关配置信息,可见这里并没有配置,不过后面有推荐设置 3601Fri Oct 16 22:11:08 2020Starting ORACLE instance (normal)************************ Large Pages Information *******************Per process system memlock (soft) limit = UNLIMITED
Total Shared Global Region in Large Pages = 0 KB (0%)
Large Pages used by this instance: 0 (0 KB)Large Pages unused system wide = 0 (0 KB)Large Pages configured system wide = 0 (0 KB)Large Page size = 2048 KB
RECOMMENDATION: Total System Global Area size is 7202 MB. For optimal performance, prior to the next instance restart: 1. Increase the number of unused large pages by at least 3601 (page size 2048 KB, total size 7202 MB) system wide to get 100% of the System Global Area allocated with large pages********************************************************************LICENSE_MAX_SESSION = 0LICENSE_SESSIONS_WARNING = 0Initial number of CPU is 4Private Interface 'eth1:1' configured from GPnP for use as a private interconnect. [name='eth1:1', type=1, ip=169.254.111.0, mac=00-50-56-b9-xx-xx, net=10.xx.xx.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]Public Interface 'eth0' configured from GPnP for use as a public interface. [name='eth0', type=1, ip=10.xx.xx.8, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]Public Interface 'eth0:1' configured from GPnP for use as a public interface. [name='eth0:1', type=1, ip=10.xx.xx.1, mac=fa-16-3e-77-xx-xx, net=10.xx.xx.0/24, mask=255.255.255.0, use=public/1]Picked latch-free SCN scheme 3
--这里全部都是数据库参数相关设置Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0SYS auditing is disabledStarting up:Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit ProductionWith the Partitioning, Real Application Clusters, OLAP, Data Miningand Real Application Testing options.ORACLE_HOME = /app/product/11.2.0/dbSystem name: LinuxNode name: JiekeXu-rac1Release: 3.0.76-0.11-defaultVersion: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)Machine: x86_64VM name: VMWare Version: 6Using parameter settings in server-side pfile /app/product/11.2.0/db/dbs/initJiekeXuadg1.oraSystem parameters with non-default values: processes = 2000 sessions = 3024 event = "28401 TRACE NAME CONTEXT FOREVER, LEVEL 1" event = "10949 TRACE NAME CONTEXT FOREVER" spfile = "+DATA/JiekeXuadg/parameterfile/spfile.256.1046662725" sga_target = 7200M control_files = "+DATA/JiekeXuadg/controlfile/current.257.995197687" db_file_name_convert = "+DATA" db_file_name_convert = "+DATA" log_file_name_convert = "+DATA" log_file_name_convert = "+DATA" db_block_size = 8192 compatible = "11.2.0.4.0" log_archive_dest_1 = "location=+ARCH valid_for=(all_logfiles,all_roles) db_unique_name=JiekeXuadg" log_archive_dest_2 = "service=JiekeXu LGWR ASYNC valid_for=(online_logfiles,primary_role) db_unique_name=JiekeXu" log_archive_dest_state_1 = "enable" log_archive_dest_state_2 = "enable" fal_client = "JiekeXuadg" fal_server = "JiekeXu" log_archive_config = "dg_config=(JiekeXuadg,JiekeXu)" log_archive_format = "%t_%s_%r.dbf" log_archive_max_processes= 10 db_files = 1024 cluster_database = TRUE db_create_file_dest = "+DATA" db_recovery_file_dest = "" db_recovery_file_dest_size= 30G standby_file_management = "AUTO" thread = 1 undo_tablespace = "UNDOTBS1" undo_retention = 1440 instance_number = 1 remote_login_passwordfile= "EXCLUSIVE" db_domain = "" service_names = "JiekeXu" dispatchers = "(PROTOCOL=TCP) (SERVICE=JiekeXuXDB)" remote_listener = "db-cluster-scan:1521" session_cached_cursors = 100 job_queue_processes = 15 audit_file_dest = "/app/oracle/admin/JiekeXu/adump" audit_trail = "NONE" db_name = "JiekeXu" db_unique_name = "JiekeXuadg" open_cursors = 500 pga_aggregate_target = 2397M deferred_segment_creation= TRUE enable_ddl_logging = TRUE diagnostic_dest = "/app/oracle"Cluster communication is configured to use the following interface(s) for this instance 169.254.111.0cluster interconnect IPC version:Oracle UDP/IP (generic)IPC Vendor 1 proto 2Fri Oct 16 22:11:16 2020PMON started with pid=2, OS id=8557 Fri Oct 16 22:11:16 2020PSP0 started with pid=3, OS id=8559 Fri Oct 16 22:11:17 2020VKTM started with pid=4, OS id=8748 at elevated priorityVKTM running at (1)millisec precision with DBRM quantum (100)ms

这种情况还真是比较少见的,于是只能通过搜索引擎来查看相关问题了。MOS 中 Doc ID 1357597.1 给出了明确解释:

In an Active Data Guard RAC standby, if the redo apply instance crashes in the middle of media recovery, it leaves the RAC cache fusion locks on the surviving instances and the data files on disk in an in-flux state. In such a situation, queries on the surviving instances can potentially see inconsistent data. To resolve this in-flux state, the entire standby database is closed. Upon opening the first instance after such a close, the buffer caches and datafiles are made consistent again.翻译下来就是:在 Active Data Guard RAC 备库中,如果 redo apply 实例在恢复过程中崩溃,它会使幸存实例和磁盘上的数据文件上的 RAC 缓存融合锁定处于通量状态。在这种情况下,对幸存实例的查询可能会看到不一致的数据。要解决此处于流动状态,将关闭整个备用数据库。在关闭后打开第一个实例时,缓冲区缓存和数据文件再次保持一致。

大概意思就是说,如果 apply redo 应用日志的实例进程异常终止后,其它所有OPEN READ ONLY 的实例会 close, 因为在 RAC ADG 环境中,如果实例在应用日志过程中中断 crash, 会把CACHE FUSION 的锁留到残留幸存的实例中,会导致数据查询不一致,因次需要关闭数据库,重新打开来保证 buffer cache 和 datafile 的一致状态。如果配置了 DG Broker 这个操作可以自动完成, 版本大于11.2.0.2,如果没有配置 DG broker,手动方式直接 open 就可以了,接着手动执行应用日志命令,不需要重启实例,继续在幸存的节点上应用日志。


在墨天轮中,Anbob 大神写到从 12.1 版本引入了新特性 ”ADG instance recovery” ,解决的是当 redo apply instance crash 时,影响其它实例也 close 问题,从 12.1 以后保存下来的 ADG 实例会自动做 adg instance recovery ,保证数据一致性。11204 中要安装了较新的 PSU,修复了 bug 18331944 和 19516448,同时再配置隐藏参数””_adg_instance_recovery=TRUE””。关于隐含参数介绍可查看我之前的一篇文章:

Oracle 隐含参数查看
https://www.modb.pro/db/24621
SQL> col name for a30SQL> col value for a30SQL> col DESCRIPTION for a70SQL> set line 234SQL> select a.ksppinm name, b.ksppstvl value,a.ksppdesc description from x$ksppi a, x$ksppcv b where a.indx = b.indx and a.ksppinm like '_adg_instance_recovery';
NAME VALUE DESCRIPTION------------------------------ ------------------------------ ----------------------------------------------------------------------_adg_instance_recovery         FALSE                          enable ADG instance recovery


EXPLANATION--------------
Before this feature (prior to 12.1), when the ADG apply instance crashed in the middle of applying changes (standby media recovery session is on-going), all remaining open instances will be closed.
But from 12.1 when the apply instance crashed in the middle of applying changes, one of the remaining open instances will be automatically posted to do "ADG instance recovery", after the ADG instance recovery.We can see this, ADG instance recovery by checking the alert log, for the messages like "Beginning ADG Instance Recovery" and "Completed ADG Instance Recovery". If DG broker is enabled then Broker will start the MRP on any of the surviving instances.
Please note that the new feature is enabled by default in 12.1. We backported it to 11.2.0.4, you can enable it on 11.2.0.4 by setting "_adg_instance_recovery=TRUE". The default behaviour on 11.2.0.4 is to close all remaining open instances
翻译:在此特性之前(12.1之前),当 ADG apply 实例在应用更改的过程中崩溃(备用媒体恢复会话正在进行)时,所有剩余的打开实例将被关闭。 但是从 12.1 开始,当应用实例在应用更改的过程中崩溃时,在 ADG 实例恢复之后,剩余的一个打开的实例将被自动提交去做“ADG实例恢复”。我们可以通过查看警报日志看到 ADG 实例恢复,对于“开始ADG实例恢复”和“完成ADG实例恢复”这样的消息。如果 DG broker 是启用的,那么 broker 将在任何存活的实例上启动 MRP。请注意,12.1 中的新特性是默认启用的。我们将它向后移植到 11.2.0.4,您可以通过设置 “_adg_instance_recovery=TRUE” 在 11.2.0.4 上启用它。 11.2.0.4 的默认行为是关闭所有其余打开的实例。

下面是一个 MOS 上测试案例分享一下:

Test case------------
I have 2 node primary and 3 node physical standby(instance number 2,4 and 5).
MRP running on instance 2 and other instances are in open read only.
1. On instance 2,

SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;
OPEN_MODE INSTANCE_NUMBER-------------------- ---------------READ ONLY WITH APPLY 2
SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';

PROCESS STATUS SEQUENCE# THREAD#--------- ------------ ---------- ----------MRP0 WAIT_FOR_LOG 22 1
SQL> show parameter _adg
NAME TYPE VALUE------------------------------------ ----------- ------------------------------_adg_instance_recovery boolean TRUE
2. On instance 4,
SQL> select process,status,sequence#,thread# from v$managed_standby where process like '%MRP%';
no rows selected
SQL> show parameter _adg
NAME TYPE VALUE------------------------------------ ----------- ------------------------------_adg_instance_recovery boolean TRUE
3. Shut abort instance 2.
SQL> shut abortORACLE instance shut down.SQL> exit
4. Instance 4 still open_read only.
SQL> select d.open_mode,i.instance_number from v$database d, v$instance i;
OPEN_MODE INSTANCE_NUMBER-------------------- ---------------READ ONLY 4 NOTE :1. As said before if DG broker configured then Broker will start MRP on any one of the available instances.2. If you want to use ADG instance recovery feature on 11.2.0.4 (including 11.2.0.4 BP or PSUs), please make sure the following patches for following fixes are present. The ADG instance recovery feature is not usable without those fixes.- Bug fix for 18331944.- Bug fix for 19516448.

当然,我的生产灾备环境是没有相关补丁和参数设置的,PSU 11.2.0.4.180717 过旧,故会产生此类问题。


参考链接:

https://www.modb.pro/db/32021

Active Data Guard RAC Standby – Apply Instance Node Failure Impact (Doc ID 1357597.1)

Behavior of Active Dataguard(ADG) When Apply Node Aborts/Crash (Doc ID 1613719.1)

今天的分享就到这里了,如果本文对您有一丁点儿帮助,请多支持“在看”与转发,不求小费了哪怕是一个小小的赞,您的鼓励都将是我熬夜写文章最大的动力,让我有一直写下去的动力,最后一起加油,奥利给



Oracle 12c 及以上版本补丁更新说明及下载方法(收藏版)

Oracle 11.2.0.4 RAC 最新补丁下载(11.2.0.4.200714)

案例分享|Oracle 11g RAC 数据库连接数过高处理办法

11g RAC 在线存储迁移实现 OCR 磁盘组完美替换

如何通过 Shell 监控异常等待事件和活跃会话 

我的 OCM 之路|书写无悔青春追梦永不止步

Oracle 19c 之多租户 PDB 连接与访问(三)

案例:RMAN 备份控制文件报错 ORA-00230

Oracle 12C 最新补丁下载与安装操作指北

DBA 常用的软件工具有哪些(分享篇)?

Oracle 11g 临时表空间管理

Oracle 每日一题系列合集


一键三连“分享、在看与点赞”,给我充点儿电吧~


继续滑动看下一个

震惊:Oracle 11gR2 RAC ADG 并没有高可用

JiekeXu JiekeXu DBA之路
向上滑动看下一个

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存