How to Avoid Interruptions When You Patch Oracle Grid Infrastructure 19c

In previous blog posts, I have shown you how to patch Oracle Grid Infrastructure 19c (GI). When you patch GI, it happens in a rolling manner, node by node. The database stays up throughout, but each instance is temporarily down while its node is patched.

If my application is connected to an instance going down, how do I ensure my application is not interrupted?

My demo system:

  • Is a 2-node RAC. The nodes are named copenhagen1 and copenhagen2.
  • Database is called CDB1_COPENHAGEN
  • PDB is called SALES

Connect Using A Custom Service

  1. First, I create a dedicated service that my application must use:

    [oracle@copenhagen1]$ srvctl add service \
       -db CDB1_COPENHAGEN \
       -pdb SALES \
       -service SALESGOLD \
       -preferred CDB11,CDB12 \
       -failover_restore AUTO \
       -failoverretry 1 \
       -failoverdelay 3 \
       -commit_outcome TRUE \
       -failovertype AUTO \
       -replay_init_time 600 \
       -retention 86400 \
       -notification TRUE \
       -drain_timeout 300 \
       -stopoption IMMEDIATE
    
    • I want to use all nodes in my cluster for this service, so I set preferred to all instances of my database.
    • I set failovertype AUTO and failover_restore AUTO to enable Transparent Application Continuity (TAC) for the service.
    • When I stop the service, the sessions have 5 minutes to move to another instance as specified by drain_timeout.
    • If the sessions don’t drain in due time, the stopoption parameter specifies that the remaining sessions are killed immediately.
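    • After creating the service, I can double-check the settings with srvctl config service (a quick verification, not a required step):

    [oracle@copenhagen1]$ srvctl config service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD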
  2. Now I start the service:

    [oracle@copenhagen1]$ srvctl start service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD
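
    • I can confirm that the service is now offered on both instances:

    [oracle@copenhagen1]$ srvctl status service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD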
    
  3. I add an alias called SALESGOLD to tnsnames.ora:

    SALESGOLD =
    (DESCRIPTION =
       (CONNECT_TIMEOUT=90)(RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)
       (ADDRESS_LIST =
          (LOAD_BALANCE=on)
          (ADDRESS = (PROTOCOL = TCP)(HOST=copenhagen-scan)(PORT=1521))
       )
       (CONNECT_DATA=
          (SERVICE_NAME = SALESGOLD)
       )
    )
    
    • I connect to the SALESGOLD service using the SERVICE_NAME attribute.
    • I connect via the SCAN listener.
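    • Before pointing the application at the alias, I can sanity-check it with SQL*Plus and ask the session which service it landed on (app_user is a hypothetical schema in the SALES PDB):

    [oracle@copenhagen1]$ sqlplus app_user@SALESGOLD

    SQL> SELECT SYS_CONTEXT('USERENV', 'SERVICE_NAME') FROM DUAL;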
  4. I configure my application to connect via the SALESGOLD alias.

You should always configure your application to connect to a custom service. The database comes with default services, but they cannot be configured with high-availability attributes like draining and failover, so you should not use them for your application. Custom services are the way to go if you want a highly available system.

Let’s Patch

It doesn’t matter whether you use in-place or out-of-place patching. If you use out-of-place patching, do as much of the preparation as possible in advance. The procedure below starts right before a node goes down.

  1. First, I want to patch GI on copenhagen1.
  2. I stop my service SALESGOLD on that node only:
    [oracle@copenhagen1]$ srvctl stop service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD \
       -instance CDB11 \
       -force
    
    • I use the instance parameter to stop the service only on the node I am about to patch. In this case, CDB11 is the name of the instance of my database CDB1_COPENHAGEN running on copenhagen1.
    • GI marks the service as OFFLINE on copenhagen1, and new connections will be directed to the other node.
    • GI sends out a FAN event telling clients that the instance is going down. The Oracle driver on the clients detects the event.
    • When I created the service, I set drain_timeout to 300 seconds. This gives database sessions 5 minutes to finish their work.
      • If desired, I can override that value in the stop command using -drain_timeout.
    • If a session becomes idle before the timeout, the driver on the client automatically reconnects to the other node. The session continues on the other node; no error message is displayed.
    • If a session is still active after the timeout, it is killed immediately, as I defined with the stopoption parameter when I created the service.
      • I enabled TAC on my database service with the failovertype and failover_restore parameters. TAC detects that the session was killed and attempts to intervene.
      • If my application is fit for TAC, a new connection is made, any in-flight transaction is replayed in the new session; no error message is displayed.
      • If my application is not fit for TAC or if I was doing something that is not replayable, my session is terminated, and my application receives an error.
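    • A simple way to watch draining in action: keep a test session connected through SALESGOLD and ask it which instance it is on before and after the stop (this test session is purely illustrative, not part of the procedure):

    SQL> SELECT SYS_CONTEXT('USERENV', 'INSTANCE_NAME') FROM DUAL;

    -- Before the stop this returns CDB11; once the session has drained,
    -- the same query should return CDB12, with no error in between.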
  3. Now, all sessions using the SALESGOLD service have been drained from copenhagen1, and I can bring down the GI stack to patch.
    • I use opatchauto apply for in-place patching.
    • Or, I use opatchauto apply -switch-clone for out-of-place patching.
    • Or, I use root.sh for out-of-place patching with switchGridHome.
    • Or, whatever method suits my environment. It doesn’t matter.
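    • As an illustration, the in-place variant might look like this (the staged patch path and Grid home are hypothetical; use your own):

    [root@copenhagen1]# opatchauto apply /u01/stage/gi_ru_patch \
       -oh /u01/app/19.0.0/grid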
  4. When the GI stack is restarted, my service SALESGOLD is automatically restarted. Now, my application can connect to the instance on copenhagen1 again.
  5. I ensure that the service is started:
    [grid@copenhagen1]$ crsctl stat resource -t
    
  6. I repeat steps 2-5 for the second node, copenhagen2.

My Application Can’t Use Application Continuity

If, for some reason, you are not ready to use Application Continuity or Transparent Application Continuity yet, you can still use this procedure. Change the failover_restore setting on the service to something that suits your application. You can still perform draining, but you need another way of dealing with the sessions that don’t drain in due time.
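For example, I could switch the service to plain connection failover without state restoration (a sketch; choose the failover_restore value that matches your application):

    [oracle@copenhagen1]$ srvctl modify service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD \
       -failover_restore NONE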

One option is to set the drain_timeout parameter high enough to allow everyone to finish their work and reconnect to another node. Just be aware that the other node must be able to handle the entire workload during that period.
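Raising the limit is a one-line change; here I give sessions 15 minutes instead of 5 (900 seconds is an arbitrary example value):

    [oracle@copenhagen1]$ srvctl modify service \
       -db CDB1_COPENHAGEN \
       -service SALESGOLD \
       -drain_timeout 900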

If draining doesn’t work for you either (or the patch is non-rolling), you must take an outage to patch. You can follow the Minimum Downtime approach described in Rolling Patch – OPatch Support for RAC (Doc ID 244241.1):

  • Shut down GI stack on node 1
  • Patch node 1
  • Shut down GI stack on node 2
  • Patch node 2
  • Shut down GI stack on node 3
  • Start up node 1 and node 2
  • Patch node 3
  • Start up node 3
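
Each “shut down” and “start up” step above maps to crsctl commands run as root on the node in question; a minimal sketch (the actual patch step depends on the tool you use):

    [root@copenhagen1]# crsctl stop crs      # GI stack down on this node
    # ... patch the node ...
    [root@copenhagen1]# crsctl start crs     # GI stack back up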

Happy Patching!


2 thoughts on “How to Avoid Interruptions When You Patch Oracle Grid Infrastructure 19c”

  1. Hi Daniel. Thanks for the blog.
    One question: do we need both the settings in the service definition and those in tnsnames.ora?
    And a second question: is Application Continuity the same as TAC in your demo?

  2. Hi Williams,

    Thanks for the feedback. Much appreciated.

    In the service definition, you are turning “things” on. In the connect string (tnsnames.ora), you define certain timeout values. The critical part is the service definition. There are defaults for the timeouts in the connect string, but you should change them. The technical brief I am referencing has more details on the settings.

    AC should be able to do the same as TAC. I am not an expert on this, so please take that into account. With AC, you have much more control over the replay when it happens. With TAC, things are handled for you, but you get less control, so it might not be suitable for all applications. Again, have a look at the technical brief. It discusses the two features.

    Regards,
    Daniel
