Dealing With Conflicts - Part One

A few months back one of our team came to me with questions around conflicts in logical replication, and Dbvisit Replicate’s built-in conflict handling capabilities. And while I work with the application every day (and we have a lot of material that I can refer to when needed!) the level of detail I needed to provide in this instance wasn’t top of my mind. So these questions forced me to go back and look in detail at this again. And while we have written on the topic before I hope that the information presented here might also be helpful to others.

Ok so, first of all, what is a conflict?

Very simply put this is a “situation” (an abnormal condition) that occurs on a target database that is being kept up to date by a logical replication mechanism. Changes attempted by the replication engine generate an error or, more commonly, the reconstructed SQL statements are unable to be applied there for some reason; ignoring this or forcing the changes through could cause the target data to diverge from that on the source database it is partnered with - and so trapping and processing these events is crucial! For large data sets; instantiating or baselining data, between source and target (a necessary step before starting the ongoing replication process) can be a time consuming exercise. Once configured you want to minimise any chance of the data skewing between the two sides. Furthermore, if the target database is used for reports or analysis then it is vital that this information accurately reflects the data back on the production system. For these reasons Dbvisit Replicate was built, from the ground up, with full conflict handling and resolution mechanisms. By default Dbvisit Replicate is very cautious when it comes to dealing with these conflict events - flagging them to the end user, pausing and requiring an administrator to review and decide on what is to be done next. Sometimes untangling conflict scenarios can be a complex affair, so oftentimes an assessment needs be made by someone with working knowledge of the character and nature of the data in the system itself. However, Dbvisit Replicate offers a host of mechanisms to enable the administrator to quickly process through these conflicts. In this first post we provide some of the background context, laying the platform for understanding how logical replication works. In part 2 we will take a look at the options for manual resolution with Dbvisit Replicate, and then in the third, detail the application’s inbuilt handlers and the options offered in terms of automatic conflict resolution.

The basics of logical replication.

In setting the foundation for this discussion it is important to understand the basics of Oracle logical database replication. The key takeaway here is that the changes made on a source database are actually not the same as that performed on the target - although the net result is equivalent. So why is this? It comes down to Oracle’s redo mechanism – and even for DBA’s who have worked with Oracle for a long time this can come as a surprise. As John Watson helpfully explains, redo is physically rather than logically, oriented. It is a record of the change vectors (database changes) related to physical locations in blocks, by the row ID pointer. Because it is a record of changes made at the block level (instructions for ‘replaying’ the modifications made to a block), single sql statements, which frequently affect more than row/record/block (batch) are broken down to this lower level of granularity. So, for example, a simple update statement on a table that affects more than one row, will actually be broken out into discrete, individual lines - and any values affected then hard-coded in the redo record responsible for reconstructing this change. Take the following table: Running the following simple update will make changes to 2 records:

update HR.EMPLOYEES set rating = 4 where department = 'HR'

This logical statement is broken down into two discrete sql statements in the redo stream, with values explicitly stated. In terms of logical replication it is this information which is interpreted, shipped and replayed on the target database - and the net effect of running both these statements on that target is the same as executing the original single sql statement on the source database. We can examine these redo records in a number of different ways, such as dumping the redo log to trace or utilising Oracle’s Log Miner, and the following extracts are output taken from Dbvisit Replicate’s parsed PLOG files. Here you can see the resultant individual redo records, and ascribed values contained with them:

SEQ# = 10004114 / FRAG# = 0 (last frag)- TIME = 2015.12.08 22:48:22- TIME(THL) = 2015.12.08 23:14:00- SCHEMA = HR- TRANSACTION = 000a.00c.0000062e- SQL(0) = update "HR"."EMPLOYEES" set "RATING" = 4 where (1=1) and "RATING" = 3 and "ID" = 1  - ACTION = UPDATE  - SCHEMA = HR  - TABLE = EMPLOYEES  - ROW# = 0  - COL(:1) = 4  - COL(:2) = 3  - COL(:3) = 1   SEQ# = 10004116 / FRAG# = 0 (last frag) - TIME = 2015.12.08 22:48:22 - TIME(THL) = 2015.12.08 23:14:00 - SCHEMA = HR - TRANSACTION = 000a.00c.0000062e - SQL(0) = update "HR"."EMPLOYEES" set "RATING" = 4 where (1=1) and "RATING" = 5 and "ID" = 3  - ACTION = UPDATE  - SCHEMA = HR  - TABLE = EMPLOYEES  - ROW# = 0  - COL(:1) = 4  - COL(:2) = 5  - COL(:3) = 3

   Dbvisit Replicate - categories of conflicts.

Now that we have set the scene, as it were, we move to the context of working with Dbvisit Replicate – and in this we identify four basic categories of conflict scenarios, as discussed in the online user guide:

  1. SQL affects zero rows

.

This causes data divergence and so a conflict is raised. In this a SQL statement we are attempting to apply to the target database does not find a match, according to the hard coded values (the where clause parameters) that existed when it was generated on the source; so it cannot be processed through. One common cause of this might be that the actual target data (those rows we are trying to apply the changes to) have been changed independently of the source data. With a logical replication system (unlike a physical standby database) this is possible as the target database is an open read/write entity, unless otherwise restricted. 

  1. SQL affects more than one row.

This causes data divergence and so a conflict is raised. The opposite of #1 - here a statement finds multiple records will be affected by a sql statement to be run on the target, when it altered only a single row back on the source. This can occur, for example, if a target table has constraints disabled (PK is “off”) and duplicate rows contained within those tables.

  1. Oracle error.

This can range from usual primary key or foreign key violation (another type of data divergence) to purely technical reasons (cannot extend datafile).

  1. Lock timeout

.

If Dbvisit Replicate’s APPLY process waits for a row lock longer than the duration specified by the WATCHDOG_TIMEOUT parameter, a conflict is also reported.

Conflict detection.

Although it may seem counterintuitive to say so (especially if you hit them in production) a key point to make here is that identification of a conflict is a positive outcome. What has been identified is a condition which could have otherwise “corrupted” your target data set in some way, skewing it from the source, and creating data divergence. Letting these occurrences slide through, undetected, may minimise additional administration in the short term (not withstanding that applications and systems which are generating excessive conflicts should be examined thoroughly) but will ultimately hurt you in the end. Catching a conflict provides you with an opportunity to identify the root cause of the issue, and process through, keeping your data set ‘clean’.

Summary.

In the next post in this series we will turn our attention to the first two of these conflict categories, which are more common and perhaps more data related than #3 and #4, and discuss the options for manually resolving these conflicts with Dbvisit Replicate.