Replication Troubleshooting: Log Reader Latency

1. Log Reader Latency script
The script provided below should be run on the publisher server, in the publishing database. It shows DBAs where the Log Reader is in the transaction log and how far behind it is.

-- Find Log Reader Agent position in time and latency
-- RUN ON PUBLISHER SERVER
USE csn_product -- Set publication database

DECLARE @LSN NVARCHAR(25)
DECLARE @SPID INT

IF OBJECT_ID('tempdb..#temp_dbcc_opentran') IS NOT NULL
    DROP TABLE #temp_dbcc_opentran

CREATE TABLE #temp_dbcc_opentran (RowName SYSNAME, Value sql_variant)

-- Capture the DBCC OPENTRAN output as rows
INSERT INTO #temp_dbcc_opentran EXEC ('DBCC OPENTRAN WITH TABLERESULTS, NO_INFOMSGS');

-- Oldest LSN that has not yet been delivered to the distributor
SELECT @LSN = CAST(Value AS NVARCHAR(128)) FROM #temp_dbcc_opentran WHERE RowName = 'REPL_NONDIST_OLD_LSN'

-- SPID of the oldest active transaction (strip the trailing 's')
SELECT @SPID = REPLACE(CAST(Value AS NVARCHAR(128)),'s','') FROM #temp_dbcc_opentran WHERE RowName = 'OLDACT_SPID'

IF @LSN = '(0:0:0)'
BEGIN
    PRINT 'REPL_NONDIST_OLD_LSN = (0:0:0)'
    PRINT 'Please try again'
END
ELSE
    -- fn_dblog expects the LSN without the surrounding parentheses
    SELECT @SPID AS [Open transaction SPID] ,
           [Begin Time] ,
           DATEDIFF(MINUTE, [Begin Time], GETDATE()) AS [Log reader latency (Minutes)]
    FROM ::fn_dblog(REPLACE(REPLACE(@LSN,'(',''),')',''),REPLACE(REPLACE(@LSN,'(',''),')',''))
    WHERE [Begin Time] IS NOT NULL

-- Drop only this session's temp table (a LIKE match on tempdb.sys.tables can also hit other sessions' tables)
IF OBJECT_ID('tempdb..#temp_dbcc_opentran') IS NOT NULL
    DROP TABLE #temp_dbcc_opentran

PRINT @SPID

IF @SPID IS NOT NULL
    EXEC sp_who2 @SPID

This query will return the latency by session.

To find the query causing the latency, run EXEC sp_who3 SPID, if sp_who3 is installed (it is a community stored procedure and does not ship with SQL Server).
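Where sp_who3 is not available, the same information can usually be pulled from the built-in DMVs. The following is a minimal sketch, assuming a SQL Server version that includes sys.dm_exec_input_buffer (2014 SP2 / 2016 or later) and a login with VIEW SERVER STATE. The SPID value is a hypothetical placeholder for the one returned by the script above.

```sql
-- Hypothetical SPID; replace with the value printed by the latency script
DECLARE @SPID INT = 54;

-- Show who owns the session and the last statement it submitted
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       s.status,
       ib.event_info AS last_statement
FROM sys.dm_exec_sessions AS s
CROSS APPLY sys.dm_exec_input_buffer(s.session_id, NULL) AS ib
WHERE s.session_id = @SPID;
```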

2. Check Log Reader Agent latency based on the MSlogreader_history table

On the distribution database, check whether any recent errors are recorded in the MSlogreader_history table.

USE [DistributorDatabase]

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Latest history row per Log Reader Agent, with status and latency
SELECT s.srvname AS PublisherServer,
       ma2.publisher_db AS PublisherDatabase,
       ma2.name AS JobName,
       CASE mh1.runstatus
           WHEN 1 THEN 'Start'
           WHEN 2 THEN 'Succeed.'
           WHEN 3 THEN 'In progress.'
           WHEN 4 THEN 'Idle.'
           WHEN 5 THEN 'Retry.'
           WHEN 6 THEN 'Fail'
       END AS Status,
       (mh1.delivery_latency / 1000) AS LatencySeconds, -- delivery_latency is stored in milliseconds
       mh1.Comments
FROM MSlogreader_history mh1 WITH (NOLOCK)
INNER JOIN (SELECT mh1.agent_id, MAX(mh1.time) AS maxtime
            FROM MSlogreader_history mh1 WITH (NOLOCK)
            JOIN MSlogreader_agents ma WITH (NOLOCK) ON ma.id = mh1.agent_id
            GROUP BY mh1.agent_id) AS mh2 ON mh1.agent_id = mh2.agent_id AND mh1.time = mh2.maxtime
INNER JOIN MSlogreader_agents ma2 WITH (NOLOCK) ON ma2.id = mh2.agent_id
INNER JOIN MSreplservers s WITH (NOLOCK) ON ma2.publisher_id = s.srvid
ORDER BY 1;

3. Determine any open long running transactions

On the publisher server/database, determine whether there are any long-running transactions. Long-running transactions can generate a lot of data modifications, which leads to larger log files that the Log Reader has to read through.
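One way to list open transactions is through the transaction DMVs. This is a minimal sketch, assuming VIEW SERVER STATE permission. The 5-minute threshold is an arbitrary illustration, not a standard from this document.

```sql
-- Run on the publisher: sessions holding transactions open for more than 5 minutes
SELECT st.session_id,
       tat.transaction_id,
       tat.name AS transaction_name,
       tat.transaction_begin_time,
       DATEDIFF(MINUTE, tat.transaction_begin_time, GETDATE()) AS open_minutes,
       es.login_name,
       es.host_name,
       es.program_name
FROM sys.dm_tran_active_transactions AS tat
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = tat.transaction_id
JOIN sys.dm_exec_sessions AS es
    ON es.session_id = st.session_id
WHERE tat.transaction_begin_time < DATEADD(MINUTE, -5, GETDATE()) -- assumed threshold
ORDER BY tat.transaction_begin_time;
```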

4. Determine if any blocking occurred
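A minimal sketch of one way to check for current blocking, assuming VIEW SERVER STATE permission. It only shows requests blocked at the moment it runs, so blocking that has already cleared will not appear.

```sql
-- Run on the publisher (and distributor if needed): requests currently blocked by another session
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.wait_resource,
       DB_NAME(r.database_id) AS database_name,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```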

5. Review any SQL Agent jobs that may be contributing to log file growth (for example, index maintenance)
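As a starting point, the jobs that are running right now can be listed from msdb. This is a sketch, assuming read access to the msdb job tables. Whether a given job is actually responsible for the log growth still has to be judged from what it does.

```sql
-- SQL Agent jobs currently executing (started but not yet stopped in the latest Agent session)
SELECT j.name AS job_name,
       ja.start_execution_date,
       DATEDIFF(MINUTE, ja.start_execution_date, GETDATE()) AS running_minutes
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = ja.job_id
WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL
ORDER BY ja.start_execution_date;

-- Current log size and percentage used for every database
DBCC SQLPERF (LOGSPACE);
```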

6. Terminate Replication Monitor connections to the publisher's distributor server

7. Change the Log Reader's profile to Cleanup

8. Using DPA, determine whether there are any performance issues on either the Publisher or the Distributor

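9. Check for the message: Replicated transactions are waiting for next Log backup or for mirroring partner to catch up.
This message means the Log Reader is waiting, either for the next transaction log backup or for a mirroring partner to catch up, before it reads further.
Trace flag 1448 lets the Log Reader move forward without waiting for asynchronous partners to acknowledge the changes. Enable it to fix the issue: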

-- Requires SQLCMD mode
:CONNECT server

-- Check trace flags enabled
DBCC TRACESTATUS (-1)

-- Enable trace flag 1448 globally
DBCC TRACEON (1448, -1);

-- Confirm the trace flag is now active
DBCC TRACESTATUS (-1)
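The message also mentions log backups. As a sketch, one way to see whether the publication database is set to wait for log backups is the IsSyncWithBackup database property. The database name csn_product is taken from the script in section 1 and is an assumption here. Note also that DBCC TRACEON does not survive a service restart, so a flag that must persist is normally added as a -T startup parameter.

```sql
-- 1 = 'sync with backup' is on, so the Log Reader waits for log backups
-- 0 = option is off
SELECT DATABASEPROPERTYEX(N'csn_product', 'IsSyncWithBackup') AS IsSyncWithBackup;
```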

MSrepl_commands alert. When this table grows large, it usually means something is wrong with replication (Distribution Agents stopped, Distribution Agents failing with errors, incorrect publication settings, and so on).
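To judge how big the table actually is, row counts and reserved space can be read from the partition statistics without scanning the table. This is a minimal sketch. [DistributorDatabase] is the same placeholder used in section 2 and must be replaced with the real distribution database name.

```sql
USE [DistributorDatabase]

-- Approximate row count and reserved space of the base tables (heap or clustered index only)
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       SUM(ps.row_count) AS row_count,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id IN (OBJECT_ID(N'dbo.MSrepl_commands'), OBJECT_ID(N'dbo.MSrepl_transactions'))
  AND ps.index_id IN (0, 1)
GROUP BY ps.object_id;
```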

The distribution database acts as a holding area for transactions waiting to be applied to all the subscribers. Transactions are fed into the distribution database by the Log Reader job, which reads them from the transaction log on the publisher. Once all the subscribers have received the transactions, the Distribution Cleanup job (which runs on distributor servers) goes through the distribution database and deletes any transactions that have been applied.

There are a few reasons why a Distribution Cleanup job may not delete transactions:

Distribution Cleanup job is disabled

Subscriptions are failing due to any number of reasons

A pull Distribution Agent is stopped on a subscriber

A publication's allow_anonymous and immediate_sync properties are set to true

Errors being reported in the MSrepl_errors table
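Two of the causes above can be checked directly in the distribution database. The following is a sketch, with [DistributorDatabase] again standing in for the real distribution database name. The TOP (20) limit is arbitrary.

```sql
USE [DistributorDatabase]

-- Publications that keep commands for the full retention period
SELECT publisher_db,
       publication,
       allow_anonymous,
       immediate_sync
FROM dbo.MSpublications
WHERE allow_anonymous = 1
   OR immediate_sync = 1;

-- Most recent replication errors
SELECT TOP (20)
       time,
       source_name,
       error_code,
       error_text
FROM dbo.MSrepl_errors
ORDER BY time DESC;
```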

Resolution

Restart the Distribution Cleanup job

Resolve all errors with subscriptions

Restart the pull Distribution Agent on the subscriber

Set the publication's allow_anonymous and immediate_sync properties to false

SQL Server Standard Replication Options

Resolve any additional errors that were reported in the MSrepl_errors table
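The publication settings can be changed with sp_changepublication. This is a minimal sketch. The publication name MyPublication is hypothetical, and csn_product is assumed to be the publication database as in section 1. allow_anonymous is changed first because immediate_sync cannot be set to false while allow_anonymous is still true.

```sql
-- Run on the publisher, in the publication database
USE csn_product

EXEC sp_changepublication
     @publication = N'MyPublication', -- hypothetical publication name
     @property    = N'allow_anonymous',
     @value       = N'false';

EXEC sp_changepublication
     @publication = N'MyPublication', -- hypothetical publication name
     @property    = N'immediate_sync',
     @value       = N'false';
```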

Informational: Distribution Cleanup Job

Runs on the distributor server

One cleanup agent per distribution database

Agent category: REPL-Distribution Cleanup

Runs every 10 minutes

Deletes data in batches of 5,000 for transactions and 2,000 for commands

You can adjust the batch sizes via SSMS
On the distributor server, right-click on 'Replication' in the object explorer and select 'Distributor Properties...'
In the Distribution database section, click on the properties option for the distribution database you want to modify

Deletes data from both the MSrepl_commands and MSrepl_transactions tables
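To confirm that the cleanup job exists, is enabled, and has been running, the job can be looked up by its category in msdb. This is a sketch, assuming read access to the msdb job tables.

```sql
-- Run on the distributor: cleanup jobs and whether they are enabled
SELECT j.name AS job_name,
       j.enabled,
       c.name AS category_name
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.syscategories AS c
    ON c.category_id = j.category_id
WHERE c.name = N'REPL-Distribution Cleanup';

-- Recent job outcomes (step_id 0 is the job-level row; run_status 1 = succeeded, 0 = failed)
SELECT TOP (10)
       j.name AS job_name,
       h.run_date,
       h.run_time,
       h.run_duration,
       h.run_status,
       h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
JOIN msdb.dbo.syscategories AS c
    ON c.category_id = j.category_id
WHERE c.name = N'REPL-Distribution Cleanup'
  AND h.step_id = 0
ORDER BY h.instance_id DESC;
```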

Some standards:

Min retention = 0 hours
Max retention = 48 hours
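The current retention values can be checked and changed with the distributor stored procedures. This is a sketch. The database name distribution is an assumption (it is the default name) and should be replaced with the actual distribution database.

```sql
-- Run on the distributor: shows min_distretention and max_distretention (in hours) per distribution database
EXEC sp_helpdistributiondb;

-- Apply the standard values listed above
EXEC sp_changedistributiondb
     @database = N'distribution', -- assumed distribution database name
     @property = N'min_distretention',
     @value    = N'0';

EXEC sp_changedistributiondb
     @database = N'distribution', -- assumed distribution database name
     @property = N'max_distretention',
     @value    = N'48';
```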


Troubleshooting msrepl_commands alert

POSSIBLE ISSUES TO BE INVESTIGATED: