Availability Group Latency Troubleshooting

Overview

AG Send/Redo overview
Determine if latency is in Send Queue, or Redo Queue
Check for open transactions, including idle sessions
Check for long-running active queries
Check Version Store
Check for parallel redo

Check AG Latency at Primary

Run this on the Primary Replica, in order to get info about all of the replicas.

Running this on a secondary replica will give you information only about that secondary replica.

In [ ]:

-- ▶ Run this on **PRIMARY**

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT

--AG info

AGName = ag.name,

ReplicaServer = ar.replica_server_name,

DbName = adc.database_name,

-- Queue Size & rates

SendQueueSizeMB = drs.log_send_queue_size/1024,

RedoQueueSizeMB = drs.redo_queue_size/1024,

SendRateKb = drs.log_send_rate,

RedoRateKb = drs.redo_rate,

-- State Info

IsLocal = drs.is_local,

IsPrimary = drs.is_primary_replica,

SynchronizationState = drs.synchronization_state_desc,

SynchronizationHealthState = drs.synchronization_health_desc,

-- Last ____ Time

LastSentTime = drs.last_sent_time,

LastRedoTime = drs.last_redone_time,

LastReceivedTime = drs.last_received_time,

LastHardenedTime = drs.last_hardened_time,

LastCommitTime = drs.last_commit_time,

-- LSN info

RecoveryLSN = drs.recovery_lsn,

TruncationLSN = drs.truncation_lsn,

LastSentLSN = drs.last_sent_lsn,

LastReceivedLSN = drs.last_received_lsn,

LastHardenedLSN = drs.last_hardened_lsn,

LastRedoneLSN = drs.last_redone_lsn,

EndOfLogLSN = drs.end_of_log_lsn,

LastCommitLSN = drs.last_commit_lsn,

--Oh yeah, filestream, too

FileStreamSendRate = drs.filestream_send_rate

FROM sys.dm_hadr_database_replica_states AS drs

JOIN sys.availability_databases_cluster AS adc

ON drs.group_id = adc.group_id AND drs.group_database_id = adc.group_database_id

JOIN sys.availability_groups AS ag

ON ag.group_id = drs.group_id

JOIN sys.availability_replicas AS ar

ON drs.group_id = ar.group_id AND drs.replica_id = ar.replica_id

ORDER BY

-- ag.name, ar.replica_server_name, adc.database_name;

COALESCE(drs.log_send_queue_size,99999) + COALESCE(drs.redo_queue_size,99999) DESC;

Alternate check for Send Queue size

Sometimes the Perfmon counters & DMVs don't align 100%. This query will provide the higher of the two values.

▶ Run this on PRIMARY

In [ ]:

-- ▶ Run this on **PRIMARY**

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

WITH SendQueue AS (

SELECT ServerName = rcs.replica_server_name,

DbName = db_name( drs.database_id),

UnsentLogKb = COALESCE(drs.log_send_queue_size,0)

FROM sys.dm_hadr_database_replica_states drs

JOIN sys.dm_hadr_availability_replica_cluster_states rcs ON rcs.replica_id = drs.replica_id

WHERE drs.last_sent_time IS NOT NULL

UNION ALL

SELECT ServerName = @@SERVERNAME,

DbName = LTRIM(RTRIM(instance_name)),

UnsentLogKb = cntr_value

FROM sys.[dm_os_performance_counters]

WHERE object_name = 'SQLServer:Database Replica'

AND counter_name = 'Log Send Queue'

)

SELECT

ServerName,

DbName,

UnsentLogKB = MAX(UnsentLogKb)

FROM SendQueue

GROUP BY ServerName, DbName

ORDER BY UnsentLogKb DESC;

Check for open transactions on Primary & Secondary (including idle sessions)

Open transactions can affect both Send & Redo.

Open transactions that exist on either Primary or Secondary can affect the Redo & Send Queues, so you must check for open transactions on BOTH sides of the servers experiencing latency.

▶ Run this on PRIMARY AND SECONDARY

In [ ]:

-- ▶ Run this on **PRIMARY AND SECONDARY**

DECLARE

@DurationThreshold smallint = 1,

@OnlySleepingSessions bit = 0

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

--If we're in Debug mode, ignore @DurationThreshold parameter, Always use 1 minute.

DECLARE @Id int = 1,

@Spid int = 0,

@JobIdHex nvarchar(34),

@JobName nvarchar(256),

@Sql nvarchar(max),

@EmailFrom varchar(max),

@EmailBody nvarchar(max),

@EmailSubject nvarchar(255);

DROP TABLE IF EXISTS #OpenTrans;

CREATE TABLE #OpenTrans (

Id int identity(1,1) PRIMARY KEY,

Spid smallint,

BlockingSpid smallint,

TransactionLengthMinutes AS DATEDIFF(mi,TransactionStart,GETDATE()),

DbName sysname,

HostName nvarchar(128),

ProgramName nvarchar(128),

LoginName nvarchar(128),

LoginTime datetime2(3),

LastRequestStart datetime2(3),

LastRequestEnd datetime2(3),

TransactionCnt int,

TransactionStart datetime2(3),

TransactionState tinyint,

Command nvarchar(32),

WaitTime int,

WaitResource nvarchar(256),

SqlText nvarchar(max),

SqlStatement nvarchar(max),

InputBuffer nvarchar(4000),

SessionInfo xml

);

DROP TABLE IF EXISTS #InputBuffer;

CREATE TABLE #InputBuffer (

EventType nvarchar(30),

Params smallint,

EventInfo nvarchar(4000)

);

--Grab all sessions with open transactions

INSERT INTO #OpenTrans (Spid, BlockingSpid, DbName, HostName, ProgramName, LoginName, LoginTime, LastRequestStart,

LastRequestEnd, TransactionCnt, TransactionStart, TransactionState, Command, WaitTime, WaitResource, SqlText, SqlStatement)

SELECT s.session_id AS Spid,

r.blocking_session_id AS BlockingSpid,

COALESCE(db_name(dt.database_id),CAST(dt.database_id as nvarchar(10))) AS DbName,

s.host_name AS HostName,

s.program_name AS ProgramName,

s.login_name AS LoginName,

s.login_time AS LoginTime,

s.last_request_start_time AS LastRequestStart,

s.last_request_end_time AS LastRequestEnd,

-- Need to use sysprocesses for now until we're fully on 2012/2014

(SELECT TOP 1 sp.open_tran FROM master.sys.sysprocesses sp WHERE sp.spid = s.session_id) AS TransactionCnt,

--s.open_transaction_count AS TransactionCnt,

COALESCE(dt.database_transaction_begin_time,s.last_request_start_time) AS TransactionStart,

dt.database_transaction_state AS TransactionState,

r.command AS Command,

r.wait_time AS WaitTime,

r.wait_resource AS WaitResource,

COALESCE(t.text,'') AS SqlText,

COALESCE(SUBSTRING(t.text, (r.statement_start_offset/2)+1, (

(CASE r.statement_end_offset

WHEN -1 THEN DATALENGTH(t.text)

ELSE r.statement_end_offset

END - r.statement_start_offset)

/2) + 1),'') AS SqlStatement

FROM sys.dm_exec_sessions s

JOIN sys.dm_tran_session_transactions st ON st.session_id = s.session_id

JOIN sys.dm_tran_database_transactions dt ON dt.transaction_id = st.transaction_id

LEFT JOIN sys.dm_exec_requests r ON r.session_id = s.session_id

OUTER APPLY sys.dm_exec_sql_text (r.sql_handle) t

WHERE dt.database_transaction_state NOT IN (3) -- 3 means transaction has been initialized but has not generated any log records. Ignore it

AND (@OnlySleepingSessions = 0 OR s.status = 'sleeping')

AND COALESCE(dt.database_transaction_begin_time,s.last_request_start_time) < DATEADD(mi,-1*@DurationThreshold ,GETDATE());

-- Grab the input buffer for all sessions, too.

WHILE EXISTS (SELECT 1 FROM #OpenTrans WHERE InputBuffer IS NULL)

BEGIN

TRUNCATE TABLE #InputBuffer;

SELECT TOP 1 @Spid = Spid, @Id = Id

FROM #OpenTrans

WHERE InputBuffer IS NULL;

SET @Sql = 'DBCC INPUTBUFFER (' + CAST(@Spid AS varchar(10)) + ');';

BEGIN TRY

INSERT INTO #InputBuffer

EXEC sp_executesql @sql;

END TRY

BEGIN CATCH

PRINT 'InputBuffer Failed';

END CATCH

UPDATE b

SET InputBuffer = COALESCE((SELECT TOP 1 EventInfo FROM #InputBuffer),'')

FROM #OpenTrans b

WHERE ID = @Id;

END;

--Convert Hex job_ids for SQL Agent jobs to names.

WHILE EXISTS(SELECT 1 FROM #OpenTrans WHERE ProgramName LIKE 'SQLAgent - TSQL JobStep (Job 0x%')

BEGIN

SELECT @JobIdHex = '', @JobName = '';

SELECT TOP 1 @ID = ID,

@JobIdHex = SUBSTRING(ProgramName,30,34)

FROM #OpenTrans

WHERE ProgramName LIKE 'SQLAgent - TSQL JobStep (Job 0x%';

SELECT @Sql = N'SELECT @JobName = name FROM msdb.dbo.sysjobs WHERE job_id = ' + @JobIdHex;

EXEC sp_executesql @Sql, N'@JobName nvarchar(256) OUT', @JobName = @JobName OUT;

UPDATE b

SET ProgramName = LEFT(REPLACE(ProgramName,@JobIdHex,@JobName),128)

FROM #OpenTrans b

WHERE ID = @Id;

END;

-- Populate SessionInfo column with HTML details for sending email

-- Since there's a bunch of logic here, code is more readable doing this separate than mashing it in with the rest of HTML email creation

UPDATE t

SET SessionInfo = (SELECT TransactionState =

CASE TransactionState

WHEN 1 THEN 'The transaction has not been initialized.'

WHEN 3 THEN 'The transaction has been initialized but has not generated any log records.' -- We don�t alert on this status

WHEN 4 THEN 'The transaction has generated log records.'

WHEN 5 THEN 'The transaction has been prepared.'

WHEN 10 THEN 'The transaction has been committed.'

WHEN 11 THEN 'The transaction has been rolled back.'

WHEN 12 THEN 'The transaction is being committed. In this state the log record is being generated, but it has not been materialized or persisted.'

ELSE CAST(TransactionState as varchar)

END,

TransactionLengthMinutes = CONVERT(varchar(20),TransactionLengthMinutes,20),

SessionID = Spid,

DbName,

LoginName,

HostName,

DbName,

WaitResource,

LoginTime = CONVERT(varchar(20),LoginTime,20),

LastRequest = CONVERT(varchar(20),LastRequestStart,20),

ProgramName

FROM #OpenTrans t2

WHERE t2.id = t.id

FOR XML PATH ('Transaction') )

FROM #OpenTrans t;

--output results in debug mode:

IF NOT EXISTS (SELECT 1 FROM #OpenTrans)

SELECT 'No Open Transactions longer than ' + CAST(@DurationThreshold AS varchar(10)) + ' minutes exist' AS OpenTransactions;

ELSE

BEGIN

SELECT * FROM #OpenTrans ORDER BY TransactionLengthMinutes DESC;

END;

Check for long-running active queries

Like long-running transactions, long-running queries can affect both Send & Redo.

Long-running queries that exist on either Primary or Secondary can affect the Redo & Send Queues, so you must check for long-running queries on BOTH sides of the servers experiencing latency.

▶ Run this on PRIMARY AND SECONDARY

In [ ]:

-- ▶ Run this on **PRIMARY AND SECONDARY**

EXEC sp_whoisactive;

Check for bloated Version Store

Long-running active queries, or open transactions (evaluated in the prior two queries) will cause the Version Store in tempdb to become bloated. When the Version Store becomes very large, this can cause instance-wide performance issues.

▶ Run this on PRIMARY AND SECONDARY

In [ ]:

-- ▶ Run this on **PRIMARY AND SECONDARY**

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT

LongestTransactionRunTimeSeconds = CONVERT(int,cntr_value),

LongestTransactionRunTimeMinutes = CONVERT(int,cntr_value)/60.0,

EstimatedTransactionStartTime = DATEADD(SECOND,-1*COALESCE(CONVERT(int,cntr_value),0),GETDATE())

FROM sys.dm_os_performance_counters

WHERE counter_name = 'Longest Transaction Running Time';

SELECT

PerfmonCounter = c.counter_name,

CounterValue = c.cntr_value

FROM sys.dm_os_performance_counters AS c

WHERE c.counter_name LIKE 'Version%'

AND c.object_name = 'SQLServer:Transactions';

Check for databases in Recovery

Check the error log to determine if any databases are currently or recently going through crash recovery. Using the log here rather than current status to include past events.

▶ Run this on PRIMARY AND SECONDARY

In [ ]:

-- ▶ Run this on **PRIMARY AND SECONDARY**

EXEC xp_readerrorlog 0,1, N'Recovery of database';

TempDB space usage

In addition to viewing Version Store size by viewing Perfmon counters, you can look at TempDB allocation space.

Here, we can see how much space is allocated to Version Store, as well as user objects (ex #temp tables), internal objects (ex spools, hash tables), and other allocation information.

▶ Run this on PRIMARY AND SECONDARY

In [ ]:

-- ▶ Run this on **PRIMARY AND SECONDARY**

USE tempdb

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT

VersionStoreGB = SUM(fsu.version_store_reserved_page_count) *8.0/1024/1024,

UserObjectsGB = SUM(fsu.user_object_reserved_page_count) *8.0/1024/1024,

InternalObjectsGB = SUM(fsu.internal_object_reserved_page_count)*8.0/1024/1024,

UnallocatedGB = SUM(fsu.unallocated_extent_page_count) *8.0/1024/1024,

MixedExtentsGB = SUM(fsu.mixed_extent_page_count) *8.0/1024/1024

FROM sys.dm_db_file_space_usage AS fsu;

Check VLFs

The Redo / DB Startup process can be slowed down by an excessive number of VLFs.

If a database has more than 1000 VLFs, consider an action item to reduce the number of VLFs. However, you will not be able to fix the VLF count while AG latency is high.

▶ Run this on PRIMARY

In [ ]:

-- ▶ Run this on **PRIMARY**

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT

DbName = d.name,

AverageVlfMb = AVG(dli.vlf_size_mb),

LargestVlfMb = MAX(dli.vlf_size_mb),

VlfCount = COUNT(*)

FROM sys.databases AS d

CROSS APPLY sys.dm_db_log_info(d.database_id) AS dli

GROUP BY d.name

ORDER BY COUNT(*) DESC;

Check which DBs (if any) have parallel redo

Redo is usually single-threaded. SQL Server does support parallel redo, but only for a small number of databases. If this AG has many databases, only a few will benefit from parallel redo. Controlling which DBs will benefit from parallel redo is difficult (practically impossible), so generally assume that any given database will only have single-threaded redo.

▶ Run this on SECONDARY

In [ ]:

-- ▶ Run this on **SECONDARY**

SET NOCOUNT ON;

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT DbName = db_name(r.database_id),

RedoThreads = COUNT(*)

FROM sys.dm_exec_requests AS r

WHERE r.command = 'DB STARTUP'

GROUP BY r.database_id

ORDER BY COUNT(*) DESC;

In [ ]:

Parallel redo can be disabled by enabling Trace Flag 3459. Check if parallel redo has been disabled.

If Trace Flag 3459 is on (Status = 1) if Parallel Redo is disabled.

If Trace Flag 3459 is off (Status = 0) if Parallel Redo is enabled.

▶ Run this on SECONDARY

In [ ]:

-- ▶ Run this on **SECONDARY**

DBCC TRACESTATUS(3459);

Further Investigations

Check for changes in data volume / data velocity
- Was there a recent release?
- Did someone just upload a bunch of data?
- You could estimate changes in data velocity by looking at the size of historical transaction log backups
- A significant increase in data velocity will contribute to both send and redo latencies
Check for IO latency on the secondary server
- Write latency on the secondary log file will contribute to send latency
- Read latency on the secondary log file will contribute to redo latency
- Write latency on the secondary data file will contribute to redo latency
Check for network latency between servers
- Network latency will contribute to send latency

Possible Actions

DO NOT RESTART THE SECONDARY SERVER UNLESS ALL OPTIONS HAVE BEEN EXHAUSTED
- Restarting a secondary replica will force the existing log file to be processed via crash recovery before the Db is available for continued participation in the Availability Group.
- Send Queue will build for the duration of the crash recovery processing.
If send queue is large for a DB, look for a low send rate
- If the send rate is low, do you have network issues?
  - If the network allows, can you copy a file between the two servers?
- Turn it off & turn it back on again
  - Restart the AG Endpoints
- Look for long running SELECT queries
  - You may need to kill excessively long queries
- Look for long running transactions
  - You may need to kill, or get the user to commit. WARNING: Killing long running data changes can result in excessive rollback.
If redo queue is large for a DB, look for low redo rate
- Look for long running SELECT queries
  - You may need to kill excessively long queries
- Look for long running transactions
  - You may need to kill, or get the user to commit. WARNING: Killing long running data changes can result in excessive rollback.
- Look for IO latency on the SECONDARY
  - Look for READ latency on the LOG FILE
  - Look for WRITE latency on the DATA FILE
  - Note: Read latency on Log, write latency on Data file is backwards from what you might expect

Sources

Troubleshooting data movement latency between synchronous-commit AlwaysOn Availability Groups - Simon Su, Microsoft
Monitoring Availability Group Replica Synchronization - Jonathan Kehayias, SQLSkills
Measuring Availability Group synchronization lag - Derik Hammer
Long Running Query on Read-Only Replica that takes moments on the Primary - Stack Overflow question & answers
What the heck is the Version Store? - Andy Mallon
Tempdb Version Store cleanup - Paul White
DBA Database - Andy Mallon
sp_WhoIsActive documentation - Adam Machanic
sys.dm_db_file_space_usage - Microsoft Docs

Page updated

Google Sites

Report abuse