AG Send/Redo overview
Determine if latency is in Send Queue, or Redo Queue
Check for open transactions, including idle sessions
Check for long-running active queries
Check Version Store
Check for parallel redo
Run this on the Primary Replica, in order to get info about all of the replicas.
Running this on a secondary replica will give you information only about that secondary replica.
In [ ]:
-- ▶ Run this on **PRIMARY**
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT
--AG info
AGName = ag.name,
ReplicaServer = ar.replica_server_name,
DbName = adc.database_name,
-- Queue Size & rates
SendQueueSizeMB = drs.log_send_queue_size/1024,
RedoQueueSizeMB = drs.redo_queue_size/1024,
SendRateKb = drs.log_send_rate,
RedoRateKb = drs.redo_rate,
-- State Info
IsLocal = drs.is_local,
IsPrimary = drs.is_primary_replica,
SynchronizationState = drs.synchronization_state_desc,
SynchronizationHealthState = drs.synchronization_health_desc,
-- Last ____ Time
LastSentTime = drs.last_sent_time,
LastRedoTime = drs.last_redone_time,
LastReceivedTime = drs.last_received_time,
LastHardenedTime = drs.last_hardened_time,
LastCommitTime = drs.last_commit_time,
-- LSN info
RecoveryLSN = drs.recovery_lsn,
TruncationLSN = drs.truncation_lsn,
LastSentLSN = drs.last_sent_lsn,
LastReceivedLSN = drs.last_received_lsn,
LastHardenedLSN = drs.last_hardened_lsn,
LastRedoneLSN = drs.last_redone_lsn,
EndOfLogLSN = drs.end_of_log_lsn,
LastCommitLSN = drs.last_commit_lsn,
--Oh yeah, filestream, too
FileStreamSendRate = drs.filestream_send_rate
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_databases_cluster AS adc
ON drs.group_id = adc.group_id AND drs.group_database_id = adc.group_database_id
JOIN sys.availability_groups AS ag
ON ag.group_id = drs.group_id
JOIN sys.availability_replicas AS ar
ON drs.group_id = ar.group_id AND drs.replica_id = ar.replica_id
ORDER BY
-- ag.name, ar.replica_server_name, adc.database_name;
COALESCE(drs.log_send_queue_size,99999) + COALESCE(drs.redo_queue_size,99999) DESC;
Sometimes the Perfmon counters & DMVs don't align 100%. This query will provide the higher of the two values.
▶ Run this on PRIMARY
In [ ]:
-- ▶ Run this on **PRIMARY**
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH SendQueue AS (
SELECT ServerName = rcs.replica_server_name,
DbName = db_name( drs.database_id),
UnsentLogKb = COALESCE(drs.log_send_queue_size,0)
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.dm_hadr_availability_replica_cluster_states rcs ON rcs.replica_id = drs.replica_id
WHERE drs.last_sent_time IS NOT NULL
UNION ALL
SELECT ServerName = @@SERVERNAME,
DbName = LTRIM(RTRIM(instance_name)),
UnsentLogKb = cntr_value
FROM sys.[dm_os_performance_counters]
WHERE object_name = 'SQLServer:Database Replica'
AND counter_name = 'Log Send Queue'
)
SELECT
ServerName,
DbName,
UnsentLogKB = MAX(UnsentLogKb)
FROM SendQueue
GROUP BY ServerName, DbName
ORDER BY UnsentLogKb DESC;
Open transactions can affect both Send & Redo.
Open transactions that exist on either Primary or Secondary can affect the Redo & Send Queues, so you must check for open transactions on BOTH sides of the servers experiencing latency.
▶ Run this on PRIMARY AND SECONDARY
In [ ]:
-- ▶ Run this on **PRIMARY AND SECONDARY**
DECLARE
@DurationThreshold smallint = 1,
@OnlySleepingSessions bit = 0
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
--If we're in Debug mode, ignore @DurationThreshold parameter, Always use 1 minute.
DECLARE @Id int = 1,
@Spid int = 0,
@JobIdHex nvarchar(34),
@JobName nvarchar(256),
@Sql nvarchar(max),
@EmailFrom varchar(max),
@EmailBody nvarchar(max),
@EmailSubject nvarchar(255);
DROP TABLE IF EXISTS #OpenTrans;
CREATE TABLE #OpenTrans (
Id int identity(1,1) PRIMARY KEY,
Spid smallint,
BlockingSpid smallint,
TransactionLengthMinutes AS DATEDIFF(mi,TransactionStart,GETDATE()),
DbName sysname,
HostName nvarchar(128),
ProgramName nvarchar(128),
LoginName nvarchar(128),
LoginTime datetime2(3),
LastRequestStart datetime2(3),
LastRequestEnd datetime2(3),
TransactionCnt int,
TransactionStart datetime2(3),
TransactionState tinyint,
Command nvarchar(32),
WaitTime int,
WaitResource nvarchar(256),
SqlText nvarchar(max),
SqlStatement nvarchar(max),
InputBuffer nvarchar(4000),
SessionInfo xml
);
DROP TABLE IF EXISTS #InputBuffer;
CREATE TABLE #InputBuffer (
EventType nvarchar(30),
Params smallint,
EventInfo nvarchar(4000)
);
--Grab all sessions with open transactions
INSERT INTO #OpenTrans (Spid, BlockingSpid, DbName, HostName, ProgramName, LoginName, LoginTime, LastRequestStart,
LastRequestEnd, TransactionCnt, TransactionStart, TransactionState, Command, WaitTime, WaitResource, SqlText, SqlStatement)
SELECT s.session_id AS Spid,
r.blocking_session_id AS BlockingSpid,
COALESCE(db_name(dt.database_id),CAST(dt.database_id as nvarchar(10))) AS DbName,
s.host_name AS HostName,
s.program_name AS ProgramName,
s.login_name AS LoginName,
s.login_time AS LoginTime,
s.last_request_start_time AS LastRequestStart,
s.last_request_end_time AS LastRequestEnd,
-- Need to use sysprocesses for now until we're fully on 2012/2014
(SELECT TOP 1 sp.open_tran FROM master.sys.sysprocesses sp WHERE sp.spid = s.session_id) AS TransactionCnt,
--s.open_transaction_count AS TransactionCnt,
COALESCE(dt.database_transaction_begin_time,s.last_request_start_time) AS TransactionStart,
dt.database_transaction_state AS TransactionState,
r.command AS Command,
r.wait_time AS WaitTime,
r.wait_resource AS WaitResource,
COALESCE(t.text,'') AS SqlText,
COALESCE(SUBSTRING(t.text, (r.statement_start_offset/2)+1, (
(CASE r.statement_end_offset
WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset
END - r.statement_start_offset)
/2) + 1),'') AS SqlStatement
FROM sys.dm_exec_sessions s
JOIN sys.dm_tran_session_transactions st ON st.session_id = s.session_id
JOIN sys.dm_tran_database_transactions dt ON dt.transaction_id = st.transaction_id
LEFT JOIN sys.dm_exec_requests r ON r.session_id = s.session_id
OUTER APPLY sys.dm_exec_sql_text (r.sql_handle) t
WHERE dt.database_transaction_state NOT IN (3) -- 3 means transaction has been initialized but has not generated any log records. Ignore it
AND (@OnlySleepingSessions = 0 OR s.status = 'sleeping')
AND COALESCE(dt.database_transaction_begin_time,s.last_request_start_time) < DATEADD(mi,-1*@DurationThreshold ,GETDATE());
-- Grab the input buffer for all sessions, too.
WHILE EXISTS (SELECT 1 FROM #OpenTrans WHERE InputBuffer IS NULL)
BEGIN
TRUNCATE TABLE #InputBuffer;
SELECT TOP 1 @Spid = Spid, @Id = Id
FROM #OpenTrans
WHERE InputBuffer IS NULL;
SET @Sql = 'DBCC INPUTBUFFER (' + CAST(@Spid AS varchar(10)) + ');';
BEGIN TRY
INSERT INTO #InputBuffer
EXEC sp_executesql @sql;
END TRY
BEGIN CATCH
PRINT 'InputBuffer Failed';
END CATCH
UPDATE b
SET InputBuffer = COALESCE((SELECT TOP 1 EventInfo FROM #InputBuffer),'')
FROM #OpenTrans b
WHERE ID = @Id;
END;
--Convert Hex job_ids for SQL Agent jobs to names.
WHILE EXISTS(SELECT 1 FROM #OpenTrans WHERE ProgramName LIKE 'SQLAgent - TSQL JobStep (Job 0x%')
BEGIN
SELECT @JobIdHex = '', @JobName = '';
SELECT TOP 1 @ID = ID,
@JobIdHex = SUBSTRING(ProgramName,30,34)
FROM #OpenTrans
WHERE ProgramName LIKE 'SQLAgent - TSQL JobStep (Job 0x%';
SELECT @Sql = N'SELECT @JobName = name FROM msdb.dbo.sysjobs WHERE job_id = ' + @JobIdHex;
EXEC sp_executesql @Sql, N'@JobName nvarchar(256) OUT', @JobName = @JobName OUT;
UPDATE b
SET ProgramName = LEFT(REPLACE(ProgramName,@JobIdHex,@JobName),128)
FROM #OpenTrans b
WHERE ID = @Id;
END;
-- Populate SessionInfo column with HTML details for sending email
-- Since there's a bunch of logic here, code is more readable doing this separate than mashing it in with the rest of HTML email creation
UPDATE t
SET SessionInfo = (SELECT TransactionState =
CASE TransactionState
WHEN 1 THEN 'The transaction has not been initialized.'
WHEN 3 THEN 'The transaction has been initialized but has not generated any log records.' -- We don�t alert on this status
WHEN 4 THEN 'The transaction has generated log records.'
WHEN 5 THEN 'The transaction has been prepared.'
WHEN 10 THEN 'The transaction has been committed.'
WHEN 11 THEN 'The transaction has been rolled back.'
WHEN 12 THEN 'The transaction is being committed. In this state the log record is being generated, but it has not been materialized or persisted.'
ELSE CAST(TransactionState as varchar)
END,
TransactionLengthMinutes = CONVERT(varchar(20),TransactionLengthMinutes,20),
SessionID = Spid,
DbName,
LoginName,
HostName,
DbName,
WaitResource,
LoginTime = CONVERT(varchar(20),LoginTime,20),
LastRequest = CONVERT(varchar(20),LastRequestStart,20),
ProgramName
FROM #OpenTrans t2
WHERE t2.id = t.id
FOR XML PATH ('Transaction') )
FROM #OpenTrans t;
--output results in debug mode:
IF NOT EXISTS (SELECT 1 FROM #OpenTrans)
SELECT 'No Open Transactions longer than ' + CAST(@DurationThreshold AS varchar(10)) + ' minutes exist' AS OpenTransactions;
ELSE
BEGIN
SELECT * FROM #OpenTrans ORDER BY TransactionLengthMinutes DESC;
END;
GO
Like long-running transactions, long-running queries can affect both Send & Redo.
Long-running queries that exist on either Primary or Secondary can affect the Redo & Send Queues, so you must check for long-running queries on BOTH sides of the servers experiencing latency.
▶ Run this on PRIMARY AND SECONDARY
In [ ]:
-- ▶ Run this on **PRIMARY AND SECONDARY**
EXEC sp_whoisactive;
Long-running active queries, or open transactions (evaluated in the prior two queries) will cause the Version Store in tempdb to become bloated. When the Version Store becomes very large, this can cause instance-wide performance issues.
▶ Run this on PRIMARY AND SECONDARY
In [ ]:
-- ▶ Run this on **PRIMARY AND SECONDARY**
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT
LongestTransactionRunTimeSeconds = CONVERT(int,cntr_value),
LongestTransactionRunTimeMinutes = CONVERT(int,cntr_value)/60.0,
EstimatedTransactionStartTime = DATEADD(SECOND,-1*COALESCE(CONVERT(int,cntr_value),0),GETDATE())
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Longest Transaction Running Time';
SELECT
PerfmonCounter = c.counter_name,
CounterValue = c.cntr_value
FROM sys.dm_os_performance_counters AS c
WHERE c.counter_name LIKE 'Version%'
AND c.object_name = 'SQLServer:Transactions';
Check the error log to determine if any databases are currently or recently going through crash recovery. Using the log here rather than current status to include past events.
▶ Run this on PRIMARY AND SECONDARY
In [ ]:
-- ▶ Run this on **PRIMARY AND SECONDARY**
EXEC xp_readerrorlog 0,1, N'Recovery of database';
In addition to viewing Version Store size by viewing Perfmon counters, you can look at TempDB allocation space.
Here, we can see how much space is allocated to Version Store, as well as user objects (ex #temp tables), internal objects (ex spools, hash tables), and other allocation information.
▶ Run this on PRIMARY AND SECONDARY
In [ ]:
-- ▶ Run this on **PRIMARY AND SECONDARY**
USE tempdb
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT
VersionStoreGB = SUM(fsu.version_store_reserved_page_count) *8.0/1024/1024,
UserObjectsGB = SUM(fsu.user_object_reserved_page_count) *8.0/1024/1024,
InternalObjectsGB = SUM(fsu.internal_object_reserved_page_count)*8.0/1024/1024,
UnallocatedGB = SUM(fsu.unallocated_extent_page_count) *8.0/1024/1024,
MixedExtentsGB = SUM(fsu.mixed_extent_page_count) *8.0/1024/1024
FROM sys.dm_db_file_space_usage AS fsu;
The Redo / DB Startup process can be slowed down by an excessive number of VLFs.
If a database has more than 1000 VLFs, consider an action item to reduce the number of VLFs. However, you will not be able to fix the VLF count while AG latency is high.
▶ Run this on PRIMARY
In [ ]:
-- ▶ Run this on **PRIMARY**
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT
DbName = d.name,
AverageVlfMb = AVG(dli.vlf_size_mb),
LargestVlfMb = MAX(dli.vlf_size_mb),
VlfCount = COUNT(*)
FROM sys.databases AS d
CROSS APPLY sys.dm_db_log_info(d.database_id) AS dli
GROUP BY d.name
ORDER BY COUNT(*) DESC;
Redo is usually single-threaded. SQL Server does support parallel redo, but only for a small number of databases. If this AG has many databases, only a few will benefit from parallel redo. Controlling which DBs will benefit from parallel redo is difficult (practically impossible), so generally assume that any given database will only have single-threaded redo.
▶ Run this on SECONDARY
In [ ]:
-- ▶ Run this on **SECONDARY**
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT DbName = db_name(r.database_id),
RedoThreads = COUNT(*)
FROM sys.dm_exec_requests AS r
WHERE r.command = 'DB STARTUP'
GROUP BY r.database_id
ORDER BY COUNT(*) DESC;
In [ ]:
Parallel redo can be disabled by enabling Trace Flag 3459. Check if parallel redo has been disabled.
If Trace Flag 3459 is on (Status = 1) if Parallel Redo is disabled.
If Trace Flag 3459 is off (Status = 0) if Parallel Redo is enabled.
▶ Run this on SECONDARY
In [ ]:
-- ▶ Run this on **SECONDARY**
DBCC TRACESTATUS(3459);
Check for changes in data volume / data velocity
Was there a recent release?
Did someone just upload a bunch of data?
You could estimate changes in data velocity by looking at the size of historical transaction log backups
A significant increase in data velocity will contribute to both send and redo latencies
Check for IO latency on the secondary server
Write latency on the secondary log file will contribute to send latency
Read latency on the secondary log file will contribute to redo latency
Write latency on the secondary data file will contribute to redo latency
Check for network latency between servers
Network latency will contribute to send latency
DO NOT RESTART THE SECONDARY SERVER UNLESS ALL OPTIONS HAVE BEEN EXHAUSTED
Restarting a secondary replica will force the existing log file to be processed via crash recovery before the Db is available for continued participation in the Availability Group.
Send Queue will build for the duration of the crash recovery processing.
If send queue is large for a DB, look for a low send rate
If the send rate is low, do you have network issues?
If the network allows, can you copy a file between the two servers?
Turn it off & turn it back on again
Restart the AG Endpoints
Look for long running SELECT queries
You may need to kill excessively long queries
Look for long running transactions
You may need to kill, or get the user to commit. WARNING: Killing long running data changes can result in excessive rollback.
If redo queue is large for a DB, look for low redo rate
Look for long running SELECT queries
You may need to kill excessively long queries
Look for long running transactions
You may need to kill, or get the user to commit. WARNING: Killing long running data changes can result in excessive rollback.
Look for IO latency on the SECONDARY
Look for READ latency on the LOG FILE
Look for WRITE latency on the DATA FILE
Note: Read latency on Log, write latency on Data file is backwards from what you might expect
Troubleshooting data movement latency between synchronous-commit AlwaysOn Availability Groups - Simon Su, Microsoft
Monitoring Availability Group Replica Synchronization - Jonathan Kehayias, SQLSkills
Measuring Availability Group synchronization lag - Derik Hammer
Long Running Query on Read-Only Replica that takes moments on the Primary - Stack Overflow question & answers
What the heck is the Version Store? - Andy Mallon
Tempdb Version Store cleanup - Paul White
DBA Database - Andy Mallon
sp_WhoIsActive documentation - Adam Machanic
sys.dm_db_file_space_usage - Microsoft Docs