Contents

Sql Virtual Machine Race Condition

I’ve been having problems deploying a Sql Virtual Machine in Azure lately and decided to perform some tests to get to the bottom of the issue. I’d like to share some strange behaviour.

Problem

If I deploy a new SQL VM using bicep, it deploys fine. If however, I redeploy the same VM the deployment fails with this error:

Ext_AutomatedBackupError
  • code: Ext_AutomatedBackupError
  • Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.;System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out.

In the SQL Server ERRORLOG on the VM:

1
2
3
2022-03-06 08:24:17.62 Logon       Error: 18456, Severity: 14, State: 38.
2022-03-06 08:24:17.62 Logon       Login failed for user 'NT Service\SQLIaaSExtensionQuery'. Reason: Failed to open the explicitly specified database 'msdb'. [CLIENT: <local machine>]
2022-03-06 08:24:17.74 spid8s      Starting up database 'msdb'.

Delving into the previous ERRORLOG file, ERRORLOG.1 I can see this

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
2022-03-28 09:13:37.60 spid11s     Clearing tempdb database.
2022-03-28 09:13:37.60 spid11s     Error: 5123, Severity: 16, State: 1.
2022-03-28 09:13:37.60 spid11s     CREATE FILE encountered operating system error 3(The system cannot find the path specified.) while attempting to open or create the physical file 'D:\SQLTemp\Data\tempdb.mdf'.
2022-03-28 09:13:37.65 spid11s     Error: 17204, Severity: 16, State: 1.
2022-03-28 09:13:37.65 spid11s     FCB::Open failed: Could not open file D:\SQLTemp\Data\tempdb.mdf for file number 1.  OS error: 3(The system cannot find the path specified.).
2022-03-28 09:13:37.65 spid11s     Error: 5120, Severity: 16, State: 101.
2022-03-28 09:13:37.65 spid11s     Unable to open the physical file "D:\SQLTemp\Data\tempdb.mdf". Operating system error 3: "3(The system cannot find the path specified.)".
2022-03-28 09:13:37.65 spid11s     Error: 1802, Severity: 16, State: 4.
2022-03-28 09:13:37.65 spid11s     CREATE DATABASE failed. Some file names listed could not be created. Check related errors.
2022-03-28 09:13:37.65 spid11s     Could not create tempdb. You may not have enough disk space available. Free additional disk space by deleting other files on the tempdb drive and then restart SQL Server. Check for additional errors in the operating system error log that may indicate why the tempdb files could not be initialized.

This suggests that SQL Server is unable to create tempdb.

Machine specs

  • Location: UK South
  • VM Size: Standard_D2ds_v4
  • SqlImageType: sql2019-ws2022
  • SqlImageSku: sqldev-gen2

If you want to reproduce the issues I am having I created a repo here. Just clone the repo and change into the repo directory in PowerShell 7, Connect to the Azure subscription you want to deploy to and then run Deploy.ps1.

Initial deploy works fine

Takes 8 minutes to run.

/blog/sqliaas-problem/InitialDeploy.gif
Initial deploy (click to enlarge) Video length: 00:10
Note
Requires latest bicep and PowerShell 7.

Redeploy without rebooting

Takes 2 minutes to run.

/blog/sqliaas-problem/RedeployWithoutRebooting.gif
Redeploy without rebooting (click to enlarge) Video length: 00:10

Redeploy after VM deallocated

/blog/sqliaas-problem/RedeployAfterDeallocate.gif
Redeploy after deallocation (click to enlarge) Video length: 00:40

Strange workaround 1

Restart the SQL Server service and redeploy

/blog/sqliaas-problem/RedeployRestartSql.gif
Redeploy after Sql Service restart (click to enlarge) Video length: 00:47

Strange workaround 2

Change the VM Size to Standard_D2ds_v4 and redeploy (same as rebooting?)

/blog/sqliaas-problem/RedeployResized.gif
Redeploy after Sql Service restart (click to enlarge) Video length: 00:27

Microsoft Support

Suggestion 1: Delayed Start

I raised a Support Request with Microsoft and the suggestion that came back was to set the SQL Server service to Delayed Start. This is because there is a program scheduled using the task scheduler to create the D:\SQLTemp\Data and D:\SQLTemp\Log directories on server boot. Looking in Task Scheduler, there is indeed a task called SqlStartUp which creates these tempdb directories.

/blog/sqliaas-problem/2022-03-28_10-45-13.jpg
SqlServerStarter.exe in SqlIaaS resource

This is a race condition, where if the SQL Server service wins the race with the SqlStartUp scheduled task, SQL Server will fail to start, because the tempdb directories do not exist.

I tried the Delayed Start suggestion but I get the same error and I suspect that it worked for Microsoft because the race condition just happened to fall on the correct side. However, when I run the deployment I get the same SqlTimeOut error.

Suggestion 2: Manual Start

The next suggestion from Microsoft that seemed to work for them was to use Manual Start for the SQL Server and SQL Server Agent services. However when I tried it, I get the same SqlTimeOut error. Again, classic race condition.

Suggestion 3: Place TempDb on Premium attached storage

Adding another premium disk to the machine and configuring TempDb file to reside on there “solves” the problem. However according to Microsoft’s best practises guide for SQL Server on VMs, TempDb should be placed on the ephemeral local SSD drive (D:\ by default).

Alternative home-grown solution

One solution I am considering, but would rather not, because I am coding around Microsoft, is to set the SQL Server services to Manual start, and run a PowerShell script on startup which creates the TempDb directories, followed by a start of the SQL server services. This should eliminate the race condition.

Conclusion

For now, I will configure the Sql Virtual Machines to use premium attached storage. In my environment it is not too important to have ultra-fast TempDb performance, but I know for some people that is a requirement. I have created a branch called workaround in the repo to show you have I’ve done that. See https://github.com/markallisongit/sqliaas-demo/tree/workaround

Microsoft have escalated the issue to the SQL Product Team, and I’m still waiting for that and will post an update when I hear back.

References

https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices-vm-size

https://docs.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/performance-guidelines-best-practices-checklist