This week a client’s 2016 RDS server needed a reboot. It had been up for 60 days and was being cantankerous, the typical kind of behaviour we see when Windows needs a reboot, so our tech rebooted it. The server was built just a few months ago and was fully patched at the time. It runs on Hyper-V and had been performing fine. After a few minutes, however, the server was still not back up. It was showing the black screen with the spinning circle that Windows normally displays only briefly during boot. That concerned us, and we left it for a further 5 minutes before deciding something might be wrong.
We power cycled the machine and got exactly the same result. This client has Hyper-V Replica, so we did a test failover of the VM to the other host. Guess what… it did exactly the same thing. OK, the client also has a Datto backup of this server, so we tried to boot it using the boot features on the Datto appliance. Guess what… the exact same thing there. We tried backups from the past 5 days and all behaved the same way. At this point we began to panic. We also tried booting with the network card disconnected, as we’d seen that cause issues like this in the past. No luck there either.
I broke out the Windows Recovery Environment and dug into the system. We were looking to see whether a pending.xml file was there causing patching issues, but there was none. We tried a few other things, but nothing seemed to work. I even mounted the VHDX file holding the VM’s C: drive and ran chkdsk on it, and that came back clean too.
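For anyone wanting to script that pending.xml check against a mounted VHDX, here is a rough Python sketch. It assumes the offline volume is mounted at a path you pass in, and that the file, if present, sits in the usual Windows\WinSxS folder. The function name find_pending_xml and the drive letter in the comment are placeholders for illustration, not something we ran on the day.

```python
from pathlib import Path


def find_pending_xml(volume_root):
    """Return the path to pending.xml on an offline Windows volume, or None.

    volume_root is wherever the VHDX (or recovery-environment volume) is
    mounted. This is a hypothetical helper; it only checks the standard
    Windows\\WinSxS location and does not inspect the file contents.
    """
    candidate = Path(volume_root) / "Windows" / "WinSxS" / "pending.xml"
    return candidate if candidate.is_file() else None


# Example call (the drive letter is an assumption, use your own mount point):
#   hit = find_pending_xml("E:/")
#   print("pending.xml found at", hit) if hit else print("no pending.xml")
```

A missing file only rules out one cause, as it did for us. It says nothing about why the boot is slow.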
About this time we widened the issue from the two people working on it to our whole team to see if anyone else had ideas. One of the team recalled a similar situation a few weeks back on a Windows 10 machine, a physical one, that did the same thing. His solution then was to just leave it alone, and 20 minutes later it had come back. We figured we had nothing to lose by leaving the machine booting while we dug into other potential solutions, so we left it alone. Sure enough, 20 minutes after boot the machine came up without issue. In around the same time frame (i.e. 20 minutes), the machine we were trying to boot on the Datto device ALSO came up. Wow.
At that point we got the client back online, as the downtime was already more than was planned for a simple reboot. Later that night we rebooted the machine again and it came back within 90 seconds without issue.
Later that day I was at the Sydney SMB IT Professionals meeting, and afterwards, without me mentioning it to anyone, two separate people came and asked if I’d seen an issue like this!!! They have seen it on both Windows 10 and Windows Server 2016 over the last few months, all without resolution aside from leaving it alone. It’s been seen on brand new machines right out of the box, and on machines that have been running for 18 months. I’m looking for any kind of link here, but as yet don’t have one.
We are still digging into this issue and don’t have a solution, but I wanted to post this in case others come across it as well. Give it time… 20 to 30 minutes seems to be the timeframe. I’ll update this post with more info as I have it. If you’ve seen this, please share whatever info you have.
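If you would rather not stare at the spinning circle for half an hour, a small watcher can do the waiting for you. The Python sketch below polls a TCP port until it answers or a deadline passes. It assumes the server accepts RDP connections on the default port 3389 once it is up, which is reasonable for an RDS box but is only a proxy for "finished booting". The function name wait_for_port, the 30 minute default and the host name in the comment are my own placeholders, based on the 20 to 30 minutes we observed.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30 * 60, interval_s=15, connect_timeout_s=5):
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns the number of seconds waited on success, or None if the
    deadline passed without a successful connection.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return time.monotonic() - start
        except OSError:
            pass  # refused, unreachable or timed out: not up yet
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


# Example call (the host name is a placeholder):
#   waited = wait_for_port("rds01.example.local", 3389)
#   print("up after %.0f s" % waited if waited is not None else "still down")
```

Logging how long each boot takes this way would also give us some numbers to compare if the problem comes back.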