There’s no other way to put this: QNAP do not believe that a bug which risks corrupting the data on your QNAP NAS is worth mentioning in their release notes at all.
Starting in January this year, we began investigating what I suspected was a bug in the QNAP firmware on their NAS devices. Previously we had seen random corruption on a number of QNAP NAS devices owned by our clients, but could not track down why some files became corrupt. In late December, we had an incident which caused widespread corruption and compromised the integrity of around 50TB of data on a single enterprise-level NAS device. We lodged a case with QNAP and, to be honest, their support was and still is abysmal. They would respond only every few days, and only via email, to what we considered to be a critical issue. We escalated it to the highest levels within QNAP and, after all was done, even had one of their BDMs visit our office. Yet despite all this they do not get the importance of good customer relations.
Here’s the background for you.
We use StorageCraft ShadowProtect a lot for our client backups. It has proven to be a solid backup and recovery solution for us for many years. We typically store these backups onsite on QNAP devices with 4 drives in a RAID 5 configuration. In “normal” operation this works well. It is only when a drive in the RAID array fails that the QNAP device has to recalculate the missing data, and when it does that it makes errors in the calculations but does not tell you about it. These errors cause the data to be corrupted. Sure – you replace the failed drive, but the device then uses the same calculations to repopulate the replacement drive with corrupted data. Basically, if you have a QNAP with a RAID 5 array AND you have a drive fail, you WILL HAVE an issue with data corruption.
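For readers who have not looked at how RAID 5 recovers a missing drive, here is a minimal Python sketch of the general idea. It is textbook XOR parity on one stripe, not QNAP's actual code, and the `buggy_rebuild` function is a purely hypothetical stand-in for "a rebuild that gets the calculation wrong". The point it illustrates is that a wrong rebuild simply returns wrong bytes; nothing raises an error.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on an illustrative 4-drive RAID 5 array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] fails. A correct rebuild XORs the surviving
# data blocks with the parity block to recover the missing block.
survivors = [data[0], data[2], parity]
rebuilt_ok = xor_blocks(survivors)


def buggy_rebuild(blocks):
    """Hypothetical faulty rebuild: same calculation, but one bit comes out wrong."""
    good = xor_blocks(blocks)
    return bytes([good[0] ^ 0x01]) + good[1:]


# The faulty rebuild completes without any error, yet the block is not the original.
rebuilt_bad = buggy_rebuild(survivors)

print(rebuilt_ok == data[1])   # correct rebuild matches the lost block
print(rebuilt_bad == data[1])  # faulty rebuild silently differs
```

Whatever the rebuild returns is what gets served to applications and written to the replacement drive, which is why the damage only shows up when something independent checks the file contents.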
You might ask – why has no one noticed this before? The bug does not corrupt ALL data, just SOME data. If you use your NAS for photos, Word documents and the like, you might not notice a corrupted photo or document for years to come. However, if you use it for backups with programs like ShadowProtect, then StorageCraft ImageManager does an integrity check on the files on a much more frequent basis. This is how we discovered that there was a corruption issue. In the past we had seen clients have a drive fail, and then a few days later noticed that the ShadowProtect backups were playing up. We thought these were isolated incidents and didn’t connect the dots until we saw it on a much wider scale. Oh – having the QNAP device do an integrity check of the RAID array is not good enough, as it uses the same flawed calculations and therefore thinks everything is fine.
We had a number of devices have drives fail in quick succession – and on ALL of them, the ShadowProtect backups became corrupted in some way. In all, across all the devices, we had close to 100TB of data which had its integrity compromised. When we first saw it, we raised the issue with both StorageCraft and QNAP right away. StorageCraft were helpful in quickly determining that the files were indeed corrupted. Where we had another copy of the files, we were able to copy over the missing bits of the backup and restore the image chain to working order. In some cases, due to the client’s choice, there were no other copies of the backups, so the client lost their entire backup chain and had to start backups from scratch again.
During the investigation, it took QNAP over 2 months before they would configure StorageCraft backups and simulate the issue, despite the fact that we had replicated it on our test environment for them in advance. It was not until the end of March that they reported back to us that they had found the issue and fixed it in firmware to be released in April. They then further reported that they felt the issue was related to StorageCraft, but offered zero proof of this. I believe they felt that way only because StorageCraft was the program that highlighted the corrupt data; if we had had MD5 checksums of the Word and other documents, we could have proven the corruption there too.
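As an illustration of the MD5 point, here is a rough Python sketch of how ordinary files could be checked independently of any backup product. The function names (`md5_of`, `build_manifest`, `find_changed`) are made up for this example and are not part of any QNAP or StorageCraft tool. The idea is to record a checksum for every file while the array is healthy, then compare again after a drive failure and rebuild.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): md5_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_changed(root, manifest):
    """List files that are missing or whose digest no longer matches the manifest."""
    root = Path(root)
    changed = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or md5_of(target) != expected:
            changed.append(name)
    return changed
```

A mismatch only proves the bytes changed, so this is most useful for files that should never be modified once written, such as backup images or archived documents. The manifest itself should be kept somewhere other than the NAS being checked.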
Our contacts at StorageCraft Australia were also heavily involved in this investigation, in particular their lead Tech Guru, Jack Alsop. He reported that they too had seen a number of incidents where data corruption occurred on a QNAP NAS following a disk failure, but they had not linked the disk failure to the corruption until we brought them into the investigation.
A little googling shows that we are not alone and others are experiencing similar issues. Look at this post here on Veeam’s forums where users report corruption after disk failure. In that thread, one Veeam user reports seeing this after a disk failure; another reports seeing the issue at least 3 times with QNAP after a disk failure; yet another reports that, as of a few weeks ago, he has seen the issue on disk failure for at least 12 months; and, more concerning still, a user with a Synology NAS reports it too. That makes me wonder if there are more issues with Linux-based NAS devices out there that other vendors are seeing and not reporting. I can’t however comment on the Synology report as I’ve never seen or used one.
So – following the release of the updated firmware that fixed the problems, we scoured the release notes for advice that it did indeed fix a potential data corruption issue. Nothing could be found then and, even today, nothing is shown. We asked QNAP why they did not include this information, as we really felt it was critical for ALL to upgrade to this fixed firmware given the potential data loss that could occur. QNAP’s response was lacklustre to say the least, and as at the end of June, they still fail to believe that this is worthy of public notice of any kind. It is quite concerning that a vendor we’ve held in such high esteem as QNAP would decide to cover up an issue such as this. Why would they choose to place their clients’ data at risk when all they need to do is include a recommendation in the release notes along the lines of “This update resolves a potential data corruption issue should a disk in a RAID 5/6 array fail and is regarded as highly important”? A note like that would prompt most IT Professionals to get this update out to their clients ASAP. One can only wonder what other issues QNAP are fixing under the covers and not advising us about in their release notes.
All in all we spent over 400 hours investigating, proving and fixing this issue over the last 6 months, without even a thank you from them for the efforts expended. I’ve held off from posting this information and I’ve encouraged QNAP to update their release notes to reflect that this is a critical issue. However, they have not done so; they had not responded for over a month until we prompted them a week ago, and have now gone silent again. I fear that they are too worried about public perception of their NAS to release what I see as critical information.
In short, if you have any QNAP running a version before 4.3.3.0154 20170413 or 4.2.5 20170413, then upgrade it immediately, as you risk data loss should a drive in your RAID 5/6 array fail. No amount of data scrubbing will recover the data if you have a disk failure and have not upgraded to at least these versions.
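If you look after a fleet of devices, a check like the one below could help flag the ones that need attention. This is only a sketch under two assumptions of mine: that you have each device's firmware as a string in the same "version, space, build date" form used above, and that a build date earlier than 13 April 2017 means the fix is not present. Confirm against QNAP's own release notes before relying on it. The names `FIXED_BUILD_DATE`, `build_date` and `needs_upgrade` are made up for this example.

```python
from datetime import date, datetime

# Build date shared by both fixed releases named in this post (20170413).
FIXED_BUILD_DATE = date(2017, 4, 13)


def build_date(firmware):
    """Extract the build date from a string such as '4.3.3.0154 20170413'."""
    stamp = firmware.split()[-1]
    return datetime.strptime(stamp, "%Y%m%d").date()


def needs_upgrade(firmware):
    """Assumption: any build dated before the fixed build date lacks the fix."""
    return build_date(firmware) < FIXED_BUILD_DATE


for firmware in ["4.3.3.0154 20170413", "4.2.5 20170413"]:
    print(firmware, "needs upgrade:", needs_upgrade(firmware))
```

A string without a trailing build date will raise a `ValueError`, which is deliberate: better to stop and check that device by hand than to guess.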
Please share this message with other resellers and end users so they can protect their data too.
UPDATE 21/7/2017 – QNAP have now updated their release notes with wording that indicates the severity of the issue, and I understand they are issuing further communication to their channel today. I’ve asked QNAP to put this information at the top of their release notes so that it is apparent to all concerned that this is a serious issue, and one that resellers and end users need to address by way of a firmware update. Thank you to Ripple Wu – Product Manager for QNAP – for accepting that this issue is worthy of such action.
UPDATE 25/7/2017 – QNAP have fully accepted that this is a serious issue and made some significant changes in how they will communicate this information to the public. Read more about it here