There’s no other way to put this: QNAP do not believe that a bug which risks corrupting the data on your QNAP NAS is worth mentioning in their release notes at all.
Starting in January this year, we began investigating what I suspected was a bug in the QNAP firmware on their NAS devices. Previously we had seen random corruption on a number of QNAP NAS devices owned by our clients, but could not track down why some files became corrupt. In late December, we had an incident which caused widespread corruption and compromised the integrity of around 50TB of data on a single enterprise-level NAS device. We lodged a case with QNAP and, to be honest, their support was and still is abysmal. They would respond only every few days, and only via email, to what we considered a critical issue. We escalated it to the highest levels within QNAP and, after all was done, even had one of their BDMs visit our office. Yet despite all this, they do not get the importance of good customer relations.
Here’s the background for you.
We use StorageCraft ShadowProtect a lot for our client backups. It has proven to be a solid backup and recovery solution for us for many years. We typically store these backups onsite on QNAP devices with 4 drives in a RAID 5 configuration. In “normal” operation this works well. It is only when a drive in the RAID array fails that the QNAP device has to recalculate the missing data, and when it does that it makes errors in the calculations but does not tell you about it. These errors cause the data to be corrupted. Sure, you replace the failed drive, but the device then uses the same calculations to repopulate the replacement drive with corrupted data. Basically, if you have a QNAP with a RAID 5 array AND you have a drive fail, you WILL HAVE an issue with data corruption.
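To make the failure mode concrete, here is a minimal Python sketch of how RAID 5 parity rebuilds a missing block, and why a wrong input to that calculation goes unnoticed. This is a textbook single-stripe XOR model, not QNAP's firmware code: the helper names `xor_blocks` and `rebuild_missing` are made up for illustration, and the bad parity block is simulated by hand as a stand-in for whatever the firmware actually got wrong.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Recompute the single missing data block from the survivors plus parity."""
    return xor_blocks(surviving_data + [parity])


# One stripe on a 4-drive RAID 5: three data blocks and one parity block.
d0, d1, d2 = b"BACKUP-0", b"BACKUP-1", b"BACKUP-2"
good_parity = xor_blocks([d0, d1, d2])

# Healthy case: the drive holding d1 fails and its block is rebuilt correctly.
assert rebuild_missing([d0, d2], good_parity) == d1

# Simulated fault (an assumption for illustration): parity no longer matches the data.
bad_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]
rebuilt = rebuild_missing([d0, d2], bad_parity)

print(rebuilt == d1)                                # False: the rebuilt block is wrong
print(xor_blocks([d0, rebuilt, d2]) == bad_parity)  # True: the stripe looks consistent
```

The last line is the important one. Once the wrongly rebuilt block has been written to the replacement drive, data and parity agree with each other again, so only a comparison against something outside the array can show that the block is wrong.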
You might ask: why has no one noticed this before? The bug does not corrupt ALL data, just SOME data. If you use your NAS for photos, Word documents and the like, you might not notice a corrupted photo or document for years to come. However, if you use it for backups with programs like ShadowProtect, then StorageCraft ImageManager does an integrity check on the files on a much more frequent basis. This is how we discovered that there was a corruption issue. In the past we had seen clients have a drive fail and then, a few days later, noticed that the ShadowProtect backups were playing up. We thought these were isolated incidents and didn’t connect the dots until we saw it on a much wider scale. Oh, and having the QNAP device do an integrity check of the RAID array was not good enough, as it uses the same flawed calculations and therefore thinks everything is fine.
We had a number of devices suffer drive failures in quick succession, and on ALL of them the ShadowProtect backups became corrupted in some way. Across all the devices, we had close to 100TB of data whose integrity was compromised. When we first saw it, we raised the issue with both StorageCraft and QNAP right away. StorageCraft were helpful in quickly determining that the files were indeed corrupted. Where we had another copy of the files, we were able to copy over the missing bits of the backup and restore the image chain to working order. In some cases, due to the client’s choice, there were no other copies of the backups, so the client lost their entire backup chain and had to start backups from scratch again.
During the investigation, it took QNAP over 2 months before they would configure StorageCraft backups and simulate the issue, despite the fact that we had replicated it in our test environment for them in advance. It was not until the end of March that they reported back to us that they had found the issue and fixed it in firmware to be released in April. They then further reported that they felt the issue was related to StorageCraft, but offered zero proof of this. I believe they felt this only because ShadowProtect was the program that highlighted that the data was corrupt; if we had had MD5 checksums of the Word and other documents, we could have proven the corruption with those too.
Among our contacts at StorageCraft Australia, their lead Tech Guru, Jack Alsop, was also heavily involved in this investigation. He reported that they too had seen a number of incidents where data corruption occurred on a QNAP NAS following a disk failure, but they too did not link the disk failure to the corruption until we brought them into the investigation.
A little googling shows that we are not alone and others are experiencing similar issues. Look at this post here on Veeam’s forums, where users report corruption after disk failure. In that thread, one Veeam user reports seeing this after a disk failure; another reports seeing the issue at least 3 times with QNAP after a disk failure; yet another reports that, as of a few weeks ago, he has seen the issue on disk failure for at least 12 months; and, more concerning still, a user with a Synology NAS reports it too. That makes me wonder if there are more issues with Linux-based NAS devices out there that other vendors are seeing and not reporting. I can’t, however, comment on the Synology report as I’ve never seen or used one.
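For anyone who wants that kind of independent proof for ordinary files, a checksum manifest is easy to build. The sketch below is one way to do it in Python, not something we used during the investigation: the function names are hypothetical, and it defaults to SHA-256 rather than MD5 (pass `"md5"` if you prefer; either is fine for spotting accidental corruption). The idea is to record a digest per file while the array is healthy, then compare again after a rebuild.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so large backup images do not need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files present in both manifests whose contents no longer match."""
    return sorted(
        name
        for name, digest in old_manifest.items()
        if name in new_manifest and new_manifest[name] != digest
    )
```

Keep the saved manifest somewhere other than the NAS itself. Files that were legitimately edited between the two runs will also show up in `changed_files`, so this works best on data that should not change, such as archives or completed backup images.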
So, following the release of the updated firmware that fixed the problems, we scoured the release notes for advice that this indeed fixed a potential data corruption issue. Nothing could be found then and, even today, nothing is shown. We asked QNAP why they did not include this information, as we really felt it was critical for ALL to upgrade to the fixed firmware given the potential data loss that could occur. QNAP’s response was lacklustre to say the least and, as at end of June, they still do not believe that this is worthy of public notice of any kind. It is quite concerning that a vendor we’ve held in such high esteem as QNAP would decide to cover up an issue such as this. Why would they choose to place their clients’ data at risk when all they need to do is include a recommendation in the release notes along the lines of “This update resolves a potential data corruption issue should a disk in a RAID 5/6 array fail and is regarded as highly important”? A note like that would prompt most IT Professionals to get this update out to their clients ASAP. One can only wonder what other issues QNAP are fixing under the covers and not advising us about in their release notes.
All in all, we spent over 400 hours investigating, proving and fixing this issue over the last 6 months, without even a thank you from them for the effort expended. I’ve held off from posting this information and I’ve encouraged QNAP to update their release notes to reflect that this is a critical issue. However, they have not done so; they had not responded for over a month until we prompted them a week ago, and have now gone silent again. I fear that they are too worried about public perception of their NAS to release what I see as critical information.
In short, if you have any QNAP running a version before 4.3.3.0154 20170413 or 4.2.5 20170413 then upgrade it immediately as you risk data loss should you have a drive in your RAID5/6 array fail. No amount of data scrubbing will recover the data if you have a disk failure and have not upgraded to these versions at a minimum.
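If you look after a fleet of devices, a quick script can flag which ones need attention. The sketch below is a rough Python helper under one loud assumption: that the trailing 8-digit build date in the firmware string is enough to order releases, since both fixed versions above carry the build date 20170413. The function names are hypothetical, and you should confirm any borderline result against QNAP's own release notes.

```python
from datetime import date, datetime

# Build date shared by both fixed releases named in the post (4.3.3.0154 and 4.2.5).
FIXED_BUILD_DATE = date(2017, 4, 13)


def build_date(firmware):
    """Extract the trailing YYYYMMDD build stamp, e.g. from '4.3.3.0154 20170413'."""
    stamp = firmware.split()[-1]
    return datetime.strptime(stamp, "%Y%m%d").date()


def needs_upgrade(firmware):
    """True if the firmware was built before the fixed releases (assumption: date ordering)."""
    return build_date(firmware) < FIXED_BUILD_DATE


for fw in ["4.2.2 20161208", "4.3.3.0154 20170413", "4.3.3.0210 build 20170606"]:
    status = "upgrade before any drive swap" if needs_upgrade(fw) else "at or past the fixed build"
    print(f"{fw}: {status}")
```

The helper only reads the last whitespace-separated token, so strings written as "4.3.3.0210 build 20170606" parse the same way as "4.3.3.0210 20170606".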
Please share this message with other resellers and end users so they can protect their data too.
UPDATE 21/7/2017 – QNAP have now updated their release notes with the wording to indicate the severity of the issue, and I understand are issuing further communication to their channel today. I’ve asked QNAP to put this information at the top of their release notes so that it is apparent to all concerned that this issue is a serious one and one that the resellers/end users need to address by way of a firmware update. Thank you to Ripple Wu – Product Manager for QNAP for accepting that this issue is worthy of such action.
UPDATE 25/7/2017 – QNAP have fully accepted that this is a serious issue and made some significant changes in how they will communicate this information to the public. Read more about it here
Ripple Wu says
Dear Wayne
I am Product Manager Ripple from QNAP System Inc. While this is a non-official reply, I thank you and your team for taking the time to investigate and highlight the issue to us. I also apologise if you did not have a satisfactory experience with our related support processes. As I have checked, we are already reviewing our release note draft with you, and we will improve and release it as soon as possible.
It is confirmed that our storage system does cause a hidden data integrity issue on parity blocks. As our development team investigated, the root cause is related to the compatibility between our system operation and a generic Linux option called “Skip_Copy”, which was enabled in June 2016. After the issue was reproduced in our lab and confirmed to cause data corruption along with Storage Craft Shadow Copy, we disabled it in all our 4.2 and 4.3 firmware released after February 2017.
Although we have only received official support requests related to this issue from users of Storage Craft, we do agree with you that this option can potentially cause hidden data corruption under more general conditions when a RAID is in degraded mode, and that it should receive a higher level of attention and be addressed in the release notes when fixed. While I believe our support’s indication that the issue was related to Storage Craft was not intended, as mentioned, we will now notify our users to update their firmware to 4.3 to avoid possible data loss with the upcoming official PR.
I cannot comment on whether this issue is also the root cause of the other cases you reported, or whether the condition applies to all Linux-based storage systems. Still, it is our responsibility to ensure that our product can protect the customer’s data, and we have now also started to actively enable the RAID scrubbing feature for our users to check and repair data integrity issues, whether or not they are caused by this case.
I thank you again for this post. Please kindly let me know if you have any further suggestions or find any other storage issue, and feel free to contact me directly at [email protected]. We will help in any case to ensure that our product can better protect our customers’ data.
Thanks
regards
Ripple
Chris says
Do you have any reason to believe this could affect RAID-1 arrays on QNAPs as well? We use ShadowProtect and have been fighting unexplained data corruption reported in ImageManager. We use 4-bay QNAPs with 2 x RAID-1 arrays.
Wayne Small says
Hi Chris, I can’t be certain as I have never deployed using RAID-1/Mirrored Array, however based on what technical information I have from QNAP it might. I would upgrade either way to be sure. Sorry I can’t be more definitive.
To be clear however, with my testing, the data is fine right up until a disk failure occurs, and it is at that point that the corruption occurs as the RAID software needs to recalc the missing data (my terminology) and it does so incorrectly. If you have a QNAP and do not have a disk failure then the data is for the moment good.
Wayne
Chris says
Thanks for taking the time to reply! I appreciate your insight as well as that of the QNAP manager.
Wayne Small says
No problems, I’m glad this info has helped.
Wayne
Moogle says
Hi Wayne ,
Firstly, thank you for bringing this to light and also for using your clout to pursue this matter.
Anyway, would this affect running RAID scrubbing? Just wondering :X
Wayne Small says
Hi Moogle,
Not sure I carry much clout to be honest, but I’m glad that this post has brought to light the issues that many experience and ultimately are getting some traction within QNAP. End result is everyone wins.
I am of the understanding that this bug would not affect RAID scrubbing process.
Wayne
Moogle says
awesome thank you 🙂
I did not have any hdd failure during that period, but i did have stuff like
– raid scrubbing
– file system check (after powercut)
so wasn’t sure if these sorts of things would be affected by this, guess not.
phew
Wayne Small says
Hi Moogle,
Glad to have helped. I’m doing an article on RAID Scrubbing now, as I’ve received a number of requests for it. Will post in next 24 hours.
Wayne
Jon says
Wayne,
Where did you publish the scrubbing article?
Wayne Small says
Jon – I have not yet completed it – will do so hopefully tonight. Work has been a bit crazy the last few days. Apologies.
Wayne Small says
Chris – Check out the response from QNAP in this thread – I had not seen this response when I posted my comment above as it got trapped in my SPAM queue.
Wayne
Diego Azevedo says
While I can’t comment on this specific issue, I will take the opportunity to mention two QNAP support experiences I had. Like yours, they were not good.
We had a pool filling up with snapshots that couldn’t be removed. There were no snapshots listed on the pool/volume yet it was reporting over 50GB of snapshots. I created a case with QNAP (first case online and second case from the NAS itself) and support requested remote support to be enabled on the device. After about a week they came back to me requesting access to be extended as it had expired. No other follow up emails or phone calls. Tired of waiting on them, and having to get it working again, I migrated the data off and destroyed the pool to get it fixed.
The second case was exactly like the first one – same issue, but this time I decided to wait a bit longer before fixing it myself, as I wanted to find the root cause for the error.
All I received from support were emails requesting remote support to be enabled, extended, and extended again. Nothing else. After about two or three weeks I had no choice but to work around the issue myself, and this time I disabled snapshots on the volumes.
I had the opportunity to express my dissatisfaction on the survey after closing the case, but again never heard back from QNAP.
dhanushka says
Thanks for the post and information Wayne.
As you mentioned, QNAP support is really bad, and they do not care about the customer’s data. I have had the pleasure of doing work with them and, like your case, it took months to get correct information (not a resolution) because of a lack of response and responsibility from QNAP support.
I would never recommend going with QNAP in a business environment.
Wayne Small says
I too have concerns about their support, but want to work with them to make it better if they are willing to do so.
Wayne
Ripple Wu says
Dear Wayne
I am Product Manager Ripple from QNAP System Inc; this is a non-official reply. After continued verification and discussions, the team has now not only updated the release notes to include this important fix:
https://www.qnap.com/en/releasenotes/index.php?cat_choose=&p=3
but also cooperated with different departments to provide a “Technical Advisory” to promote this and any future important technical advice to our users. Such advisories will include all the details of an identified issue, such as the affected product range, a summary and suggested solutions, and will be announced on our website, sent to related distributors and promoted to related users as soon as they are released.
https://www.qnap.com/en/technical-advisory/tec-201707-01
While the team surely will move forward to understand how to more actively disclose an important fix and promote such information to our users, the team and I again thank you for this post and all the related efforts, so that we can continue to improve our processes to ensure the integrity of the data that users entrust to our products.
Thanks
Regards
Ripple
Kevin says
Thanks for investigating this on everyone else’s behalf – I hadn’t been bitten by the bug, but easily could have been if any drive failed on my raid 5 arrays.
StanG says
It’s sad that this kind of public action is needed to make vendors accept their responsibility.
networker83 says
Thanks for your good work! I have got a question: I installed a new 8-bay NAS in January 2017 (TS-831X). The first firmware was 4.2.2 build 20161214. I did a RAID5-array and 2 RAID1-arrays.
Since then I updated the firmware to each new version (4.2.3, 4.3.2, 4.3.3).
In June 2017 I replaced all my HDDs with bigger ones. While I did this, firmware 4.3.3.0210 build 20170606 was installed. After unplugging each old HDD, the RAID was degraded and the rebuild process started (I did this 4 times because I installed 4 new HDDs).
I did the RAID scrubbing some days ago with firmware 4.3.3.0238 build 20170703 and no errors were found.
Now I am concerned that some of my files could be corrupted and wonder if I should take my backup and restore all files. Could you please tell me if I am affected by this problem? Thanks!
Wayne Small says
If you had 4.3.3 later than April installed BEFORE you forced the RAID into degraded mode then you will be safe from the bug that I experienced. It looks like you were above that when you replaced the drives. The RAID scrubbing is there to cross check the array and will NOT fix the data that was corrupted by the bug I experienced, but WILL fix other silent corruption issues (which I’ve never experienced).
Hope that helps – Wayne
Stephen says
Thanks for getting to the bottom of this. Lost 10TB of backups last year. QNAP denied everything, pointing the finger at StorageCraft, and while StorageCraft performed troubleshooting with me, they pointed the finger back at QNAP. We had no hard evidence, but we decided to stop selling QNAP devices.
Wayne Small says
I wish I didn’t have to do the digging that we did, but it seems to have made a major difference to a number of people which is good. Sorry you had to go through this issue too mate. 🙁
Trevor Harte says
Hi Wayne,
As discussed, Synology has announced recently that they also have updated firmware to mitigate this RAID rebuild flaw.
https://forum.synology.com/enu/viewtopic.php?t=135059
Wayne Small says
Thanks Trevor for letting me know about this – have just done a post here on the issue!
http://sbsfaq.com/?p=4321
Nishant Vora says
Dear Wayne
We had a similar file system corruption issue yesterday and found that nobody could write onto the QNAP file server. Our Network Support company informed us that disks were healthy and that they had managed to recover all the data.
Your post is very informative and helpful. Appreciate you going after QNAP and ensuring that they take responsibility for their product especially when so many people are storing valuable commercial data on their product.
I don’t think our network support company had updated the firmware based on what you pointed out.
We will be reviewing the recovery based on your post and watching out for lost/unrestored data that might be erroneously left out.
Once again thanks for the efforts.
Wayne Small says
Glad that my experiences could have helped others. It certainly was a troublesome issue for myself and many!
David Radtke says
Very timely I found this thread.
I am just about to purchase a NAS for my home/office mainly for archiving purposes and “emptying” all the desktop disks to centralise all data.
I had settled on a QNAP 431p2 over a WD consumer NAS, but after reading this I am a bit concerned as I was aiming to set up a RAID 5 array purely for its inherent fault tolerance. Also the post that mentions Synology also has a similar problem is not reassuring at all.
Question: would you still supply a QNAP 4-bay in RAID5 today?
Are there any reservations?
Am I better off just putting a big drive in one machine and mirroring it?
Trevor says
QNAP and Synology have both patched the bug, so it should be safe to run raid 5.
I was putting off an expansion from raid 1 to raid 5 until the fix was confirmed, and have since implemented raid 5. You should be ok so long as you run the latest firmware.
Howard Walker says
Hi Wayne,
I know that I am very late to the article, but I also wanted to thank you for finding this issue and getting a resolution to it.
I have a TS-470 Pro that is on 4.2.2 20161208 and has been reporting read errors on drive 4 of the RAID5 setup over the past few days, though it has not yet degraded the array. I am running a “Scan now” on the drive to see if it turns anything up, and had planned to replace the drive in the morning (UK) if it was still giving errors.
I think you have just saved my bacon as I will now do a firmware update before changing the drive.
Would this bug cause issues if I did the drive replacement using the capacity expansion function?
Could you advise on the upgrade path? Should I just go straight to 4.3.4 or should I progress through some intermediate versions?
Many thanks
Wayne Small says
Hi Howard,
Apologies for the delayed response. Head direct to the latest Firmware BEFORE replacing the drive or you will have issues.