[BUG] file corruption after read on some (not all!) mfsclient mounts #521
Comments
Could be a stupid question, but are all servers and clients equipped with ECC RAM?
@borkd Not so stupid, but the problem started some time ago and there was no change in the MFS version etc. in between.
Are any of your machines 2-processor ones?
@chogata Yes, two of them: mfs-master (2x CPU) and vm-host (2x CPU). I just checked a file that I was checking a few days ago... The file was not modified in between... Just to show you what I've described above.
and after some time...
I'm inclined to blame your vm-host machine. We had 2 confirmed cases with our clients, we researched this a bit, and we found other people having the same problem with different software. Basically, there is a problem, not very common but real, where a 2-processor machine fails to refresh the processor (high-level) cache in time. When this failure happens, a process reading data from a cache cell reads a previous value, not the current one, which leads to all sorts of strange and unexpected behaviours in software.

For some reason it happens more often (or even exclusively?) with processes that use "a lot" of RAM. I'm no expert, but I suspect that when a process has a lot of memory allocated, part of it in the address space managed by the 1st CPU and another part in the space managed by the 2nd CPU, the kernel is inclined to switch the process between CPUs more often (following the memory, perhaps?), thus increasing the chance of a failed processor cache synchronisation.

Our 2 clients with 2-processor machines had them running different modules. In one case it was a chunk server, which had spectacular fails with core dumps; in the other it was a client machine, like yours, also with data problems but no failure of the process itself. Note that the client with the chunk server had more than one 2-processor machine, but only one failed at regular intervals. Which leads me to believe, personally, that this might be a rarely spotted hardware issue. But it could also be software (kernel) related, as we never had a case of 2 machines with absolutely identical versions of system and kernel with only one of them failing. In both cases we investigated long and hard using debugging software and the results were indisputable: one thread puts a value (that we know) in a certain memory cell, right after that another thread reads that cell and behaves in a way that tells us it MUST have read something different than what the first thread wrote. And the previous value residing in that memory cell always fit the bill (i.e. explained the otherwise unexpected behaviour). It also aligns with what you wrote about remounting helping initially: just after a remount your mount uses less memory (it has not allocated all those caches yet ;) ) and that memory is probably "orderly" and in "one piece". After a time it grows and also starts to fragment.
Having a "fingerprint" of systems where such rare but significant issues were found might help with efforts to reproduce.
@chogata Were there enough memory fence instructions to ensure ordering between the loads and stores on each CPU? Do I understand correctly:
Something more is needed here: what tells us that 3. happens after 2.? I.e., with respect to what should the memory-ordering instructions ensure ordering?
@inkdot7 In the case of the crashing chunk server we traced core dumps; they showed us what happened, instruction after instruction, in the moments before the process crashed. So yes, those things happened in that order. It was more like:
We concluded that it must have read x, after reading the available materials on the net about similar problems. To kind of answer your question: what should ensure the ordering? The compiler. We investigated this path too, but the cases of bad cache refreshing are rare enough that it's hard to blame the code. We've also found some claims that using mmap may cause the problem; we don't really see how, but we reverted to malloc, just in case. BTW, I forgot one more case: a two-processor client machine had a very frequent problem with data integrity (file length values) that could only be logically explained by the "cache refresh problem". We described the possible cause to the company that owned this particular machine and they investigated. It turned out one of the coolers in a sophisticated cooling system was malfunctioning and the processors' temperature was higher than usual, but not high enough to cause an emergency shutdown, just some log messages (that nobody read ;) ). When they replaced the cooler, the machine "went back to normal", i.e. they never had the problem with incorrect values on this machine again.
@chogata I do not think that ordering by the compiler alone is enough. The compiler also needs to emit instructions so that the CPU does not do things out of order. Consider a shared memory area with two locations, a and b. E.g. a could be a flag or counter telling whether the value in b is valid for use. At the start, both locations are '0'. Thread A does the following:
Thread B does the following:
And then it would e.g. only use the value from location b if the value from location a is '1'. What are the possible read outcomes for thread B? If it runs before A, it gets a=0,b=0. If it runs after A, it gets a=1,b=2. If thread B runs the code around the same time as thread A, it could get a=0,b=2. But it can also get a=1,b=0, even if the compiler has made sure to put the writes in A and the reads in B in the given order. The processor memory model typically gives it the freedom to reorder memory operations on the fly, which also includes not forcing the caches of the CPUs to immediately reflect updates in order. On e.g. x86(64) there are fence instructions to enforce this. Thread A:
Thread B:
Then B will never see the case a=1,b=0. I recently ran into a problem where an if-statement checking the value read from a before possibly doing the read of b was not enough: without a memory barrier instruction, the ARM M2 CPU had sometimes already speculatively done the b read before the a read.
Okay, but "good coding practice" requires you to use locks in scenarios like the above. So it would be: Thread A:
Thread B:
And one would expect the compiler to make sure the locks are honoured. MooseFS code always uses locks when it writes to a memory fragment that can potentially be accessed by other threads. Maybe I did not state it clearly, but there are other operations in between the read and the write (very few, otherwise the cache would have been refreshed). Besides, when you trace the core dump, you see exactly what the processors did, instruction after instruction, so even if we did not use locks, we would have seen that certain operations were swapped. We traced the operations that had happened and KNOW the order. And yet the value read was not valid.
To wrap up this issue... @chogata just FYI, remounting solves the problem, as you can see below. Dropping the file cache also helps. For now we have found a way to live with this and we've added some error handling on our side. Let's hope it will go away after OS/HW/MFS/etc. upgrades in the future ;)
@oszafraniec - can you share how your VM clients are configured (qemu config strings), and maybe some details of the hypervisor config, including networking?
Regarding hardware, I forgot to add that little tidbit earlier: one of the clients that had the biggest problem with inconsistent data pointing to the cache refreshing problem checked out the machine that gave faulty readings (they had only one that always generated the problem). It turned out one of the cooler fans was broken and the inside temperature was a few degrees higher than normal. Not enough to shut down the machine, just enough to output some error messages in the logs (which nobody bothered to read ;) ). When they replaced this one fan, the machine "went back to normal", i.e. it never gave a faulty readout of data again... Better cooling, and the cache refreshing problem went away.
MFS: 3.0.116 PRO
OS: Ubuntu 20 LTS and 18 LTS
Let me tell you a story of a problem...
We use MFS storage for a lot of big ZIP files that are generated by an app and later downloaded via FTP, unzipped, etc. We have app servers as well as FTP servers with this storage mounted, so a few independent mount points to the same export.

Some time ago we noticed that some ZIP files did not unzip fully. Not all, just some out of the thousands downloaded. We thought the file corruption happened during the FTP transfer, so we added a "zip -T" test after downloading each file. If the test failed, we downloaded the file again, which solved the problem. But after some time we noticed that even redownloading a file didn't always help, and a file had to be redownloaded a few times. The network was checked: no problem there. So we ran the same test on the FTP server locally, on the mfsmount point, and it also failed. This was strange, because the same "zip -T" test on another app machine was OK. It got stranger when we found that, running the test a few times in a row, we got alternating OK and failed results. Mfsfileinfo said the file was VALID. After some time we found which mount/machine was almost always giving us a failed test while the others were OK. So we made a copy of the file on each of those machines and... the copy made through the "failed" mount was corrupted, while the one made through the "OK" mount was fine. But... after some time the "bad" file became OK. Magic ;)

I believe it depends on which chunkserver the data is downloaded from by mfsmount (we have 2+ copies). I believe we have a chunkserver that is corrupting some data, but I can't say which one. Or maybe it's some glitch in an mfsmount that has been running for a long time, processing a lot of data (avg. 1-4 TB/day is read by it). Maybe some cache corruption?
We've changed the goal of a file from 2 to 4 to force MFS to reread chunks, check checksums, etc. Everything is VALID all the time. Even after changing the goal, we still see the same behaviour for a problematic file.

In the end... remounting the mfsmount point usually solves the problem for some time.

Every time, we were able to read the file from MFS without an error. The file size was always the same. Only ZIP complained about the content of the file.

The question is: how can we dig deeper to see what mfsmount is doing during the read of such a file? Is there a way to compare which chunkservers the data is downloaded from? Does mfsmount check chunk checksums during reads?

This is a very rare situation, but it shows up from time to time. Knowing how to diagnose it further would be helpful, so we can give more feedback on this.
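One low-tech way to narrow down which mount is serving bad data is to checksum the same file through every mount point and compare. This is a sketch under assumptions: the mount paths and file name are hypothetical placeholders (substitute your own); `mfsfileinfo` is the real MooseFS tool already mentioned above, useful for listing which chunkservers hold each chunk:

```shell
#!/bin/sh
# Compare the same MFS file as seen through two different mounts.
# A mismatch points at the mount (or its cache), since both paths
# lead to the same chunks on the chunkservers.
compare_mounts() {
    relpath="$1"; mount_a="$2"; mount_b="$3"
    sum_a=$(md5sum "$mount_a/$relpath" | cut -d' ' -f1)
    sum_b=$(md5sum "$mount_b/$relpath" | cut -d' ' -f1)
    if [ "$sum_a" = "$sum_b" ]; then
        echo "MATCH $relpath"
    else
        echo "MISMATCH $relpath ($sum_a vs $sum_b)"
    fi
}

# Example usage (hypothetical paths):
#   compare_mounts archive/batch1.zip /mnt/mfs-app /mnt/mfs-ftp
#   mfsfileinfo /mnt/mfs-ftp/archive/batch1.zip  # chunk copies and locations
```

Running this periodically against a known-problematic file, both before and after dropping caches or remounting, would show whether the corruption follows a particular mount rather than a particular chunkserver.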