[BUG] file corruption after read on some (not all!) mfsclient mounts #521

Open
oszafraniec opened this issue Jan 13, 2023 · 13 comments

@oszafraniec

MFS: 3.0.116 PRO
OS: Ubuntu 20 LTS and 18 LTS

Let me tell you a story of a problem...
We use MFS storage for a lot of big ZIP files that are generated by an app and later downloaded via FTP, unzipped, etc. We have app servers as well as FTP servers that have this storage mounted, so there are a few independent mount points to the same export.

Some time ago we noticed that some ZIP files were not being unzipped fully. Not all, just some out of the thousands downloaded. We thought the files got corrupted during the FTP transfer, so we added a "zip -T" test after downloading each file. If the test failed, we would download the file again, which solved the problem. But after some time we noticed that even redownloading a file didn't always help and a file had to be redownloaded several times. The network was checked, no problem there. So we ran the same test on the FTP server locally, on the mfsmount point, and it also failed. This was strange because the same "zip -T" test on another app machine was OK. It got stranger when we found out that running this test a few times in a row started giving us alternating OK and FAILED results. Mfsfileinfo said the file was VALID. After some time we found which mount/machine was almost always giving us a failed test while the others were OK. So we made a copy of the file from each of those machines and... the copy made via the "failed" mount was corrupted and the one made via the "OK" mount was fine. But... after some time the "bad" file started to be OK. Magic ;)

I believe it depends on which chunkserver the data is downloaded from by mfsmount (we have 2+ copies). I believe we have a chunkserver that is corrupting some data, but I can't say which one. Or maybe it's some glitch in an mfsmount that has been running for a long time and processing a lot of data (avg. 1-4 TB/day is read through it). Maybe some cache corruption?

We've changed the goal of a file from 2 to 4 to force MFS to reread chunks, check checksums, etc... Everything is VALID all the time. Even after changing the goal we still see the same behavior for a problematic file.

In the end... remounting the mfsmount point usually solves the problem for some time.

Every time we were able to read the file from MFS without an error. The file size was always the same. Only zip was complaining about the content of the file.

The question is how we can dig deeper to see what mfsmount is doing during the read of such a file. Is there a way to check from which chunkservers the data is downloaded? Does mfsmount check chunk checksums during read?

This is a very rare situation, but it shows up from time to time. Knowing how to diagnose it further would be helpful so we can give more feedback on this.

@borkd
Collaborator

borkd commented Jan 13, 2023

Could be a stupid question, but are all servers and clients equipped with ECC RAM?

@oszafraniec
Author

@borkd not so stupid, as the problem started some time ago and there was no change in MFS version etc. in between.
Clients are VMs, running on hosts with ECC RAM.
Masters have ECC.
Chunkservers... I need to check them all, but I think they all have ECC too (Supermicro MBs and Xeon CPUs).

@chogata
Member

chogata commented Jan 17, 2023

Are any of your machines 2-processor ones?

@oszafraniec
Author

@chogata yes, we have some: mfs-master (2x CPU), vm-host (2x CPU).
Chunkservers have only 1 CPU.

I just checked a file that I was checking a few days ago... The file was not modified in between... Just to show you what I've described above.

root@mfsmaster01:/moose/# zip -T some.zip.bad
error: invalid zip file with overlapped components (possible zip bomb)
test of 12101483.zip.bad FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)

and after some time...

root@mfsmaster01:/moose/# zip -T some.zip.bad 
test of 12101483.zip.bad OK

@chogata
Member

chogata commented Jan 17, 2023

I'm inclined to blame your vm-host machine.

We had 2 confirmed cases with our clients, and we researched this a bit and found other people having the same problem with different software. Basically, there is a problem, not very common but real, where a 2-processor machine fails to refresh the processor (high level) cache in time. When this failure happens, a process reading data from a cache cell reads the previous value, not the current one, which leads to all sorts of strange and unexpected behaviours in software.

For some reason it happens more often (or even exclusively?) with processes that use "a lot" of RAM. I'm no expert, but I suspect that when a process has a lot of memory allocated, and part of that memory is in the address space managed by the 1st CPU and another part in the space managed by the 2nd CPU, the kernel is inclined to switch the process between CPUs more often (following the memory, perhaps?), thus increasing the chance of a failed processor cache synchronisation.

Our 2 clients with 2-processor machines had them working as different modules. In one case it was a chunk server that had spectacular failures with core dumps; in the other it was a client machine, like yours, also with data problems, but no failure of the process itself. Note that the client with the chunk server had more than one 2-processor machine, but only one failed at regular intervals. Which leads me to believe, personally, that this might be a rarely spotted hardware issue. But it could also be software (kernel) related, as we never had a case of 2 machines with absolutely identical system and kernel versions with only one of them failing. In both cases we investigated long and hard using debugging software and the results were indisputable: one thread puts a value (that we know) in a certain memory cell, and right after, another thread reads that cell and behaves in a way that tells us it MUST have read something different from what the first thread wrote. And the previous value residing in that memory cell always fit the bill (aka explained the otherwise unexpected behaviour).

It also aligns with what you wrote about remounting helping initially: just after a remount your mount uses less memory (it has not allocated all those caches yet ;) ) and that memory is probably "orderly" and in "one piece". After a while its usage increases and it also starts to fragment.

@borkd
Collaborator

borkd commented Jan 25, 2023

Having a "fingerprint" of systems where such rare but significant issues were found might help with efforts to reproduce:

  • master/cs/client
  • OS / distro
  • kernel version
  • mainboard model/type
  • BIOS / firmware
  • RAM m/t and amount
  • storage controllers
  • NICs (offload, trunking..)
  • number of CPUs and their model/type
  • CPU firmware
  • vulnerability mitigation measures enabled in the running kernel or injected via 3rd party code

@inkdot7

inkdot7 commented Jan 25, 2023

In both cases we investigated long and hard using debugging software and the results were indisputable: one thread puts a value (that we know) in a certain memory cell, and right after, another thread reads that cell and behaves in a way that tells us it MUST have read something different from what the first thread wrote. And the previous value residing in that memory cell always fit the bill (aka explained the otherwise unexpected behaviour).

@chogata Were there enough memory fence instructions to ensure ordering between the loads and stores on each CPU?

Do I understand correctly:

  1. at start, the memory location has the value 'c'
  2. thread A writes a value 'a' to a memory location
  3. thread B reads from that memory location, and gets 'c'

Something more is needed here - what tells us that 3. happens after 2.? I.e., with respect to what should the memory ordering instructions ensure ordering?

@chogata
Member

chogata commented Jan 25, 2023

@inkdot7 in the case of the crashing chunk server we traced core dumps; they showed us what happened, instruction after instruction, in the moments before the process crashed. So yes, those things happened in this order. It was more like:

  1. at start, a memory location has value 'x'
  2. thread A writes a value 'y' to this memory location
  3. thread B reads a value from this memory location and behaves unexpectedly, definitely NOT as if it had read y, but IF it had read x, then the unexpected behaviour would make sense (though a host of other values besides x and y could also cause that behaviour)

We concluded that it must have read x, after reading the materials available on the net about similar problems.

To kind of answer your question: what should ensure the ordering? The compiler. We investigated this path too, but the cases of bad cache refreshing are rare enough that it's hard to blame the code. We've also found some claims that using mmap may cause the problem; we don't really see how, but we reverted to malloc, just in case.

BTW, I forgot one more case: a two-processor client machine had a very frequent problem with data integrity (file length values) that could only be logically explained by the "cache refresh problem". We described the possible cause to the company that owned this particular machine and they investigated. It turned out that one of the coolers in a sophisticated cooling system was malfunctioning and the processors' temperature was higher than usual, but not high enough to cause an emergency shutdown, just some log messages (that nobody read ;) ). When they replaced the cooler, the machine "went back to normal", AKA they never had the problem with incorrect values again on this machine.

@inkdot7

inkdot7 commented Jan 25, 2023

@chogata I do not think that ordering by the compiler alone is enough. It also needs to emit instructions to keep the CPU from doing things out of order.

Consider a shared memory area with two locations, a and b. E.g. a could be a flag or counter telling if the value b is valid for use or not.

At start, both locations are '0'.

Thread A does the following:

  1. Write '2' to location b.
  2. Write '1' to location a. (Telling that b is now valid.)

Thread B does the following:

  1. Read location a. (To check if b is valid.)
  2. Read location b.

And then it would e.g. only use the value from location b if the value from location a is '1'.

What are the possible read outcomes for thread B?

If running before A, it would get a=0,b=0.

If running after A, it would get a=1,b=2.

If thread B runs the code around the same time as thread A, then it could get a=0,b=2.

But it can also get a=1,b=0, even if the compiler has made sure to put the writes in A and the reads in B in the given order. The processor memory model typically gives it the freedom to reorder memory operations on the fly, which also includes not requiring the caches of the different CPUs to immediately reflect updates in order.

On e.g. x86(-64), there are the mfence, lfence and sfence instructions to tell the processor that either all memory accesses (mfence), or just reads or writes (lfence or sfence), need to be performed in order. So if the code above is changed to:

Thread A:

  1. Write '2' to location b.
  2. Execute sfence instruction.
  3. Write '1' to location a.

Thread B:

  1. Read location a.
  2. Execute lfence instruction.
  3. Read location b.

Then B will never see the case a=1,b=0.

I recently ran into a problem where an if-statement checking the value read from a before possibly doing the read of b was not enough. Without a memory barrier instruction, the ARM M2 CPU had sometimes already speculatively done the b read before the a read.
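
To make the above concrete, here is a minimal, self-contained C sketch (not MooseFS code) of the a/b example, written with C11 atomics rather than raw fence instructions: a release store on the flag paired with an acquire load is roughly the portable equivalent of the sfence/lfence pairing described above, and it rules out the a=1,b=0 outcome that relaxed (unordered) accesses would allow on a weakly ordered CPU.

```c
/* Minimal sketch of the flag/payload ordering example (not MooseFS code). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int a = 0;   /* flag: 1 means "b is valid" */
static atomic_int b = 0;   /* payload */

static void *thread_A(void *arg) {
    (void)arg;
    atomic_store_explicit(&b, 2, memory_order_relaxed);  /* 1. write payload */
    atomic_store_explicit(&a, 1, memory_order_release);  /* 2. publish flag  */
    return NULL;
}

static void *thread_B(void *arg) {
    (void)arg;
    int flag = atomic_load_explicit(&a, memory_order_acquire);  /* 1. read flag    */
    int val  = atomic_load_explicit(&b, memory_order_relaxed);  /* 2. read payload */
    /* With the release/acquire pair this branch can never be taken; with
     * memory_order_relaxed on both sides it can, on weakly ordered CPUs. */
    if (flag == 1 && val == 0)
        puts("saw a=1,b=0 - stale payload");
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_A, NULL);
    pthread_create(&tb, NULL, thread_B, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```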

@chogata
Member

chogata commented Jan 25, 2023

Okay, but "good coding practice" requires you to use locks to create scenarios like the above. So it would be:

Thread A:

  1. obtain lock x
  2. write to location b (some value, maybe 2)
  3. write to location a (value 1 to say location b can be read now)
  4. release lock x

Thread B:

  1. obtain lock x
  2. if there is 1 in a, read b
  3. release lock x

And one would expect the compiler to make sure the locks are honoured. MooseFS code always uses locks when it writes to a memory fragment that can potentially be accessed by other threads. Maybe I did not state it clearly, but there are other operations in between the write and the read (very few, otherwise the cache would have been refreshed). Besides, when you trace the core dump, you see exactly what the processors did, instruction after instruction, so even if we did not use locks, we would see that certain operations were swapped. We traced the operations that happened and KNOW the order. And yet, the value read is not valid.
I know it's hard to believe; two of us sat for 2 days analysing one core dump, because we could not believe it either at the start :)
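
For comparison, here is a minimal sketch (again, not the actual MooseFS code) of the lock-based pattern described above: pthread_mutex_lock/unlock also act as memory barriers on a conforming implementation, so as long as both threads take the lock, thread B can never observe a == 1 together with a stale b == 0.

```c
/* Minimal sketch of the lock-protected flag/payload pattern (not MooseFS code). */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
static int a = 0;   /* flag: 1 means "b can be read now" */
static int b = 0;   /* payload */

static void *thread_A(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x);    /* 1. obtain lock x          */
    b = 2;                     /* 2. write payload          */
    a = 1;                     /* 3. mark payload as valid  */
    pthread_mutex_unlock(&x);  /* 4. release lock x         */
    return NULL;
}

static void *thread_B(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x);    /* 1. obtain lock x                    */
    if (a == 1)                /* 2. if the flag is set, read payload */
        printf("b = %d\n", b);
    pthread_mutex_unlock(&x);  /* 3. release lock x                   */
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, thread_A, NULL);
    pthread_create(&tb, NULL, thread_B, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```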

@oszafraniec
Author

oszafraniec commented Feb 28, 2023

As a wrap of this issue...

@chogata just FYI, remounting solves the problem, as you can see below. Merely dropping the file cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn't solve it. Modifying the file (regenerating the zip file in our case) also helps. I will try to provide you with any feedback I can from a user perspective. We now have extra tests for files and can catch errors like this. Still, the scale of the problem is tiny, around ~0.04%, but noticeable in our case (100 out of 250k file downloads per month).

For now, we have found a way to live with this and we've added some error handling on our side. Let's hope it will go away after OS/HW/MFS/etc. upgrades in the future ;)

(file was read before from MFS mount and shows up as corrupted)
root@ftpgw:~# 
root@ftpgw:~# zip -T /mnt/12368428.zip 
error: invalid zip file with overlapped components (possible zip bomb)
test of /mnt/12368428.zip FAILED

zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)
root@ftpgw:~# 
root@ftpgw:~# umount -v /mnt && mount -av
umount: /mnt (mfsmaster:9421) unmounted
/                        : ignored
none                     : ignored
/mnt                     : successfully mounted
root@ftpgw:~# 
root@ftpgw:~# zip -T /mnt/12368428.zip 
test of /mnt/12368428.zip OK
root@ftpgw:~# 
(now same file is OK)

@borkd
Collaborator

borkd commented Mar 2, 2023

@oszafraniec - can you share how your VM clients are configured (qemu config strings), and maybe some details of the hypervisor config, including networking?

@chogata
Member

chogata commented Mar 13, 2023

Regarding hardware, I forgot to add that little tidbit earlier: one of the clients that had the biggest problem with inconsistent data, that pointed to the cache refreshing problems, checked out their machine that gave faulty readings (they had only one that always generated a problem). It turned out one of the cooler fans was broken and the inside temperature was a few degrees higher than normal. Not enough to shut down the machine, just to output some error messages in the logs (which nobody bothered to read ;) ). When they replaced this one fan, the machine "went back to normal", aka it never gave a faulty readout of data again... Better cooling and the cache refreshing problem went away.
