BTRFS - most unfriendly to SSD/NVMe/SD cards compared to any other FS. Write Amplification #760
Comments
btrfs isn't really suitable for (large) databases or VMs. Some databases (e.g. postgres), when run on btrfs, explicitly switch data files to nodatacow mode, and virt-manager turns on nodatacow for VM images. Nodatacow helps, but in my opinion it is better to use a different filesystem for such workloads (e.g. xfs, which has reflinks too and is generally well suited for this kind of task). btrfs is much better for general-purpose usage. FWIW.
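For anyone who wants to experiment with this, here is a minimal sketch (not from the comment above; the constants are the standard Linux FS_IOC_GETFLAGS/FS_IOC_SETFLAGS/FS_NOCOW_FL values for x86-64, and the target path is hypothetical) of setting the NOCOW flag from Python, equivalent to `chattr +C`. On btrfs the flag only takes effect on empty files, or on directories where newly created files inherit it:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601  # ioctl numbers as defined for x86-64 Linux
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL     = 0x00800000  # the "C" flag shown by lsattr

def set_nocow(path: str) -> None:
    """Set the NOCOW inode flag on a file or directory (like `chattr +C`)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("i", 0))
        (flags,) = struct.unpack("i", buf)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("i", flags | FS_NOCOW_FL))
    finally:
        os.close(fd)

# Hypothetical example: new VM images created in this directory inherit NOCOW.
set_nocow("/var/lib/libvirt/images")
```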
As you clearly found out, one of the worst-case scenarios for the current Btrfs design is an fsync after each small write. On ZFS, which is also a COW filesystem, this specific situation is handled via the ZFS Intent Log (ZIL). Synchronous writes are first stored in the ZIL, then a background task moves the data into permanent storage. ZFS can further store the ZIL on a separate device (SLOG) to increase performance. I guess you could liken the ZIL to write-ahead logging (WAL) in databases such as sqlite and Postgres.

Whether this would be a possible solution for Btrfs, I really don't know. I believe that in Btrfs, synchronous writes are stored in the tree log. Perhaps there is some opportunity to improve this. There are already the "preferred metadata" patches that allow one to choose different devices for data and metadata workloads.

You also mention logging. I cannot quite understand what you mean, but in Btrfs you can use the

Finally, @DaLiV, I think your message is rude and harsh. While Btrfs has many performance issues, being rude about it is not helpful to anyone.

As a last point: I cannot see that the script provided includes the writes on ZFS caused by the ZIL commit background thread when it moves data to final storage.
There is no new information here. This issue is as old as btrfs. https://ar5iv.labs.arxiv.org/html/1707.08514 has a good summary of the theory. They posit that it is impossible to fix the write amplification without creating a new issue somewhere else in the filesystem, such as storage amplification or lost data integrity. At best, you could design a filesystem where write amplification is reduced, but read amplification is increased as a consequence. The authors also posit that this is a better tradeoff for a filesystem on devices where reads are cheaper than writes, and they are probably right, but the only way a btrfs user can benefit from this information is to use it to select a replacement filesystem. Until that filesystem comes along, the write amplification is simply something we have to learn to live with.

btrfs does have support for nodatacow files, which reduces the write amplification, but enabling nodatacow requires turning off other btrfs features such as data integrity and snapshots. btrfs still has fairly large write amplification in this configuration because nodatacow only applies to data, so metadata updates like inode timestamps are still relatively large. The result is never faster than creating a separate block device and running ext4 or xfs on it--after every possible optimization is done, btrfs metadata is still an order of magnitude larger, extent for extent, than the metadata on ext4 or xfs, and it's simply more iops to push that much extra metadata out to the drive and back.
Additionally I can say that btrfs already has some faults with "fault tolerance" on power outages: "last revision in fs header is not matching the last in tree ...", which was also unsolvable; a full restore was needed. That means the FS is unstable after long periods without writes and kills drives with fsyncs, so btrfs metadata is very fragile. I can conclude that everything said here confirms that btrfs is badly built for non-rotational drives with limited rewrite counts.
Sadly we do not have many COW filesystems, and I hope this can be improved in some future version.
full sync, unmount, and export.
@DaLiV what is your goal with your posts? I don't see anything constructive in your remarks. Btrfs has many features other Linux filesystems do not, which to me are more valuable than the added cost of metadata updates.
Constructive: btrfs needs improvements that eliminate write amplification, which in fact kills a good part of the "pros" for "normal usage".
You can use

I'm not sure how that applies for other features, but assume that for the point in time of the snapshot you'd have access to the other related features at least; they're just not as useful in-between snapshots.

Regarding reflinks, I'm not sure if much has changed since 2021, but @Zygo and @Forza-tng were both involved in this 2021 mailing list discussion thread (long and full of technical details), while this one in 2022 mentions reflinks are supported for
That depends on what your priorities for a filesystem are. As you should be aware, BTRFS has features that other filesystems lack; there are some tradeoffs that come with that, so you'll want to weigh up which filesystem is appropriate for the context of the hardware available and the workload you need to support. You may be better served by F2FS or even EROFS for a storage device where writes are a concern. BTRFS can still work well if it has features you need, but it may be complemented by other solutions if it adds friction for some workload requirements you have regarding write activity, such as:
It depends on your workload requirements. If you want to run with datacow, you can, and depending on the context that may not be a concern for you. As mentioned above, you can still leverage snapshots, reflinks and deduplication with nodatacow. A DB should have a variety of settings for you to decide when to favor a feature available from either the filesystem or the DB, with the
Compression + Deduplication should minimize storage concerns?
I am a bit rusty on this, but an SSD has blocks of a larger size, each representing a group of pages/sectors, typically of the size you mentioned (although you can get larger, and physically it can be 4K pages while 512e firmware exposes a smaller size to the OS). I've also heard that some hardware internally dedupes pages, and it's not uncommon to have a faster write cache of some capacity in which the controller can optimize some writes at the hardware level, so actual wear is not necessarily as bad as you may assume with modern hardware.

Then there is the filesystem layer, which manages its own layer in a typically similar fashion. A write on an SSD can be spread across physical pages and blocks, while at the filesystem layer it is potentially treated as contiguous or fragmented across multiple extents. Each fragment has to be requested from the disk when reading a file, for example; that's where extra overhead can be introduced with regard to IOPS. You also have the kernel providing some generic, filesystem-agnostic features that can coalesce some I/O within a buffer to make it more efficient, with similar features offered by the filesystem, the disk controller hardware, and in some cases the application software too (like DBs/VMs). The small chunks you refer to can then potentially be optimized into operations that perform better. This was already touched on in earlier responses.

BTRFS doesn't have to be a hammer that you use for everything with the same settings throughout. You have many options available, either within BTRFS or by opting for an alternative filesystem when it serves your workload requirements better. Defaults cannot accommodate everyone; when it matters to you (performance / durability) you should be in a position to better understand what you're working with and the tunables available to you :)
tested nodatacow - not a "gamechanger"
mostly no - take "email" data as an example: every message is different, and a word/excel document with 1 letter changed inside becomes a "completely different" file, which makes it impossible to deduplicate them, so every one will have amplification. For big VM data that is "cloned" from a reference image - as already mentioned, a "single block overwrite" leads to the same results ... and "better use another FS" was suggested ... so for which kind of general-purpose use will this not be an issue? As I already mentioned, all of that is (SSD|NVME|FLASH)-related and not a problem for rotational HDDs (in the last case I mean CMR drives; how well SMR drives do is another question, as that depends wholly on HW specifics and is not for this topic).
Compression should make text content notably smaller than uncompressed. If the overhead you are concerned about is minimized so that it is more comparable to the size you see allocated on another filesystem (which likely lacks compression), what is the issue? It may even use less disk. Deduplication doesn't have to be paired with compression; it depends on your workload. AFAIK both features don't operate on the entire file, but on small blocks / extents. So you should still find that this can work well compared to a filesystem that lacks the feature.
Reflink copies will not use extra space; you share the extents. Then only new writes use disk space. This is available in BTRFS or in another filesystem like XFS, but others do not support it, so it really depends what you're comparing to. If BTRFS does not suit your workload needs, that's ok - you can choose another filesystem. BTRFS, like other filesystems, is not meant to be best in class for every workload. You choose it for the features it has available; performance and overhead concerns are not always the higher priority in choosing a filesystem, and if they are for you then another filesystem may meet your needs better :)
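To make the reflink point concrete, here is a small sketch (my own illustration, not from the comment above) using the standard Linux FICLONE ioctl, which is what `cp --reflink=always` uses; the ioctl number is the x86-64 value and the file names are hypothetical:

```python
import fcntl

FICLONE = 0x40049409  # ioctl number from linux/fs.h (x86-64)

def reflink_copy(src: str, dst: str) -> None:
    """Create a shared-extent copy: no data is duplicated until either file is modified."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        fcntl.ioctl(d.fileno(), FICLONE, s.fileno())

# Hypothetical example: clone a VM image without consuming extra space.
reflink_copy("base-image.qcow2", "clone-image.qcow2")
```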
When testing nodatacow vs datacow, remember that the inode update for the mtime timestamp will be far larger than the data in a single 16K write. If the test is 1000 random 16K writes on a nodatacow file (with no

Note that metadata update costs go up with the size of the tree, so these overheads are about 3x smaller on a 100 GiB filesystem vs. a 100 TiB one.

An effective way to reduce writes is to use datacow files, mount the filesystem with
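As a side note (my own sketch, not part of the comment above): a test of the kind described could look roughly like this, run once on a datacow file and once on a nodatacow file (`chattr +C` before any data is written), comparing device write counters from /proc/diskstats or the SMART host-writes attribute before and after each run. The mount point is hypothetical:

```python
import os
import random
import time

FILE_SIZE  = 1 << 30      # 1 GiB preallocated test file
WRITE_SIZE = 16 * 1024    # 16 KiB per write
N_WRITES   = 1000

def random_sync_writes(path: str) -> float:
    """Do N_WRITES random 16 KiB writes, each followed by fsync (the worst case for CoW)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, FILE_SIZE)
        payload = os.urandom(WRITE_SIZE)  # incompressible payload
        start = time.time()
        for _ in range(N_WRITES):
            offset = random.randrange(FILE_SIZE // WRITE_SIZE) * WRITE_SIZE
            os.pwrite(fd, payload, offset)
            os.fsync(fd)
        return time.time() - start
    finally:
        os.close(fd)

print("elapsed:", random_sync_writes("/mnt/btrfs-test/random-writes.bin"))
```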
btrfs does check for common device firmware bugs, so you will not get silent data corruption on power failures. The flip side of that is that btrfs has zero tolerance for device firmware bugs that affect its metadata integrity--btrfs can tell precisely when and where the device has lied about its data integrity, and knows when it cannot trust the device any more--so any failure causes btrfs to come to a very loud and complete stop. If the device fails in this way, it is usually necessary to fix the device (i.e. disable write cache, replace the device with a different vendor/model/firmware, or add a raid1 mirror drive with better firmware) before it is usable with btrfs. If you run a different filesystem on those devices, the devices will corrupt the data on power failure, and if the other filesystem doesn't have data csums, you won't know the data is corrupted unless an application tells you.
Nitpick: email data can be deduplicated with 4K granularity on btrfs, and there is a remarkable amount of duplication in real mail stores (thank Microsoft's block-oriented document formats for that--as long as the attachments are uncompressed). On the other hand, that fact doesn't help with write amplification in any way.

Deduplication definitely does not reduce writes on btrfs. The duplicated data must be written to disk first--if it's still in page cache, btrfs dedupe will first flush both copies of the data to disk, then compare and delete the deduplicated data in a separate metadata update. Only the total size is reduced. This is due to a somewhat literal interpretation of the requirements for the

Compression trades data size for metadata size and dramatically reduces the maximum size of each extent (i.e. it proportionally adds more metadata per byte of data stored). It would only help significantly if you're writing a lot of files, each file is compressible, and each file fits into a single compressed extent (i.e. 128K or less). That fits the profile of a source checkout and maybe a build--but a build's write workload might be over 90% metadata updates, and there's no way we're getting 90% compression to cancel that out.
The problem with write amplification is usually seen as a lower performance ceiling rather than early device failure. Users who don't hit the performance ceiling are not likely to hit the end of the device lifetime too early, as that would require writing at high rates for a long time. Modern consumer SSDs can handle hundreds of terabytes, if not petabytes, of writes before their warranty specs are exceeded--and they usually continue operating far in excess of that. The write amplification is not a concern for longevity unless we're hitting double-digit DWPD, or we're using a specialized device with very low endurance (i.e. a datacenter "boot" drive, which is optimized for cost and has extremely low write endurance) or a firmware bug (e.g. the short lives of Samsung 980 PRO devices). As a rule of thumb: if the workload requires
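To put the longevity argument in numbers (my own back-of-the-envelope sketch; the 600 TBW rating on a 1 TB drive, the 0.5 MB/s sustained application write rate, and the 6x amplification factor are all assumed values, not from the comment above):

```python
TBW_RATING_TB  = 600      # assumed warranty rating for a 1 TB consumer SSD
DRIVE_TB       = 1.0
APP_WRITE_MB_S = 0.5      # assumed sustained application write rate
AMPLIFICATION  = 6        # assumed total write amplification factor

device_bytes_per_sec = APP_WRITE_MB_S * 1e6 * AMPLIFICATION
dwpd  = device_bytes_per_sec * 86400 / (DRIVE_TB * 1e12)
years = TBW_RATING_TB * 1e12 / (device_bytes_per_sec * 86400 * 365)

print(f"{dwpd:.2f} DWPD")            # ~0.26 drive writes per day
print(f"{years:.1f} years to TBW")   # ~6.3 years before the warranty spec is reached
```

With these assumed numbers the drive sits far below double-digit DWPD, which is the point of the rule of thumb.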
However, an SSD also has SSD_Life_Left / Wear_Leveling_Count attributes, which still mean "replace me", and those are overwrite-count dependent: they keep going down even while Available_Reservd_Space stays the same, so they cannot be treated as anything other than "Pre-fail".
So what can you deduplicate?
Try to compress "already compressed" content like pdf, jpeg, docx, xlsx, which is the "mainline" of non-database workloads - so that too can be treated as a "mythical animal".
that I use ...
Maybe it is possible somehow to change the behaviour from "full-BTree-points-rebuild" to "blocks of double-linked lists", which would give a lower overwritten block count per write.
that going over many btree chunks? - can that be turned off, like noatime?
As far as empirical data goes:
Approximately 2 GiB saved by deduplication. An additional 3.3 GiB saved by compression. A total savings of 5.3 GiB, or 38%. Though deduplication is done after mail is received, so the compression savings on initial write are larger than 3.3 GiB.
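For what it's worth, working backwards from those numbers (assuming the 38% is relative to the logical, uncompressed and un-deduplicated size):

```python
dedup_gib    = 2.0
compress_gib = 3.3
saved_gib    = dedup_gib + compress_gib   # 5.3 GiB total savings
logical_gib  = saved_gib / 0.38           # ~13.9 GiB of logical mail data
on_disk_gib  = logical_gib - saved_gib    # ~8.6 GiB actually allocated
print(round(logical_gib, 1), round(on_disk_gib, 1))
```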
There's a number of ways to get there:
Note that the current behavior does one 160-200K write for a metadata tree update every 30 seconds. Your test might not reflect that if it does only one write during the test. The timing between writes affects the total amplification--there is write-combining even with

I wouldn't expect a drastic improvement from this except under very specific workloads (many files, a 16K write to each, no more than once every 30 seconds). In other cases (e.g. thousands of writes to a single VM or database file within 30 seconds), the improvement from eliding mtime will be a very small part of the overall write load. 400K per minute is 0.98 TiB TBW over 5 years--not even 0.5% of modern SSD endurance capacity, so you'd have to make that saving happen hundreds of times every minute of every day to increase wear lifetime by 50%.
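Checking that arithmetic (assuming 400 KiB of elided metadata writes per minute, continuously, for 5 years):

```python
kib_per_minute = 400
minutes_in_5y  = 60 * 24 * 365 * 5        # 2,628,000 minutes
tib_written    = kib_per_minute * minutes_in_5y / 2**30
print(f"{tib_written:.2f} TiB")           # ~0.98 TiB
```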
We find that attribute has no predictive value for device performance or failure within the first 5 years, so we ignore it. We run the drives until normalized wear is 100%, then we keep running the drives for years after that. Some of them fail, but there's no correlation with life_left when they do (age is a much better predictor). SSD failure and especially SD card failure often starts with silent data corruption, which makes btrfs + datacow + datasum essential for failure detection (or you have to exclude the bottom two thirds of the SSD market to get to devices that reliably report UNC errors instead of garbage data).
You seem to misunderstand--I'm speaking from experience of btrfs dedupe on email stores. It does require a mail store that separates headers and message bodies and stores the latter block-aligned (i.e. not mbox format). The message bodies are frequently byte-for-byte duplicates, and there's a dedupe hit every time someone CC's more than one of the users. It's not great performance--a dedicated mail store that can isolate attachments and delta-compress email threads is far more efficient than filesystem-level block dedupe--but it does work, and it can get a double-digit percentage reduction without changing the standard mail software. None of this is relevant against write amplification, since btrfs dedupe always increases total write counts compared to not using dedupe. This is why I annotated the dedupe correction with the "Nitpick" label - I corrected a statement that was not factually accurate about dedupe, but also not relevant to the topic, whether accurate or not. For dedupe to be relevant to reducing write amplification, a write-eliding dedupe implementation must be added to btrfs. The one started in 2014 was abandoned. You're welcome to make another attempt.
That sounds like a higher read amplification to pay for lower write amplification--which is exactly the tradeoff I pointed out at the start of the thread. Note that a double-linked list implies a minimum of three block updates, the same as a 3-level btrfs tree; however, a tree shares its interior nodes while a double-linked list does not. It might be better or worse than tree updates for multi-block updates depending on the details of the commit. It's worse for reads because a tree's search amplification can't be more than about 3, while a double-linked list's can be arbitrarily large (useful for journalling but not much else).

There is some opportunity to reduce the overhead a little in extent_tree_v2, but so far it looks like the minimal tree update is still 3 pages in 3 trees (144 KiB). e_t_v2 might drop additional updates over that.

We can also replace tree updates with a journal and update-in-place, which some researchers have tried, but without significantly reducing write amplification. The main problem is that we need to store uncommitted tree updates somewhere searchable; otherwise, we can't read the filesystem until the journal is finished updating. That means we either keep a big update tree in memory, which limits its size and adds latency to writes, or we have a mini-filesystem acting as a cache inside btrfs--and that cache eats the write amplification savings on the main filesystem tree.

bcachefs has an interesting alternative to the btrfs metadata update strategy. You might find it's easier to get to the filesystem you want by starting from bcachefs instead of btrfs.

[this part may be appropriate for a wider audience than this issue]

Lots of people have great ideas for how btrfs should be improved. That's great! Unfortunately, all existing developers are fully committed to either support work for bugs in existing features, or other new roadmap items--often for their paying customers, who naturally consider the issues they hired a developer for to be more important than anyone else's. Any new feature work will therefore require new developers to do it. Proposals are far more effective when you bring a developer with you (or you are one yourself)--after all, if you can't convince your own developer your idea is a good one, how could you convince the rest of the btrfs maintainers? Here's what you need:
If you need help to fill in the gaps then developers can answer questions about information you're missing, but they can't hold your hand while you do the whole thing. These are not enough:
You're welcome to attempt an improvement to btrfs. If you're successful, you'll be a hero. ;) |
if we compare BTRFS to any other FS - it is the most unfriendly to any non-HDD drives (everything that has a possible wear-out issue), as every small write causes amplification.
We must not forget that most operations are done in small chunks like 512 B or 4 KB.
if BTRFS is used for any important data:
SQL databases, VMs - write chunks must be synchronous and almost unbuffered (RT and data sync requirements).
if logging - buffering must be adequate, and mostly not just once per hour, so "1 MB" chunking is quite big; logging is also not the primary use for this FS.
Simple data - the average size of "business"-communication emails is 50 KB; invoices and word/excel documents are 30 KB.
what I found - the current overhead for each and every write to the drive is at least 160 KB
so the amplification is on average 6x for such files
in the case of databases - they often write short records, even below 512 B ... so the amplification there is 300x; what that means, everyone can decide for themselves.
VM = 4 KB write, plus ~200 KB overhead = amplification 50x
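A sketch of the arithmetic behind these figures, as I read them (the ~160 KB and ~200 KB per-write overheads are the claims from above, not measured here):

```python
def amplification(payload_kb: float, overhead_kb: float = 160) -> float:
    """Write amplification if every payload write drags along a fixed metadata overhead."""
    return (payload_kb + overhead_kb) / payload_kb

print(round(amplification(30)))       # ~6x for a 30 KB office document
print(round(amplification(0.5)))      # ~320x for a sub-512-byte database record (the "300x" case)
print(round(amplification(4, 200)))   # ~50x for a 4 KB VM write with ~200 KB overhead
```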
surely COW and so on will be mentioned ... so the comparison was done against zfs ... even there the amplification is not comparable.
everyone can measure the impact on their drives by doing their own tests
so if you have any dynamic data on that FS you will get rapidly increased wear.
sure, many people may take the position that resources don't matter and drives can be bought, replaced more often with newer ones, or expanded if a program is not optimised enough, but that must not serve as an excuse for design flaws if a possibility to improve exists.