
BTRFS - most unfriendly to ssd/nvme/sd-cards in cmp to any of other FS. Write Amplification #760

Open
DaLiV opened this issue Mar 20, 2024 · 13 comments
Labels
question Not a bug, clarifications, undocumented behaviour

Comments

@DaLiV

DaLiV commented Mar 20, 2024

Compared to any other FS, BTRFS is the most unfriendly to non-HDD drives (anything with a wear-out limit), because every small write is amplified.
We must not forget that most operations are done in small chunks such as 512 B or 4 KiB.
If BTRFS is used for any important data:
SQL databases, VMs - writes must be synchronous and almost unbuffered (real-time and data-sync requirements).
Logging - buffering must be adequate, and certainly not once per hour, so a 1 MB chunk size is quite large; logging is also not the primary use case for this FS.
Simple data - the average size of "business" communication is about 50 KB per email and about 30 KB per invoice or Word/Excel document.
What I found is that the current overhead for every single write to the drive is at least 160 KB,
so the average amplification for such data is about 6x.

Databases often write short records even below 512 B, which gives an amplification of 300x - what that means, everyone can decide for themselves.
VM = 4 KB write plus ~200 KB of overhead = 50x amplification.

Surely COW and so on will be mentioned, so the comparison was also done against ZFS ... even there the amplification is not comparable.

Everyone can measure the impact on their drives by doing their own tests.
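For example, a minimal way to run such a test yourself (a sketch only; /dev/sdX and /mnt/test are placeholders for the device and mount point under test):

  # field 7 of /sys/block/<dev>/stat is sectors written, in 512-byte units
  before=$(awk '{print $7}' /sys/block/sdX/stat)
  dd if=/dev/urandom of=/mnt/test/file bs=4k count=1 conv=fsync
  after=$(awk '{print $7}' /sys/block/sdX/stat)
  echo "device bytes written: $(( (after - before) * 512 ))"

Comparing that figure with the amount of data actually written gives the amplification factor.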

So if you have any dynamic data on this FS you will get rapidly increased wear.
Sure, many people may take the view that resources don't matter and can be bought, replaced more often with newer ones, or expanded if a program is not optimised enough, but that must not become an excuse for design flaws if a possibility to improve exists.

@DaLiV DaLiV changed the title BTRFS - most unfriendly to ssd to any of other FS. Write Amplification BTRFS - most unfriendly to ssd/nvme/sd-cards to any of other FS. Write Amplification Mar 20, 2024
@DaLiV DaLiV changed the title BTRFS - most unfriendly to ssd/nvme/sd-cards to any of other FS. Write Amplification BTRFS - most unfriendly to ssd/nvme/sd-cards in cmp to any of other FS. Write Amplification Mar 20, 2024
@mjt0k

mjt0k commented Mar 21, 2024

btrfs isn't really suitable for (large) databases or VMs. Some databases (eg postgres) when run on btrfs, explicitly switch data files to nodatacow mode, also virt-manager turns on nodatacow for VM images. Nodatacow helps, but in my opinion it is better to use a different filesystem for such stuff (eg, xfs has reflinks too and is generally well-suited for this kind of task). btrfs is much better for general-purpose usage. FWIW.
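(For anyone landing here: a minimal sketch of doing this by hand; the path is only an example, and the attribute must be set while the directory is still empty, since it only affects newly created files.)

  mkdir -p /var/lib/postgresql
  chattr +C /var/lib/postgresql   # new files created inside inherit NOCOW
  lsattr -d /var/lib/postgresql   # should show the 'C' attribute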

@Forza-tng
Contributor

Forza-tng commented Mar 21, 2024

@DaLiV

As you clearly found out, one of the worst-case scenarios for the current Btrfs design is an fsync after each small write.

On ZFS, which is also a COW filesystem, this specific situation is handled via the ZFS Intent Log (ZIL). Synchronous writes are first stored in the ZIL, then a background task moves the data into permanent storage. ZFS can also store the ZIL on a separate device (SLOG) to further increase performance.

I guess you could liken the ZIL to write-ahead logging (WAL) in databases such as SQLite and Postgres.

Whether this would be a possible solution for Btrfs, I really don't know. I believe that in Btrfs, synchronous writes are stored in the tree log. Perhaps there is some opportunity to improve this. There are already the "preferred metadata" patches that allow one to choose different devices for data and metadata workloads.

You also mention logging. I cannot quite understand what you mean, but in Btrfs you can use the flushoncommit mount option to ensure transactions do not span more than one commit, which is usually 30 seconds. The commit frequency can also be set using commit=.
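A minimal sketch of what that looks like (device, mount point and interval are placeholders):

  mount -o flushoncommit,commit=120 /dev/sdX1 /mnt/data
  # or the equivalent /etc/fstab line:
  # /dev/sdX1  /mnt/data  btrfs  flushoncommit,commit=120  0 0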

Finally, @DaLiV, I think your message is rude and harsh. While Btrfs has many performance issues, being rude about it is not helpful to anyone.

As a last point: I cannot see that the script provided includes the writes on ZFS caused by the ZIL commit background thread when it moves data to final storage.

@Zygo

Zygo commented Mar 21, 2024

There is no new information here. This issue is as old as btrfs.

https://ar5iv.labs.arxiv.org/html/1707.08514 has a good summary of the theory. They posit that it is impossible to fix the write amplification without creating a new issue somewhere else in the filesystem, such as storage amplification or lost data integrity. At best, you could design a filesystem where write amplification is reduced, but read amplification is increased as a consequence. The authors also posit that this is a better tradeoff for a filesystem on devices where reads are cheaper than writes, and they are probably right, but the only way a btrfs user can benefit from this information is to use it to select a replacement filesystem. Until that filesystem comes along, the write amplification is simply something we have to learn to live with.

btrfs does have support for nodatacow files, but enabling nodatacow requires turning off other btrfs features such as data integrity and snapshots, which reduces the write amplification at the expense of data integrity. btrfs still has fairly large write multiplication in this configuration because nodatacow only applies to data, so metadata updates like inode timestamps are still relatively large. The result is never faster than creating a separate block device and running ext4 or xfs on it--after every possible optimization is done, btrfs metadata is still an order of magnitude larger, extent for extent, than the metadata on ext4 or xfs, and it's simply more iops to push that much extra metadata out to the drive and back.

@DaLiV
Author

DaLiV commented Mar 21, 2024

btrfs is much better for general-purpose usage. FWIW.
Exactly - even general-purpose use leads to SSD wear-out ...

Btrfs design is an fsync after each small write
Without sync you cannot determine what exactly is going on ... and sync or not, writing 512 bytes to a file once a minute will lead to the same result: many GB per month with quite a low amount of actual change in the files ...

Additionally I can say that I have already had failures of the "fault tolerance" during power outages ... "last revision in fs header is not matching to the last in tree ...", which was also unsolvable; a full restore was needed ... that means the FS is unstable over long idle periods and kills drives with fsyncs ... so btrfs metadata is very fragile ...

So I can conclude that everything said here confirms that btrfs is badly built for non-rotational drives with limited rewrite counts.

your message is rude and harsh
When someone offers criticism, that is not rudeness, especially if there are possibilities to improve.
Rude was losing 2 months of data in a simple "power outage", and 3 weeks in another case (even weekly snapshots did not help) ... simple outages that any other FS (ext2/3, xfs) silently survives with minor problems. Yes, I use BTRFS on production servers (with HDD and hybrid), but my trust is approaching the state "lost".

Sadly we do not have many COW filesystems, and I hope this can be improved in some future version.

As a last point: I cannot see that the script provided includes the writes on ZFS caused by the ZIL commit background thread when it moves data to final storage.

full sync, unmount, and export.

@Forza-tng
Contributor

Forza-tng commented Mar 21, 2024

@DaLiV what is your goal with your posts? I don't see anything constructive in your remarks.

Btrfs has many features other Linux filesystems do not, which to me are more valuable than the added cost of metadata updates.

@DaLiV
Author

DaLiV commented Mar 21, 2024

Constructive: btrfs needs improvements that eliminate write amplification, which in fact kills a good part of the "pros" for "normal usage".
And tools for "internal recovery" (but that is another topic) ...
But it seems write amplification is already a known issue with status "won't / can't fix" - in that case the issue can be closed.
I know the "pros" of btrfs ... but sadly further migration of other servers to SSD and btrfs is blocked for me, as they are not practical / rational ...

@polarathene

btrfs does have support for nodatacow files, but enabling nodatacow requires turning off other btrfs features such as data integrity and snapshots, which reduces the write amplification at the expense of data integrity.

You can use nodatacow and snapshots feature still. The data is temporarily CoW at snapshot time, then back to nodatacow.

I'm not sure how that applies to other features, but I assume that for the point in time of the snapshot you'd at least have access to the other related features; they're just not as useful in between snapshots.

Regarding reflinks, I'm not sure if much has changed since 2021, but @Zygo and @Forza-tng were both involved in this 2021 mailing-list discussion thread (long and full of technical details), while this one from 2022 mentions that reflinks are supported for nodatacow but that cp support was at fault compatibility-wise (the workaround was to create the destination file as a stub in advance and set +C prior to the cp command). That last link also notes that nodatacow has the same requirement for deduplication support. EDIT: the BTRFS docs on reflink support relay the same information, so it is likely still relevant.
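Roughly, the workaround described there looks like this (a sketch only; paths are placeholders and the exact behaviour may depend on the cp version):

  touch /mnt/data/clone.img        # create the destination as an empty stub
  chattr +C /mnt/data/clone.img    # mark it NOCOW while it is still empty
  cp --reflink=always /mnt/data/base.img /mnt/data/clone.img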


btrfs is badly built for non-rotational drives with limited rewrite-counts.

That's dependent on what your priorities for a filesystem are.

As you should be aware, BTRFS has features that other filesystems lack; there are some tradeoffs that come with that, so you'll want to weigh up which filesystem is appropriate for the context of the hardware available and the workload you need to support.

You may be better served by F2FS or even EROFS for a storage device where writes are a concern. BTRFS can still work well if it has features you need, but it may be complemented by other solutions if it adds friction for workload requirements you have around write activity, such as:

  • Bind mount a different FS where appropriate
  • In some cases a write cache with periodic syncing to a backing filesystem may be appropriate (a minimal sketch follows this list):
    • Like anything-sync-daemon does with tmpfs / overlayfs and rsync.
    • Here is a separate example for /var/log to minimize writes on an RPi.
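A minimal sketch of that idea (paths, size and schedule are examples only):

  # /etc/fstab: keep hot /var/log writes in RAM
  tmpfs  /var/log  tmpfs  defaults,noatime,size=64m  0 0
  # periodic job (e.g. hourly cron) syncing logs back to persistent storage
  rsync -a --delete /var/log/ /persistent/log/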

btrfs isn't really suitable for (large) databases or VMs. Some databases (eg postgres) when run on btrfs, explicitly switch data files to nodatacow mode, also virt-manager turns on nodatacow for VM images.

It depends on your workload requirements. If you want to run with datacow, you can, and depending on the context it may not be a concern for you.

As mentioned above, you can still leverage snapshots, reflinks and deduplication with nodatacow. A DB should have a variety of settings that let you decide when to favor a feature offered by the filesystem versus the DB; with the fsync concern and Postgres, for example, you can tune that - it's not all or nothing. Similar to how the kernel's in-memory file cache / buffer can be tuned for how often it flushes writes to disk, VM disks have comparable settings for how writes from the guest reach the host's backing storage.
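As an illustration, a sketch of the kind of Postgres knobs involved (values are examples, not recommendations):

  # postgresql.conf
  synchronous_commit = off    # a crash may lose the last few transactions
  wal_writer_delay = 2000ms   # batch WAL flushes instead of flushing per commit
  # full_page_writes = off    # sometimes suggested on checksumming COW storage;
  #                           # only with good backups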


Simple data - the average size of "business" communication is about 50 KB per email and about 30 KB per invoice or Word/Excel document.
What I found is that the current overhead for every single write to the drive is at least 160 KB

Compression + Deduplication should minimize storage concerns?

We must not forget that most operations are done in small chunks such as 512 B or 4 KiB.

I am a bit rusty on this, but an SSD has erase blocks of a larger size that group together pages/sectors, typically of the sizes you mentioned (although you can get larger ones, and physically a drive can use 4K pages while the firmware exposes 512-byte sectors to the OS via 512e emulation).

I've also heard that some hardware internally dedupes pages, and it's not uncommon to have a faster write cache whose controller can optimize some writes at the hardware level, so actual wear is not necessarily as bad as you might assume with modern hardware.

Then there is the filesystem layer, which manages its own mapping in a similar fashion. A write destined for an SSD can be spread across physical pages and blocks, while at the filesystem layer it may be treated as contiguous or as fragmented across multiple extents. Each fragment becomes a separate request to the disk when a file is read, for example, and that is where extra overhead can be introduced in terms of IOPS.

You also have the kernel providing some generic filesystem agnostic features that can coalesce some I/O within a buffer to make that more efficient, with similar offered by the filesystem, disk controller hardware and in some cases the application software too (like DBs/VMs).

Those small chunks you refer to can then be possibly optimized into operations that perform better. This was already touched on in earlier responses.


btrfs needs improvements that eliminate write amplification, which in fact kills a good part of the "pros" for "normal usage".
sadly further migration of other servers to SSD and btrfs is blocked for me, as they are not practical / rational

BTRFS doesn't have to be a hammer that you use for everything with the same settings throughout. You have many options available, either within BTRFS or by opting for an alternative filesystem when it serves your workload requirements better.

Defaults cannot accommodate everyone; when it matters to you (performance / durability), you should be in a position to understand what you're working with and the tunables available to you :)

@kdave kdave added the question Not a bug, clarifications, undocumented behaviour label Mar 22, 2024
@DaLiV
Author

DaLiV commented Mar 29, 2024

You can use nodatacow and snapshots feature still. The data is temporarily CoW at snapshot time, then back to nodatacow.

tested nodatacow - not a "gamechanger"
nodatacow - 184K for 16K - overhead 168K
loop0 0.02 0.03 1.18 4.12 3574 150097 524289 = btrfs 172 cp.4k.01
loop0 0.02 0.03 1.19 4.12 3574 150809 524289 = btrfs 184 cp.16k.1
datacow - 216K for 16K - overhead 200K
loop0 0.01 0.03 0.05 2.06 3620 5748 262144 = btrfs 204 cp.4k.01
loop0 0.01 0.03 0.05 2.06 3620 6588 262144 = btrfs 216 cp.16k.1

Compression + Deduplication should minimize storage concerns?

Mostly no - take "email" data as an example: every message is different, and a Word/Excel document with one changed letter inside becomes a "completely different" file, which makes it impossible to deduplicate them, so every one will be amplified. For big VM images "cloned" from a reference, it was already mentioned that a single block overwrite leads to the same result ... and the suggestion was "better use another FS" ... so for which type of general-purpose use will this not be an issue?

As I already mentioned, all of this is (SSD|NVMe|flash)-related and not a problem for rotational HDDs (by the latter I mean CMR drives; how well SMR drives fare is another question, as that depends wholly on hardware specifics and is not for this topic).

@polarathene

polarathene commented Mar 29, 2024

Mostly no - take "email" data as an example: every message is different, and a Word/Excel document with one changed letter inside becomes a "completely different" file, which makes it impossible to deduplicate them, so every one will be amplified.

Compression should make text content notably smaller than uncompressed storage. If the overhead you are concerned about is minimized so that it is more comparable to the size you see allocated on another filesystem (which likely lacks compression), what is the issue? It may even end up using less disk.

Deduplication doesn't have to be paired with compression; it depends on your workload. AFAIK neither feature operates on the entire file, but on small blocks / extents. So you should still find that this can work well compared to a filesystem that lacks the feature.

which are "cloned" from reference - that was mentioned also "single block overwrite" leads to the same results ... and suggested to "better other FS" ... so for which type of general-purpose that will be not an issue ?

Reflink copies will not use extra space, you share the extents. Then only new writes use disk space. This is available in BTRFS or another filesystem like XFS, but others do not support it, so it really depends what you're comparing to.


If BTRFS does not suit your workload needs, that's OK - you can choose another filesystem. BTRFS, like any filesystem, is not meant to be best in class for every workload. You choose it for the features it offers; performance and overhead are not always the highest priority when choosing a filesystem, and if they are for you, another filesystem may meet your needs better :)

@Zygo

Zygo commented Mar 30, 2024

tested nodatacow - not a "gamechanger" [168K vs 200K]

When testing nodatacow vs datacow, remember that the inode update for the mtime timestamp will be far larger than the data in a single 16K write. If the test is 1000 random 16K writes on a nodatacow file (with no fsync between, e.g. fdatasync, or simply wait for the transaction commit) then the overhead will be somewhat lower per write. If the test is "one 16K write, then fsync" then nodatacow only saves the extent tree and csum tree updates (about 16K each--pretty close to the 32K you observed) and you pay full price for the inode timestamp.
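If anyone wants to reproduce that comparison, a minimal fio sketch (directory, size and block size are placeholders):

  fio --name=fsync-each --directory=/mnt/test --ioengine=psync \
      --rw=randwrite --bs=16k --size=64m --fsync=1      # fsync after every write
  fio --name=batched --directory=/mnt/test --ioengine=psync \
      --rw=randwrite --bs=16k --size=64m --end_fsync=1  # sync only at the end

Pair it with the device-level write counters mentioned earlier in the thread to see the amplification difference.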

Note that metadata update costs go up with the size of the tree, so these overheads are about 3x smaller on a 100 GiB filesystem vs. a 100 TiB one.

An effective way to reduce writes is to use datacow files, mount the filesystem with flushoncommit and a long commit interval, and disable fsync/fdatasync in applications. Assuming we don't run out of memory, this batches up all changes and writes them out in a single burst which atomically updates the data up to the instant the commit starts (it's critical to disable all use of fsync in this setup, or zero the log before mounting, or the fsynced writes will appear out of order). This has much less overhead per write than fsync, with the caveat that the entire filesystem rewinds to an earlier state after a crash. It behaves like async_commit in databases--you get all the transactions up to a point, and none after that--so it's only usable if the application can accept the tradeoff. Also we get data csums this way, so when our cheap SSD starts fading bits, we can detect it in our daily scrub.
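A sketch of the application side of that setup (names are placeholders; only appropriate when the workload tolerates rewinding to the last commit):

  # discard a stale fsync log before mounting, if fsync was ever used
  btrfs rescue zero-log /dev/sdX1
  # run the application with fsync/fdatasync turned into no-ops via libeatmydata
  eatmydata myapp --data-dir /mnt/data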

last revision in fs header is not matching to the last in tree

btrfs does check for common device firmware bugs, so you will not get silent data corruption on power failures. The flip side of that is that btrfs has zero tolerance for device firmware bugs that affect its metadata integrity--btrfs can tell precisely when and where the device has lied about its data integrity, and knows when it cannot trust the device any more--so any failure causes btrfs to come to a very loud and complete stop. If the device fails in this way, it is usually necessary to fix the device (i.e. disable write cache, replace the device with a different vendor/model/firmware, or add a raid1 mirror drive with better firmware) before it is usable with btrfs.

If you run a different filesystem on those devices, the devices will corrupt the data on power failure, and if the other filesystem doesn't have data csums, you won't know the data is corrupted unless an application tells you.

take as example "email" data - every of them is diffirent, or word/excel dociment with changed 1 letter inside gives "completely another" file

Nitpick: email data can be deduplicated with 4K granularity on btrfs, and there is a remarkable amount of duplication in real mail stores (thank Microsoft's block-oriented document formats for that--as long as the attachments are uncompressed). On the other hand, that fact doesn't help with write amplification in any way.

Deduplication definitely does not reduce writes on btrfs. The duplicated data must be written to disk first--if it's still in page cache, btrfs dedupe will first flush both copies of the data to disk, then compare and delete the deduplicated data in a separate metadata update. Only the total size is reduced. This is due to a somewhat literal interpretation of the requirements for the dedupe_file_range ioctl, and some details involving the boundary between the page cache and the filesystem (basically skipping the flush would require VFS to support copy-on-write pages, and VFS still does not support that despite several unsuccessful attempts so far to change it).

Compression trades data size for metadata size and dramatically reduces the maximum size of each extent (i.e. it proportionally adds more metadata per byte of data stored). It would only help significantly if you're writing a lot of files, each file is compressible, and each file fits into a single compressed extent (i.e. 128K or less). That fits the profile of a source checkout and maybe a build--but a build's write workload might be over 90% metadata updates, and there's no way we're getting 90% compression to cancel that out.

so for which type of general-purpose use will this not be an issue?

The problem with write amplification is usually seen as a lower performance ceiling rather than early device failure. Users who don't hit the performance ceiling are not likely to hit the end of the device lifetime too early as that will require writing at high rates for a long time.

Modern consumer SSDs can handle hundreds of terabytes, if not petabytes of writes, before their warranty specs are exceeded--and they usually continue operating far in excess of that. The write amplification is not a concern for longevity unless we're hitting double-digit DWPD, or we're using a specialized device with very low endurance (i.e. a datacenter "boot" drive, which is optimized for cost and has extremely low write endurance) or a firmware bug (e.g. the short lives of Samsung 980 PRO devices).

As a rule of thumb: if the workload requires fsync to work, or if write latency is an issue, and you don't need data integrity, compression, dedupe, or snapshots, then ext4 or xfs might be a better fit for that application than btrfs (or zfs or bcachefs). You can make a btrfs subvolume and make it nodatacow for /var/lib/postgresql about as easily as you can mkfs an xfs filesystem on another LV and mount it on /var/lib/postgresql--both are one line in /etc/fstab.
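For concreteness, a sketch of those two one-liners (devices, labels and subvolume names are placeholders):

  # /etc/fstab, option A: separate xfs LV for the database
  /dev/vg0/pgdata  /var/lib/postgresql  xfs    noatime                0 0
  # /etc/fstab, option B: dedicated btrfs subvolume, made NOCOW while still empty
  LABEL=tank       /var/lib/postgresql  btrfs  subvol=pgdata,noatime  0 0
  # (for option B, run chattr +C /var/lib/postgresql before any data is written)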

@DaLiV
Author

DaLiV commented Mar 30, 2024

Modern consumer SSDs can handle hundreds of terabytes

However, an SSD also has SSD_Life_Left / Wear_Leveling_Count attributes, which still mean "replace me" and which are overwrite-count dependent: they keep going down even while "Available_Reservd_Space" stays the same, and they cannot be treated as anything other than "Pre-fail".

Deduplication

So what can you deduplicate?
Copies of emails? No - the same email will have different headers, which are not only not block-boundary aligned but even have different lengths, so no identical block content will be seen even in blocks 3, 4, 5...

  • That is only a theoretical possibility, not implemented in "real time" for copies created by external processes rather than by something like "cp --reflink" ... so it seems myth-like, and it currently again writes big metadata tree chunks per file.

Compression

Try to compress "already compressed" data such as PDF, JPEG, DOCX, XLSX, which is the mainstay of non-database workloads, so this too can be treated as a "mythical animal".

snapshots

Those I use ...
and therefore only 2 filesystems are currently in the running - BTRFS and ZFS, as both are more relaxed than LVM+(any filesystem) in terms of space organisation; ZFS-related issues are not for this place to discuss.
That is why I chose BTRFS back when the data was on HDDs ... and nobody cared about "Media Wearout" issues then.

metadata update costs go up with the size of the tree

Maybe it is possible to somehow change the behaviour from a "full B-tree path rebuild" to "blocks of doubly linked lists", which would mean overwriting a lower block count per write.

"full price for the inode timestamp"

Does that go across many btree chunks? Can it be turned off, like noatime?
Maybe a new mode like "cow-relaxed" or an extended nodatacow is possible? Logically it seems like a good fit: COW would be inactive in that case, and only the first change of a block relative to any referencing snapshot would need COW; afterwards the block could be unlocked for in-place updates.
If yes, then it seems there is a point here where a "possibility for improvement for drastically reducing write amplification" exists.

@Forza-tng
Contributor

Forza-tng commented Mar 30, 2024

So what can you deduplicate?
Copies of emails? No - the same email will have different headers, which are not only not block-boundary aligned but even have different lengths, so no identical block content will be seen even in blocks 3, 4, 5...

That is only a theoretical possibility, not implemented in "real time" for copies created by external processes rather than by something like "cp --reflink" ... so it seems myth-like, and it currently again writes big metadata tree chunks per file.
Compression

Try to compress "already compressed" data such as PDF, JPEG, DOCX, XLSX, which is the mainstay of non-database workloads, so this too can be treated as a "mythical animal".

As far as empirical data goes:

❯ compsize /var/spool/mail/
Processed 282036 files, 322899 regular extents (349783 refs), 21993 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       62%      6.7G          10G          12G
none       100%       31M          31M          35M
zstd        62%      6.7G          10G          12G

Approximately 2GiB saved by deduplication. An additional 3.3GiB saved by compression. A total savings of 5.3GiB, or 38%.

Note that deduplication is done after mail is received, so the compression savings on the initial write are larger than 3.3 GiB.
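For reference, a sketch of how such after-the-fact deduplication can be run (path and hashfile are examples; duperemove is just one of several available tools):

  duperemove -rd --hashfile=/var/tmp/mail.hash /var/spool/mail/
  # -r = recurse, -d = actually submit the dedupe requests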

@Zygo

Zygo commented Mar 30, 2024

"full price for the inode timestamp"

Does that go across many btree chunks? Can it be turned off, like noatime?
Maybe a new mode like "cow-relaxed" or an extended nodatacow is possible? Logically it seems like a good fit: COW would be inactive in that case, and only the first change of a block relative to any referencing snapshot would need COW; afterwards the block could be unlocked for in-place updates.
If yes, then it seems there is a point here where a "possibility for improvement for drastically reducing write amplification" exists.

There's a number of ways to get there:

  • O_NOMTIME / nomtime mount option / no-mtime fsattr / etc. Extensions that eliminate mtime updates from the workload for specific files or mount points. To have the desired effect, they would also have to remove updates to the NFS generation field, which would make caching unreliable for NFS clients, but that's a reasonable documented limitation. I hear about these proposals from time to time, but I don't think they've been merged into mainline yet? (I can't find them in man pages for mount or chattr).
  • Use lazytime (and get it implemented for btrfs). Would do an inode update on the device once a day, whenever the inode is evicted from cache, or whenever the file size changes.
  • Use fdatasync, which does not force the metadata update (it happens only once per commit interval). This is not very different from the current behavior--fsync doesn't do a subvol tree update either, it writes the updated inode to the log tree, and the following commit updates the metadata tree (so the update is written twice with fsync, but the fsync write is packed with other metadata in a single page). A rough illustration follows this list.
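As a rough illustration of the per-write sync variants at the command level (a sketch using dd as a stand-in for an application; path, sizes and counts are placeholders):

  dd if=/dev/zero of=/mnt/test/f bs=16k count=100 oflag=dsync  # O_DSYNC, fdatasync-like
  dd if=/dev/zero of=/mnt/test/f bs=16k count=100 oflag=sync   # O_SYNC, fsync-like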

Note that the current behavior does one 160-200K write for metadata tree update every 30 seconds. Your test might not reflect that if it does only one write during the test. The timing between writes affects the total amplification--there is write-combining even with fsync.

I wouldn't expect a drastic improvement from this except under very specific workloads (many files, 16K write to each, no more than once every 30 seconds). In other cases (e.g. thousands of writes to a single VM or database file within 30 seconds), the improvement from eliding mtime will be a very small part of the overall write load. 400K per minute is 0.98 TB TBW over 5 years--not even 0.5% of modern SSD endurance capacity, so you'd have to make that saving happen hundreds of times every minute of every day to increase wear lifetime by 50%.

However, an SSD also has SSD_Life_Left / Wear_Leveling_Count attributes

We find that attribute has no predictive value for device performance or failure within the first 5 years, so we ignore it. We run the drives until normalized wear is 100%, then we keep running the drives for years after that. Some of them fail, but there's no correlation with life_left when they do (age is a much better predictor).

SSD failure and especially SD card failure often starts with silent data corruption, which makes btrfs + datacow + datasum essential for failure detection (or you have to exclude the bottom two thirds of the SSD market to get to devices that reliably report UNC errors instead of garbage data).

Copies of emails? No - the same email will have different headers, which are not only not block-boundary aligned but even have different lengths, so no identical block content will be seen even in blocks 3, 4, 5...

You seem to misunderstand--I'm speaking from experience of btrfs dedupe on email stores. It does require a mail store that separates headers and message bodies and stores the latter block-aligned (i.e. not mbox format). The message bodies are frequently byte-for-byte duplicates, and there's a dedupe hit every time someone CC's more than one of the users. It's not great performance--a dedicated mail store that can isolate attachments and delta-compress email threads is far more efficient than filesystem-level block dedupe--but it does work, and it can get a double-digit percentage reduction without changing the standard mail software.

None of this is relevant against write amplification, since btrfs dedupe always increases total write counts compared to not using dedupe. This is why I annotated the dedupe correction with the "Nitpick" label - I corrected a statement that was not factually accurate about dedupe, but also not relevant to the topic, whether accurate or not.

For dedupe to be relevant to reducing write amplification, a write-eliding dedupe implementation must be added to btrfs. The one started in 2014 was abandoned. You're welcome to make another attempt.

Maybe it is possible to somehow change the behaviour from a "full B-tree path rebuild" to "blocks of doubly linked lists", which would mean overwriting a lower block count per write

That sounds like a higher read amplification to pay for lower write amplification--which is exactly the tradeoff I pointed out at the start of the thread.

Note that a double-linked list implies a minimum of three block updates, the same as a 3-level btrfs tree; however, a tree shares its interior nodes while a double-linked list does not. It might be better or worse than tree updates for multi-block updates depending on the details of the commit. It's worse for reads because a tree's search amplification can't be more than about 3, while a double-linked list can be arbitrarily large (useful for journalling but not much else).

There is some opportunity to reduce the overhead a little in extent_tree_v2, but so far it looks like the minimal tree update is still 3 pages in 3 trees (144 KiB). e_t_v2 might drop additional updates over that.

We can also replace tree updates with a journal and update-in-place, which some researchers have tried, but without significantly reducing write amplification. The main problem is that we need to store uncommitted tree updates somewhere searchable; otherwise, we can't read the filesystem until the journal is finished updating. That means we keep a big update tree in memory, which limits its size and adds latency to writes, or we have a mini-filesystem acting as a cache inside btrfs--and that cache eats the write amplification savings on the main filesystem tree.

bcachefs has an interesting alternative to the btrfs metadata update strategy. You might find it's easier to get to the filesystem you want by starting from bcachefs instead of btrfs.


[this part may be appropriate for a wider audience than this issue]

Lots of people have great ideas for how btrfs should be improved. That's great! Unfortunately, all existing developers are fully committed to either support work for bugs in existing features, or other new roadmap items--often for their paying customers, who naturally consider the issues they hired a developer for to be more important than anyone else's.

Any new feature work will therefore require new developers to do it. Proposals are far more effective when you bring a developer with you (or you are one yourself)--after all, if you can't convince your own developer your idea is a good one, how could you convince the rest of the btrfs maintainers?

Here's what you need:

  1. A thorough understanding of how btrfs currently works at the device level, so that you understand the constraints btrfs must satisfy. This will allow you to skip over all variations of your proposal that are insane on btrfs. (Or it will convince you to start over with a new filesystem built from scratch--which is also acceptable!)
  2. A detailed technical description of how your idea would get from the btrfs we have to the btrfs you want.
  3. Testable code that implements (2). Alternatively, an accurate model simulation showing how (2) would behave in various use cases (an example of this might be an academic paper based on btrfs, or a profiler report showing clearly avoidable bottlenecks), but code is definitely preferred.
  4. Test results from (3) demonstrating that the code from (3) is worth maintainer effort to merge, and user effort to switch to.
  5. A summary of the trade-offs, if any (e.g. the read-amplification vs write-amplification trade-off)

If you need help to fill in the gaps then developers can answer questions about information you're missing, but they can't hold your hand while you do the whole thing.

These are not enough:

  1. Mere restatements of the known problems. btrfs is slow, it has high write amplification, it's not very resilient against write cache failures, the recovery tools suck....yes, we know. We already accommodate all of those issues for every single btrfs filesystem we deploy to and support in production.
  2. Vague suggestions to copy performance features from completely different filesystems without understanding why they make sense on the other filesystems, or might not make sense on btrfs. A writeback cache might speed this up? Go ahead and try it--we'd love to know if that's true, where you put the cache, how you solved various problems, and under what circumstances performance went up or down. Another filesystem performs better on a benchmark? That's very likely, as other filesystems are completely different from btrfs inside, and accepted different trade-offs and compromises. Catching up to some other filesystem's performance is new feature work, it's only a bug when the "other filesystem" is an older version of btrfs.
  3. Testable code proves you brought a developer with you to do the work you propose. Some people are better at theory than at practice, so not everyone brings code, but the theorists rigorously document their theories to the point where they are testable.
  4. Test results prove your idea works, so the rest of us don't have to speculate about why it doesn't. We can discuss all day our guesses about whether a double-linked list update is faster than a tree update in average and limit cases, but data from a working implementation puts that discussion to an end in a constructive way.

You're welcome to attempt an improvement to btrfs. If you're successful, you'll be a hero. ;)
