fc fs notes
My goal for the fsync tree log was to make it just do the right thing most of the time. We mostly got there, thanks to a ton of fixes and test cases from Filipe.

fsync(some file) -- all the names for this file will exist, without having to fsync the directory.

fsync(some dir) -- ugh, don't fsync the directory. But if you do, all the files/subdirs will exist, and unlinks will be done and renames will be included. This is slow and may require a full FS commit, which is why we don't want dirs fsunk.

What ext4 does is this:

fsync(some file) -- for a newly created file, the filename that it was created under will exist. If the file has a hard link added, the hard link is not guaranteed to be written to disk.

fsync(some dir) -- all changes to file names in the directory will exist after the crash. It does *not* guarantee that any data changes to any of the files in the directory will persist after a crash.

It seems to me that it would be desirable if all of the major file systems have roughly the same minimum guarantee for fsync(2), so that application writers don't have to make file-system specific assumptions. In general the goal ought to be "the right thing" should happen.
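
A minimal sketch of what these guarantees mean for an application creating a new file. Going by the descriptions above, fsync on the file itself should be enough on both btrfs and ext4 for the name to survive; the optional directory fsync is only for filesystems where that is not assumed. The function names and the `sync_dir` parameter are made up for illustration.

```python
import os

def fsync_dir(dirpath):
    """Flush a directory's entries (file names) to disk."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def durable_create(path, data, sync_dir=False):
    """Create a file and fsync it; optionally fsync the parent directory too."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's buffer into the kernel
        os.fsync(f.fileno())  # file data and metadata
    if sync_dir:
        # Conservative extra step. Per the btrfs note above this is slow
        # there and should not be needed for a newly created file.
        fsync_dir(os.path.dirname(os.path.abspath(path)))
```

Leaving `sync_dir` off by default follows the "don't fsync the directory" advice; turning it on trades speed for not depending on any one filesystem's behavior.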

The reason why ext4 doesn't sync all possible hard link names is that (a) that's not a common requirement for most applications, and (b) it's too hard to find all of the directories which might contain a hard link to a particular file. But otherwise, the semantics seem to largely match up with what Chris has suggested for btrfs.
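
For the hard-link case, a sketch of what an application could do on ext4 if it does care about the new name: since fsync on a directory is described above as persisting all name changes in that directory, fsync the directory that received the link. `durable_link` is a hypothetical helper name, not an existing API.

```python
import os

def durable_link(src, dst):
    """Add a hard link and fsync the directory that now contains it."""
    os.link(src, dst)
    parent = os.path.dirname(os.path.abspath(dst))
    fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(fd)  # persists the name change in `parent`
    finally:
        os.close(fd)
```

On btrfs, by the description above, fsync on the file already covers all of its names, so the directory fsync here would be the slow and unnecessary path.
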
https://lore.kernel.org/lkml/[email protected]/ NTFS Read-Write driver from Paragon
From: Zygo Blaxell @ 2020-08-28 4:36 UTC (permalink / raw) To: Eric Wong; +Cc: kreijack, linux-btrfs

> > Note that add/remove is orders of magnitude slower than replace. Replace might take hours or even a day or two on a huge spinning drive. Add/remove might take months, though if you have 8-year-old disks then it's probably a few days, weeks at most.

> Btw, any explanation or profiling done on why remove is so much slower than replace? Especially since btrfs raid1 ought to be fairly mature at this point (and I run recent stable kernels).

They do different things.

Replace just computes the contents of the filesystem the same way scrub does: except for the occasional metadata seek, it runs at wire speeds because it reads blocks in order from one disk and writes in order on the other disk, 99.999% of the time.

Remove makes a copy of every extent, updates every reference to the extent, then deletes the original extents. It is very seek-heavy, including seeks between reads and writes on the same drive, and the work is roughly proportional to the number of reflinks, so dedupe and snapshots push the cost up. About the only advantage of remove (and balance) is that it consists of 95% existing btrfs read and write code, and it can handle any relocation that does not require changing the size or content of an extent (including all possible conversions).

Arguably this isn't necessary. Remove could copy a complete block group, the same way replace does but to a different offset on each drive, and simply update the chunk tree with the new location of the block group at the end. Trouble is, nobody's implemented this approach in btrfs yet. It would be a whole new code path with its very own new bugs to fix.
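
A back-of-envelope model of why the two operations diverge, as a rough sketch only. It assumes replace is bounded by sequential throughput and remove by one seek per operation (copy, one update per reference, delete). Both functions and every number fed to them are hypothetical, not measurements or anything btrfs computes.

```python
def replace_seconds(used_bytes, seq_bytes_per_s):
    """Replace: in-order read and write, so time ~ bytes / throughput."""
    return used_bytes / seq_bytes_per_s

def remove_seconds(n_extents, refs_per_extent, seek_s):
    """Remove: per extent, one copy, one update per reference, one delete.

    Each operation is charged one seek, which is a deliberate simplification.
    """
    ops = n_extents * (1 + refs_per_extent + 1)
    return ops * seek_s

# Hypothetical inputs: 4 TB used, 150 MB/s sequential,
# 50 million extents, 3 references each, 10 ms per seek.
replace_est = replace_seconds(4e12, 150e6)
remove_est = remove_seconds(50_000_000, 3, 0.01)
```

In this model, doubling `refs_per_extent` (more snapshots or dedupe) raises the remove estimate but leaves the replace estimate unchanged, which is the point made in the quoted explanation.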

https://lore.kernel.org/linux-btrfs/[email protected]/T/#m3d45dd2d29692650a7b76e13e1819edf87455e05
From: Zygo Blaxell @ 2020-08-28 20:56 UTC (permalink / raw) To: Andrei Borzenkov; +Cc: Eric Wong, kreijack, linux-btrfs

> > Replace just computes the contents of the filesystem the same way scrub does: except for the occasional metadata seek, it runs at wire speeds because it reads blocks in order from one disk and writes in order on the other disk, 99.999% of the time.

> Does it write them to the same absolute disk locations? IOW - is it possible to use smaller disk for replace or it must be at least as large as original disk?

Replace writes data to the locations recorded in the chunk tree, i.e. the original disk locations on the missing disk.

In theory, you can resize the offline disk to be smaller than the replacement disk, then run btrfs replace. In practice, only some of the methods work (e.g. you must specify device ID and not device name when replacing) and only on recent kernel versions.

btrfs dev remove is equivalent to 'btrfs fi resize <devid>:0' followed by 'remove empty device <devid>', so the performance will be very similar for the portion of the data that is resized; however, a combination of resize and replace is still much faster than device remove, which does it the slow way for all of the data.
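
The resize-then-replace sequence can be sketched as the two command lines it would involve. This only builds the argument lists and does not run anything. The helper name and all example values (device ID, size, target device, mount point) are hypothetical, and whether the sequence works depends on the kernel version, as noted above.

```python
def shrink_then_replace_cmds(devid, new_size, target_dev, mountpoint):
    """Return argv lists for: shrink device `devid`, then replace it.

    The device is addressed by ID, not by name, as the note above requires.
    """
    resize = ["btrfs", "filesystem", "resize", f"{devid}:{new_size}", mountpoint]
    replace = ["btrfs", "replace", "start", str(devid), target_dev, mountpoint]
    return [resize, replace]

# Hypothetical values for illustration only.
cmds = shrink_then_replace_cmds(2, "900G", "/dev/sdX", "/mnt")
```

Each list could be passed to `subprocess.run` after reviewing it; keeping the two steps as data makes it easy to print and check them first.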

https://lore.kernel.org/linux-btrfs/[email protected]/T/#m1f2cdf6f67b4361329d98699d1a163c69f9e7ac1
Extent-like Performance from a UNIX File System
An Implementation of a Log-Structured File System for UNIX
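A plan that has not been implemented yet… The difficulty with built-in encryption in btrfs is that per-file or per-subvolume encryption cannot be made orthogonal to reflink. btrfs has only one extent tree, so under the current design encryption can only cover the whole pool; it cannot be applied to just some subvolumes. Whole-pool encryption has no real advantage over the existing dm-crypt approach.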