Forwarded from farseerfc 😂
When writing, plan two copies in memory: one is a pure mirror, the other is raid5/6. The latter is delayed a little until a full stripe has been accumulated before it is written, and once that copy lands on disk the mirror copy can be thrown away. If nothing else forces a flush in the meantime, all of this happens in memory and the data only has to land on disk once.
Forwarded from farseerfc 😂
As I understand it, what bcachefs wants to do here is issue two writes concurrently: R1 gets a stripe to itself, while R2 is merged with other writes into a shared stripe. Once R2 has a complete stripe it can be written out, and at that point the R1 write can be cancelled; whether R1 ever reaches disk depends on the IO scheduler.
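A rough toy model of that write path in Python (the class, stripe width, and flush behaviour are my own illustration of the idea described above, not bcachefs code):

    # Toy model of the write path described above: each write is planned both
    # as a mirror copy (R1) and as part of a shared raid5/6 stripe (R2).
    # If the stripe fills while everything is still in memory, only the stripe
    # copy reaches disk and R1 is dropped; an early flush persists R1 instead.

    STRIPE_BLOCKS = 4  # data blocks per stripe (hypothetical width)

    class ToyWritePath:
        def __init__(self):
            self.pending_mirror = []   # R1 copies not yet committed or cancelled
            self.pending_stripe = []   # R2 blocks waiting for a full stripe
            self.on_disk = []          # what actually reached stable storage

        def write(self, block):
            # Plan both copies in memory; neither has touched disk yet.
            self.pending_mirror.append(block)
            self.pending_stripe.append(block)
            if len(self.pending_stripe) == STRIPE_BLOCKS:
                self._commit_stripe()

        def _commit_stripe(self):
            # Full stripe: write it as raid5/6 and drop the mirror copies of
            # the blocks it contains -- they never need to reach disk.
            stripe = self.pending_stripe[:STRIPE_BLOCKS]
            self.on_disk.append(("raid5_stripe", tuple(stripe)))
            self.pending_mirror = [b for b in self.pending_mirror if b not in stripe]
            del self.pending_stripe[:STRIPE_BLOCKS]

        def flush(self):
            # Something forced a flush before the stripe filled: the mirror (R1)
            # copies must be persisted now.  Those blocks may land twice -- once
            # mirrored here, once again when their stripe completes -- after
            # which the mirror copy can be discarded.
            for block in self.pending_mirror:
                self.on_disk.append(("mirror", block))
            self.pending_mirror.clear()

    wp = ToyWritePath()
    for b in ["a", "b", "c", "d"]:
        wp.write(b)
    print(wp.on_disk)   # [('raid5_stripe', ('a', 'b', 'c', 'd'))] -- one copy on disk
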
Current bugs with operational impact on btrfs raid5

The following is a list of bugs that are causing repeated operational
issues on RAID5 arrays on btrfs. Confirmed recently on 5.4.41.

I've given the bugs some short labels because they are very specific and
very similar, but distinct in critical ways, e.g. parity-update-failure
and read-repair-failure are nearly identical except one requires writes,
while the other occurs under lab test conditions when no writes occur.

These bugs occur in degraded mode:

Name: spurious-degraded-read-failure

Report: https://lore.kernel.org/linux-btrfs/[email protected]/
"Spurious read errors in btrfs raid5 degraded mode"

Summary: file reads in degraded mode sometimes spuriously fail
with csum errors and EIO. The remaining data and parity blocks
are correct on surviving disks, but the kernel sometimes cannot
reconstruct the missing data blocks while in degraded mode.
Read errors stop once the array exits degraded mode. Only data
in non-full block groups is affected (most likely also non-full
raid stripes).

Impact: applications do not respond well to random files having
EIO on read. 'btrfs device remove', 'btrfs balance', and 'btrfs
resize' abort frequently due to the read failures and are not
usable to return the array to non-degraded mode. As far as I
can tell, no data is lost in raid5 data block groups, even if
that data is written in degraded mode. If this bug occurs in
raid5 metadata block groups, it will make the filesystem unusable
(frequently forced read-only) until this bug is fixed.

Workaround: Use 'btrfs replace' to move array to non-degraded
mode before attempting balance/delete/resize operations.
Stop applications to avoid spurious read failures until the
replace is completed. Never use raid5 for metadata.

Name: btrfs-replace-lockup

Report: https://lore.kernel.org/linux-btrfs/[email protected]/
"btrfs raid5 hangs at the end of 'btrfs replace'"

Summary: 'btrfs replace' sometimes hangs just before it is
finished.

Impact: reboot required to complete device replace.

Workaround: None known.

Name: btrfs-replace-wrong-state-after-exit

Report: https://lore.kernel.org/linux-btrfs/[email protected]/
"btrfs raid5 hangs at the end of 'btrfs replace'"

Summary: the 'btrfs replace' state is not fully cleared on exit.
The replace kernel thread exits, 'btrfs replace status' reports
the replace is complete, but later resize and balance operations
fail with "resize/replace/balance in progress" until the
kernel is rebooted.

Impact: (another) reboot required to complete device replace.

Workaround: None known.

These bugs occur when all disks are online but one is silently corrupted:
Name: parity-update-failure

Report: https://www.spinics.net/lists/linux-btrfs/msg100178.html
"RAID5/6 permanent corruption of metadata and data extents"

Summary: if a non-degraded raid stripe contains a corrupted
data block, and a write to a different data block updates the
parity block in the same raid stripe, the updated parity block
will be computed using the corrupted data block instead of the
original uncorrupted data block, making later recovery of the
corrupted data block impossible in either non-degraded mode or
degraded mode.

Impact: writes on a btrfs raid5 with repairable corrupt data can
in some cases make the corrupted data permanently unrepairable.
If raid5 metadata is used, this bug may destroy the filesystem.

Workaround: Frequent scrubs. Never use raid5 for metadata.
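
To make the failure mode above concrete, here is a minimal sketch using single-byte "blocks" and XOR parity (values and names are made up; this is not the btrfs code path):

    # Single-byte "blocks" and XOR parity, purely to illustrate the failure mode.
    d1_good, d2 = 0x11, 0x22          # original data blocks in one raid5 stripe
    parity = d1_good ^ d2             # parity as originally written

    d1_corrupt = 0x99                 # d1 is silently corrupted on disk

    # While parity still reflects the *original* d1, the corruption is
    # repairable: reconstruct d1 from parity and the other data block.
    assert parity ^ d2 == d1_good

    # Now a write updates d2.  The buggy path reads the corrupted d1 off disk
    # and uses it to recompute parity, instead of reconstructing the original
    # d1 first.
    d2_new = 0x33
    parity = d1_corrupt ^ d2_new      # parity now agrees with the corrupted block

    # Reconstruction can no longer produce the original data: the stripe is
    # self-consistent around the wrong value, so csum verification of d1 keeps
    # failing and there is no redundant copy left to repair from.
    assert parity ^ d2_new == d1_corrupt
    assert parity ^ d2_new != d1_good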

Name: read-repair-failure

Report: https://www.spinics.net/lists/linux-btrfs/msg94590.html
"RAID5 fails to correct correctable errors, makes them
uncorrectable instead"

Summary: if a non-degraded raid stripe contains a corrupted data
block, under unknown conditions a read can update the parity
block in the raid stripe using the corrupted data block instead
of the original uncorrupted data block, making later recovery
of the corrupted data block impossible in either non-degraded
mode or degraded mode.

Impact: reads on a btrfs raid5 with repairable corrupt data
can in some cases make the corrupted data permanently
unrepairable. If raid5 metadata is used, this bug may
destroy the filesystem.

Workaround: Frequent scrubs. Never use raid5 for metadata.

Name: scrub-wrong-error-types

Report: https://lore.kernel.org/linux-btrfs/[email protected]
"RAID5 fails to correct correctable errors, makes them uncorrectable instead"

Summary: scrub on raid5 data sometimes reports read errors instead
of csum errors when the only errors present on the underlying
disk are csum errors.

Impact: false positives are included in the read error count
after a scrub. It can be difficult or impossible to correctly
identify which disk is failing.

Workaround: none known.

Name: scrub-wrong-error-devices

Report: https://lore.kernel.org/linux-btrfs/[email protected]
"RAID5 fails to correct correctable errors, makes them uncorrectable instead"

Summary: scrub on raid5 data cannot reliably determine the
failing disk when there is a mismatch between computed parity
and the parity block on disk, and some of the data blocks in the
raid stripe do not have csums (e.g. free blocks or nodatacow
file blocks). This cannot be fixed with the current on-disk
format because the necessary information (csums for free and
nodatasum blocks to identify parity corruption by elimination,
or csum on the parity block itself) is not available.

Impact: parity block corruptions on one disk are reported in
scrub error counts as csum errors distributed across all disks.
It can be difficult or impossible to correctly identify which
disk is failing.

Workaround: none known.
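
A toy sketch of the "identify parity corruption by elimination" argument above (made-up values; not the actual scrub code):

    def blame_device(data_blocks, stored_parity):
        """data_blocks: list of (value, csum_ok) where csum_ok is True, False,
        or None (no csum at all, e.g. free space or a nodatacow block)."""
        computed = 0
        for value, _ in data_blocks:
            computed ^= value
        if computed == stored_parity:
            return "no mismatch"
        bad = [i for i, (_, ok) in enumerate(data_blocks) if ok is False]
        if bad:
            return f"data block(s) {bad} (csum failure)"
        unknown = [i for i, (_, ok) in enumerate(data_blocks) if ok is None]
        if not unknown:
            return "parity device (by elimination: every data block csum-verified)"
        return f"ambiguous: parity or uncsummed data block(s) {unknown}"

    # Every data block carries a good csum: the mismatch must be the parity device.
    print(blame_device([(0x11, True), (0x22, True)], stored_parity=0x55))

    # One block has no csum (free space / nodatacow): blame cannot be assigned,
    # and the current on-disk format has no information that could resolve it.
    print(blame_device([(0x11, True), (0x22, None)], stored_parity=0x55))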

No list of raid5 bugs would be complete without:

Name: parity-raid-write-hole

Report: numerous, it's probably the most famous btrfs raid5 bug,
and the one inexperienced users blame most often for data losses.

Impact: negligible. It occurs orders of magnitude less often
and destroys orders of magnitude less data each time it occurs
compared to the above bugs.

Workaround: Don't worry about write hole yet. The other bugs
will ruin your day first.
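
For comparison, a toy timeline of the classic write hole, showing why it needs both an interrupted stripe update and a later device loss in the same stripe before the damage is repaired (made-up values; not btrfs code):

    # Toy write-hole timeline: XOR parity, single-byte blocks.
    d1, d2 = 0x11, 0x22
    parity = d1 ^ d2                  # consistent stripe across three devices

    # Step 1: a crash interrupts an update of d2; the new data block reaches
    # disk but the matching parity write does not.
    d2 = 0x44                         # new data landed; parity is now stale

    # With all devices present this is survivable: d1 and d2 are read directly.
    # Step 2: the device holding d1 dies before parity is made consistent again.
    reconstructed_d1 = parity ^ d2    # rebuilt from stale parity
    assert reconstructed_d1 != 0x11   # garbage: the original d1 is gone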

https://lore.kernel.org/linux-btrfs/[email protected]/
fc fs筆記
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969495#10
extundelete works by looking at the journal and trying to find old metadata blocks left over from earlier transactions in the jbd2 journal file. It's not part of e2fsprogs, but its own separate package.

The extundelete program is a massive abstraction violation, and whether or not it works is essentially an accident. The ext4 developers don't consider themselves bound by any kind of guarantees that extundelete will continue to work in the future. We aren't going to deliberately break it, but if we add new features to make ext4 more flexible or robust (which would be the case with the metadata checksum feature), and extundelete happens to break, our reaction will be:

¯\_(ツ)_/¯

My suggestion is that you use regular backups and/or userspace solutions such as the trash-cli package, which implements the Freedesktop.org Trash Can specification:
fc fs筆記
JRTipton_ReFS_v2.pdf
ReFS v2: Cloning, Projecting, and Moving Data

File systems are fundamentally about wrapping abstractions around data: files are really just named data blocks. ReFS v2 presents just a couple new abstractions that open up greater control for applications and virtualization.

We'll cover block projection and cloning as well as in-line data tiering. Block projection makes it easy to efficiently build simple concepts like file splitting and copying as well as more complex ones like efficient VM snapshots. Inline data tiering brings efficient data tiering to virtualization and OLTP workloads.
https://lore.kernel.org/linux-btrfs/[email protected]/T/

How robust is BTRFS?

This is a testimony from a BTRFS-user.
For a little more than 6 months, I had my server running on BTRFS.
My setup was several RAID-10 partitions.
As my server was located on a remote island and I was about to leave, I just added two more hard disks to make sure that the risk of failure would be minimal. Now I had four WD10JFCX drives on the EspressoBIN server running Ubuntu Bionic Beaver.

Before I left, I *had* noticed some beep-like sounds coming from one of the drives, but it seemed OK, so I didn't bother with it.

So I left, and 6 months later, I noticed that one of my 'partitions' was failing, so I thought I might go back and replace the failing drive. The journey takes 6 hours.

When I arrived, I noticed more beep-like sounds than when I left half a year earlier.
But I was impressed that my server was still running.

I decided to make a backup and re-format all drives, etc.

The drives were added one by one, and I noticed that when I added the third drive, I again started hearing that sound I disliked so much.

After replacing the port-multiplier, I didn't notice any difference.

"The power supply!" I thought.. Though it's a 3A PSU and should easily handle four 2.5" WS10JFCX drives, it could be that the specs were possibly a little decorated, so I found myself a MeanWell IRM-60-5ST supply and used that instead.

Still the same noise.

I then investigated all the cables; lo and behold, silly me had used a cheap Chinese pigtail for the barrel connector, and its wires were so incredibly thin that they could not carry the current, so the voltage dropped further with every drive I added.

I re-did my power cables and then everything worked well.

...

After correcting the problem, I got curious and listed the statistics for each partition.
I had more than 100,000 read/write errors PER DAY for 6 months.
That's around 18 million read/write errors, caused by drives turning on and off "randomly".

AND ALL MY FILES WERE INTACT.

This borders on being impossible.

I believe that no other file system would be able to survive such conditions.
-And the developers of this file system really should know what torture it's been through without failing.
Yes, all files were intact. I tested all the files that I had backed up 6 months earlier against those that were on the drives; there were no differences - they were binary identical.