
I have an 8 TB disk attached via USB3 and formatted with three ext3 partitions, which I use as a backup drive (it's plugged into a SATA cradle).

The disk has been attached and mounted for several days and, apart from a backup I made a couple of days ago, has not been explicitly written to.

I happened to take a look at dmesg and spotted the following (this is filtered to show only entries matching the disk name, sdg):

[393945.628890] EXT4-fs (sdg2): error count since last fsck: 4
[393945.628894] EXT4-fs (sdg2): initial error at time 1589268773: ext4_validate_block_bitmap:406
[393945.628897] EXT4-fs (sdg2): last error at time 1589336019: ext4_validate_block_bitmap:406
[394076.698059] EXT4-fs (sdg1): error count since last fsck: 103
[394076.698063] EXT4-fs (sdg1): initial error at time 1589216157: ext4_validate_block_bitmap:406
[394076.698066] EXT4-fs (sdg1): last error at time 1589372294: ext4_lookup:1590: inode 186081476

I've not run fsck on this disk since it was partitioned and formatted. Given that fsck has not been run, what is finding these errors, and how concerned should I be?

When I rebooted the system this morning I checked dmesg again and found (again filtered to show only entries matching sdg):

[  261.721822] sd 9:0:0:0: [sdg] Spinning up disk...
[  274.051062] sd 9:0:0:0: [sdg] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[  274.051065] sd 9:0:0:0: [sdg] 4096-byte physical blocks
[  274.051137] sd 9:0:0:0: [sdg] Write Protect is off
[  274.051140] sd 9:0:0:0: [sdg] Mode Sense: 43 00 00 00
[  274.051297] sd 9:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  274.051498] sd 9:0:0:0: [sdg] Optimal transfer size 33553920 bytes not a multiple of physical block size (4096 bytes)
[  274.134309]  sdg: sdg1 sdg2 sdg3
[  274.135296] sd 9:0:0:0: [sdg] Attached SCSI disk
[  274.654835] EXT4-fs (sdg3): mounting ext3 file system using the ext4 subsystem
[  274.696860] EXT4-fs (sdg3): warning: mounting fs with errors, running e2fsck is recommended
[  274.766709] EXT4-fs (sdg1): mounting ext3 file system using the ext4 subsystem
[  274.795109] EXT4-fs (sdg1): warning: mounting fs with errors, running e2fsck is recommended
[  274.825210] EXT4-fs (sdg2): mounting ext3 file system using the ext4 subsystem
[  274.891191] EXT4-fs (sdg2): warning: mounting fs with errors, running e2fsck is recommended
[  275.713323] EXT4-fs (sdg2): mounted filesystem with ordered data mode. Opts: (null)
[  276.460528] EXT4-fs (sdg3): mounted filesystem with ordered data mode. Opts: (null)
[  276.499085] EXT4-fs (sdg1): mounted filesystem with ordered data mode. Opts: (null)
[  578.549827] EXT4-fs (sdg1): error count since last fsck: 103
[  578.549830] EXT4-fs (sdg1): initial error at time 1589216157: ext4_validate_block_bitmap:406
[  578.549832] EXT4-fs (sdg1): last error at time 1589372294: ext4_lookup:1590: inode 186081476
[  578.549836] EXT4-fs (sdg3): error count since last fsck: 47
[  578.549837] EXT4-fs (sdg3): initial error at time 1589268525: htree_dirblock_to_tree:1022: inode 31604737: block 126419458
[  578.549840] EXT4-fs (sdg3): last error at time 1589380312: ext4_lookup:1594: inode 33701921
[  578.549844] EXT4-fs (sdg2): error count since last fsck: 4
[  578.549845] EXT4-fs (sdg2): initial error at time 1589268773: ext4_validate_block_bitmap:406
[  578.549847] EXT4-fs (sdg2): last error at time 1589336019: ext4_validate_block_bitmap:406
[  639.938843] EXT4-fs (sdg1): mounting ext3 file system using the ext4 subsystem
[  640.950738] EXT4-fs (sdg1): mounted filesystem with ordered data mode. Opts: (null)
[  650.900006] EXT4-fs (sdg2): mounting ext3 file system using the ext4 subsystem
[  651.207658] EXT4-fs (sdg2): mounted filesystem with ordered data mode. Opts: (null)
[  658.836040] EXT4-fs (sdg3): mounting ext3 file system using the ext4 subsystem
[  659.084558] EXT4-fs (sdg3): mounted filesystem with ordered data mode. Opts: (null)

So the system knows there are errors and has still mounted the disk, without displaying any warning beyond the entries in dmesg.
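
As far as I can tell, this is expected behaviour: ext3/ext4 filesystems carry a per-filesystem error-behaviour setting, and the default (continue) just logs the problem and carries on mounting. For reference, a sketch of how to inspect and tighten that setting, using my device names:

# show what the filesystem is set to do when it hits an error
sudo tune2fs -l /dev/sdg2 | grep -i 'errors behavior'

# make the kernel remount the filesystem read-only as soon as an
# error is detected, instead of carrying on
sudo tune2fs -e remount-ro /dev/sdg2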

Roughly 30 minutes later, now curious, I checked again and found:

[  955.353027] EXT4-fs (sdg2): error count since last fsck: 3248
[  955.353031] EXT4-fs (sdg2): initial error at time 1589268773: ext4_validate_block_bitmap:406
[  955.353033] EXT4-fs (sdg2): last error at time 1589437923: ext4_map_blocks:604: inode 103686210: block 1947002998
[  955.353039] EXT4-fs (sdg1): error count since last fsck: 103
[  955.353040] EXT4-fs (sdg1): initial error at time 1589216157: ext4_validate_block_bitmap:406
[  955.353042] EXT4-fs (sdg1): last error at time 1589372294: ext4_lookup:1590: inode 186081476
[  956.751484] EXT4-fs error (device sdg2): ext4_map_blocks:604: inode #103686210: block 1947002998: comm updatedb.mlocat: lblock 12 mapped to illegal pblock 1947002998 (length 1)
[  956.767496] EXT4-fs error (device sdg2): ext4_map_blocks:604: inode #103686210: block 1947002998: comm updatedb.mlocat: lblock 12 mapped to illegal pblock 1947002998 (length 1)
[  956.782683] EXT4-fs warning (device sdg2): htree_dirblock_to_tree:994: inode #103686210: lblock 12: comm updatedb.mlocat: error -117 reading directory block

Eeek! The error count has increased for sdg2!

Again, I've not explicitly written to the disk in all this time.

Before partitioning and formatting the drive with gparted, I used fsck to run a bad-block scan (it took several days) and no errors were found. This is also a new disk. For these reasons I'm reasonably confident that the hardware is good.
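
For reference, the scan was along these lines (-cc runs badblocks in non-destructive read-write mode via fsck, and -C0 prints progress):

# force a full check plus a non-destructive read-write bad-block scan
sudo fsck.ext3 -f -cc -C0 /dev/sdg1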

What is possibly going on here? How worried should I be about the integrity of filesystems on this disk? What should my next steps be?

dsnowdon
  • You should always be worried when you're seeing disk errors; there should be none. Check SMART, check whether the connection is reliable (loose cables etc.), and copy important data to another disk. Once the data is safe you can run fsck. – gronostaj May 14 '20 at 07:28
  • *"This is also a new disk. For this reason, I'm reasonably confident that the hardware is good."* -- Your confidence is misplaced. Mechanical shock to a "new" drive has the same effect regardless of age. – sawdust May 14 '20 at 07:53
  • As @Sawdust said, new disks being bad is surprisingly common – davidgo May 14 '20 at 08:16
  • I'm pretty sure a filesystem check is run while Linux is booting up. – vssher May 14 '20 at 08:25
  • @sawdust I did also say that I'd run a bad block scan (sudo fsck.ext3 -f -cc -C0 /dev/sdg1) which had shown no errors – dsnowdon May 14 '20 at 13:04
  • So how many days ago did you run that test on a *different* partition/filesystem? So those are *old* test results. Why not perform a new test if you are concerned? FWIW a *"SATA cradle"* should only be used IMO for temporary use, i.e. not continuous. I use isolation mounting; see https://superuser.com/questions/1510488/solution-for-mounting-hard-drive-better-on-desk-to-prevent-sata-cable-wear/1510530#1510530 – sawdust May 14 '20 at 21:47
  • I was only using the cradle long enough to make a backup and test it. The disk stayed in there for a few days. I'm going to try replacing the USB3 cradle with a drive-bay mount with a SATA connection and see if I get better results. – dsnowdon May 16 '20 at 14:08
  • @sawdust it was the same disk, but yes, it was before I re-partitioned it, so that result was a few days old. You're right that it would not hurt to re-run it – dsnowdon May 16 '20 at 14:10

1 Answer


The errors appear in the kernel dmesg log because they were encountered as the filesystem was being used. It is as simple as that.

Mounting the disk in the first place requires reading certain information from the filesystem, and if that information is corrupt or inconsistent, it will be noted in the kernel dmesg log.
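
The periodic "error count since last fsck" messages come from fields that ext4 records in the filesystem superblock. You can read those fields directly with dumpe2fs from e2fsprogs; a sketch, assuming your device names and a reasonably recent e2fsprogs:

# print only the superblock and pick out the recorded error fields
# (expect fields such as "FS Error count" and "First error time")
sudo dumpe2fs -h /dev/sdg2 | grep -i error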

Unfortunately, disks can degrade over time even when left unused and never written to. This can happen for any number of reasons: mechanical or electrical faults, shock damage, ambient magnetism, heat stress, or even background radiation. Disks simply don't last forever, even if not used.
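
Given that, it is worth asking the drive itself how healthy it thinks it is before trusting it further. A minimal sketch with smartmontools; note that some USB-to-SATA bridges need the -d sat hint before they will pass SMART commands through:

# overall health, attributes, and the drive's own error log
sudo smartctl -a /dev/sdg

# if the bridge doesn't pass SMART through, force SAT translation
sudo smartctl -d sat -a /dev/sdg

# optionally start a long self-test (runs in the background on the drive)
sudo smartctl -t long /dev/sdg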

I would suggest you recover everything you can from that disk as soon as possible. Even then, you may find that some files are corrupted, but fingers crossed.
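
One cautious way to go about that (a sketch; GNU ddrescue is my suggestion rather than anything from your logs, and the image/map paths are placeholders) is to image each partition onto a known-good disk first, and only run e2fsck once the data is safe:

# unmount first so nothing else touches the filesystems
sudo umount /dev/sdg1 /dev/sdg2 /dev/sdg3

# copy everything readable, retrying bad areas 3 times; the map file
# lets ddrescue resume later without re-reading the good areas
sudo ddrescue -d -r3 /dev/sdg2 /mnt/safe/sdg2.img /mnt/safe/sdg2.map

# once the data is copied off, check and repair the filesystem
# (repeat for the other partitions)
sudo e2fsck -f /dev/sdg2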

Ben XO
  • I think the moral of the story is also not to run backups using a USB3 cradle on my desk. Since I switched to a SATA-connected caddyless tray in a drive bay I've not seen any errors (yet) after backing up 4.5TiB of data. – dsnowdon May 19 '20 at 09:24
  • Unrelated: there's no reason why USB and/or a cradle would, in and of themselves, cause drive errors. A new disk may still have errors; manufacturing isn't perfect. Who knows what happened to it in the supply chain? – antgel Aug 14 '21 at 10:40
  • @antgel it can happen if you have a bad cable, a bad power supply, or anything else that makes communication with the device flaky. It's not common though. – Ben XO Aug 17 '21 at 10:43