smartctl causing errors in dmesg with SAS SSD drives

2023-06-18

For some time, I've observed weird behavior with several SAS SSD drives. Whenever I checked their info using 'smartctl -a /dev/sdX', an error message appeared in dmesg:

[  510.392322] sd 11:0:9:0: [sdk] tag#3990 Sense Key : Recovered Error [current]
[  510.392336] sd 11:0:9:0: [sdk] tag#3990 Add. Sense: Grown defect list not found

Finally, I decided to investigate.

After a little googling, I discovered that SCSI drives (and therefore SAS drives) keep two defect lists indicating which disk sectors have gone bad. The first list - "Primary defect list" - is created when the drive is manufactured and lists all defects present in the drive when it was made. The second list - "Grown defect list" - is the more useful one, where all the new defects (bad sectors) are appended.

This all made sense until SSD drives were introduced. Since their internal structure is different from HDDs (there are no physical 'sectors'), it doesn't make much sense to use this list. But the command for getting the list is still in the specification, so vendors had to choose what to do when a drive is asked for the list. Some vendors decided that the drive should simply tell the truth and report that there is no grown defect list. That is honest, but it also causes the kernel to log an error. What's worse, it can also increase the drive's "non-medium error count" every time the disk is queried! That looks to me like a little oversight.
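To make the two dmesg lines above a bit more concrete, here is a minimal sketch of how that condition can be recognized in raw SCSI sense data. It assumes the standard sense layouts (fixed format with response code 0x70/0x71, descriptor format with 0x72/0x73) and the standard codes: sense key 0x1 is "Recovered Error", and ASC 0x1c with ASCQ 0x02 is "Grown defect list not found". The names 'SenseInfo', 'decode_sense' and 'is_grown_defect_list_not_found' are made up for this illustration and are not part of the kernel or smartmontools.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical holder for the three sense fields we care about.
struct SenseInfo {
    bool valid;         // false if the buffer is too short or the format is unknown
    std::uint8_t key;   // sense key (0x1 = Recovered Error)
    std::uint8_t asc;   // additional sense code
    std::uint8_t ascq;  // additional sense code qualifier
};

// Extract sense key, ASC and ASCQ from fixed or descriptor format sense data.
SenseInfo decode_sense(const std::uint8_t* buf, std::size_t len) {
    assert(buf != nullptr);
    SenseInfo info{false, 0, 0, 0};
    if (len < 1)
        return info;
    const std::uint8_t resp = buf[0] & 0x7f;  // response code
    if ((resp == 0x70 || resp == 0x71) && len >= 14) {
        // Fixed format: key in byte 2 (low nibble), ASC/ASCQ in bytes 12/13.
        info.valid = true;
        info.key = static_cast<std::uint8_t>(buf[2] & 0x0f);
        info.asc = buf[12];
        info.ascq = buf[13];
    } else if ((resp == 0x72 || resp == 0x73) && len >= 4) {
        // Descriptor format: key in byte 1 (low nibble), ASC/ASCQ in bytes 2/3.
        info.valid = true;
        info.key = static_cast<std::uint8_t>(buf[1] & 0x0f);
        info.asc = buf[2];
        info.ascq = buf[3];
    }
    return info;
}

// True for the condition seen in dmesg: Recovered Error + Grown defect list not found.
bool is_grown_defect_list_not_found(const SenseInfo& s) {
    return s.valid && s.key == 0x1 && s.asc == 0x1c && s.ascq == 0x02;
}
```

Note that ASCQ 0x01 under the same ASC would mean the primary list is missing instead, so the check above deliberately matches only the grown list case.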

There is a proposed 'hot-fix'[1] one can apply to smartmontools (which smartctl is a part of), but it simply exits the function that reads the defect list, regardless of whether the drive has the list or not. But what if I want to check an HDD? In that case, I could miss some important info!

I decided to add a little check for whether the drive is an SSD. If so, it isn't queried for the defect list. Otherwise, the list is read normally. Smartmontools identifies SSDs by an RPM value of one (the rotation rate a non-rotating drive reports), so simply checking this value for the device in question is enough to decide whether the defect list should be read. I added the following to the functions 'scsiReadDefect10' and 'scsiReadDefect12' in the file 'scsicmds.cpp':

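// Apply the workaround only when the device type is plain "scsi".
bool is_scsi = (strcmp(device->get_dev_type(), "scsi") == 0);
// An RPM value of 1 marks a non-rotating drive, i.e. an SSD.
int rpm = scsiGetRPM(device,0,0,0);

if (scsi_debugmode)
  pout("is_scsi: %d, rpm: %i\n", is_scsi, rpm);

if (is_scsi && (rpm == 1))
{
  if (scsi_debugmode)
    pout("Hotfix: Assuming SCSI SSD doesn't have defect list\n");
  // Bail out before the defect list command is ever sent to the drive.
  return 101;
}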

I'm not sure if checking whether the drive is in fact 'scsi' is necessary, but better safe than sorry.
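Since the same check ends up pasted into both functions, the decision itself can be pulled out into one small helper. The following is only a sketch of that idea: 'should_skip_defect_list' and 'kRpmNonRotating' are hypothetical names of mine, not something smartmontools provides, and the helper takes the device type string and RPM value as plain arguments so it can be compiled and tried on its own.

```cpp
#include <cassert>
#include <cstring>

// Assumption taken from the text above: an RPM value of 1 marks an SSD.
constexpr int kRpmNonRotating = 1;

// Hypothetical helper: returns true when the defect list query should be
// skipped, i.e. the device type is plain "scsi" and the drive is an SSD.
bool should_skip_defect_list(const char* dev_type, int rpm) {
    assert(dev_type != nullptr);
    return std::strcmp(dev_type, "scsi") == 0 && rpm == kRpmNonRotating;
}
```

In the patched functions this would be called as should_skip_defect_list(device->get_dev_type(), scsiGetRPM(device,0,0,0)). A spinning disk (for example rpm 7200), or a drive that doesn't report its rotation rate at all, still gets its defect list read as before.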

After compiling and testing smartctl with the change above, my dmesg no longer reports any problems :)

[1] https://forums.servethehome.com/index.php?threads/px04svb320-every-time-smart-accessed-non-medium-error-count-increments.28556/