Friday, January 29, 2016

Storage Spaces and Latent Sector Errors / Unrecoverable Read Errors

I emailed to ask about storage spaces direct and how it handles Latent Sector Errors (LSE), otherwise known as Unrecoverable Read Errors. Here is the email I sent:

My company is in the process of evaluating different options for upgrading our production server environment. I’m tasked with finding a solution that meets our needs and is within our budget.

I’m trying to compare and contrast storage spaces direct with storage spaces utilizing JBOD enclosures. Data resiliency, integrity and availability are paramount. So I’m primary looking at both of these technologies from that perspective. Thus, if we go the JBOD route, we’re looking at implementing 3 enclosures and utilizing the enclosure awareness of storage spaces. This solution has existed longer then storage spaces direct and I would think has been tested more thoroughly. I like the scalability and elegance of storage space direct though. From a conceptual overview and a hardware setup perspective it just seems easier to grasp and it seems like a better solution.

My question is, how do both of these setups handle unrecoverable read errors/latent sector errors? Does one solution handle them better than the other?

There are horror stories about hardware RAID controllers evicting drives because of URE/LSE and then during RAID rebuilds encountering additional UREs/LSEs and bricking the storage. This is more worrisome when SATA disks are used (due to UREs/LSEs occurring more often and sooner with SATA disks compared to SAS disks.) How does storage spaces/S2D differ in this regard? I know one of the selling points of S2D is the use of SATA disks. I’m curious as to how this problem has been addressed since SATA disks are being promoted. What happens if there is a URE/LSE in end user data? What happens if there is a URE/LSE in the metadata used by storage spaces/S2D or the underlying file system?

Here is the response I received:

Both Spaces direct and Shared Spaces (with JBOD)  both rely on the same software raid implementation, difference is in the connectivity.  Software raid implementation does not throw away the entire drive on failure, we trigger activity to move the data out of the drive while keeping the copy till data is moved (if we have copies available).  On Write failure we try to move the impacted range right away while background activity is moving the untouched data out of the disk,  some of the disks fail to write but they can continue to support reads in which case the data on those drives can still be used to serve user requests.  Until the data on the failed drive is rebuilt on spare capacity the drive is not removed, user can still force but not automated.  On URE - we trigger rebuilt to recover lost copy, this is triggered both when reads errors detected while satisfying user error or by back ground scrub process.  Back ground scrub process detects URE by validating sector level checksum across copies and  validating. 

So it would appear that if you utilize storage spaces you don't have to worry about a LSE/URE taking out a drive and then a subsequent LSE/URE taking out another drive, thus taking down your array. 

No comments:

Post a Comment