What is System Managed Storage?
In my basic description of the Catalog Data Service (you should read that first if you have not already), I referred to system processes that do tasks like moving data between storage tiers based on business rules, as well as managing the off-site vaulting process. Those are components of System Managed Storage in the legacy enterprise computing context. But in the DeepSpace context we need to do much more, because we must make the catalog service invisible to the typical interactive user; we do this by using the standard bootable file system to act as a user view into the catalog’s contents.
In the CDS description I only touched on the features of the catalog versus a file system; there are also limits that make the CDS a poor choice for interactive users, with some notable exceptions. As noted in the CDS description, the service is designed for the needs of batch processing environments where latency of data access is not a significant concern. A common abstraction for any type of storage device (block addressable, object, or tape, for example) is indeed an essential requirement, as is the ability for data to flow transparently to and between any imaginable storage technology. This by necessity implies some lowest-common-denominator rules in the service that affect the user experience, at least at the command line or API level of the CDS presentation.
The most notable rule is imposed by the need to support storage technologies that are by nature single threaded, like tape. When working with tape data, it is not possible for more than a single user to access data on the same tape volume at the same time, so the CDS imposes this same rule for all data on a per-volume basis. That sounds like a significant limitation, but we have a historical example where it works very well.
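To make the rule concrete, here is a minimal sketch in Python of per-volume access serialization. The class and names are hypothetical, purely for illustration; the point is that a second accessor of the same volume blocks until the first releases it, regardless of the underlying media.

```python
import threading

class VolumeAccess:
    """Per-volume serialization: at most one accessor per volume at a time."""
    def __init__(self):
        self._locks = {}                # volume id -> lock
        self._guard = threading.Lock()  # protects the lock table itself

    def acquire(self, volume_id):
        """Block until the caller has exclusive use of the volume."""
        with self._guard:
            lock = self._locks.setdefault(volume_id, threading.Lock())
        lock.acquire()

    def release(self, volume_id):
        self._locks[volume_id].release()
```

Note that volumes are independent of one another: serializing access to VOL001 does not delay a user working on VOL002, which is why the rule is less painful in practice than it first sounds.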
Throughout the history of large system architectures and enterprise computing we have had Hierarchical Storage Management (HSM) systems that put a disk-based file system in front of an automated tape or optical disc library to act as a cache, so that the user is rarely cognizant of the latency of accessing data unless it happens to reside only on archive-tier volumes. HSM processes are essential to the way that DeepSpace manages and presents data.
Basic HSM functionality relies on two capabilities. The first is servicing files in a file system as they are created and modified by copying them into the CDS. This process allows the data to inherit the capabilities of the CDS that include things like versioning and vaulting, which we discussed in the CDS description, as well as access to technologies like automated tape libraries that do not naturally fit into a file system presentation.
The second capability of any HSM system is the ability to act on every file open to see whether the file’s data needs to be made resident before allowing the access to proceed, so it can restage data, if necessary, in a manner that is invisible to the user. If you understand the idea of virtual memory and paging in a CPU context, this is the same process in a file system context, augmented with a near-line repository.
So if we look at the HSM data life cycle for a file, we start with the file create and modify events, specifically the close-on-write process flow. Figure 1 shows the block diagram.
The close-on-write event in the current implementation comes from the fanotify API, which is part of the upstream Linux kernel. When I started developing what we now call DeepSpace back in the late 1990s, that wasn’t available, so we used the Data Management API (DMAPI). DMAPI had significant warts and was never accepted upstream into the kernel. Fortunately, fanotify eventually came about, primarily for virus scanning, but with obvious data management implications.
When our HSM daemon, smsd, sees these events, they trigger the archive of the file into the CDS namespace, an update of a file system side-band database with the file’s metadata, and finally tagging the file with an extended attribute (xattr) that holds the file’s handle in the CDS. At this point the file is protected, and it can be duplicated in any file system on any CDS client in stub-state form. Stub state means the file has all of its metadata and appears to be fully resident (ls -l will show the full file size), but the data is not actually resident; it only appears so.
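The archive side of that flow can be sketched as follows. This is a runnable illustration, not smsd itself: plain dicts stand in for the CDS, the side-band database, and the xattr store, and the SHA-256 digest stands in for whatever handle format the real CDS assigns.

```python
import hashlib
import os

def archive_on_close_write(path, cds, sideband_db, xattrs):
    """Close-on-write handler sketch: (1) archive the file's data into the
    CDS, (2) record its metadata in the side-band database, (3) append the
    new CDS handle to the file's xattr list."""
    with open(path, "rb") as f:
        data = f.read()
    handle = hashlib.sha256(data).hexdigest()   # stand-in for a real CDS handle
    cds[handle] = data                          # 1. archive into the CDS
    st = os.stat(path)
    sideband_db[path] = {"size": st.st_size,    # 2. side-band metadata
                         "mtime": st.st_mtime,
                         "handle": handle}
    xattrs.setdefault(path, []).append(handle)  # 3. tag the file with its handle
    return handle
```

Because each close-on-write appends a handle rather than replacing one, the xattr naturally accumulates the file's generation history, which the versioning features below depend on.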
The process for a file open is shown in figure 2. Fanotify gives us a read_perm(ission) event that is synchronous; it will block the file open until we acknowledge it. While it is blocked, we check whether the file is resident. If it is, we acknowledge the open and it proceeds normally. If it isn’t resident, we read the file from the CDS using the handle that was previously stored in the file’s xattr, write the file back in place, and then acknowledge the read-permission request.
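The open side can be sketched the same way. Again this is an illustration with hypothetical names: dicts stand in for the real services, and the `restage` callable stands in for the in-place write-back of the file's data.

```python
def on_read_permission(path, resident, cds, xattrs, restage):
    """read_perm handler sketch: the open is blocked while this runs.
    If the file is a stub, restage its data from the CDS using the
    active handle in the xattr, then acknowledge so the open proceeds."""
    if not resident.get(path, False):
        handle = xattrs[path][-1]       # newest generation is active by default
        restage(path, cds[handle])      # write the data back in place
        resident[path] = True
    return True                         # acknowledge: the open may proceed
```

The essential property is that the caller never sees anything but an ordinary (if occasionally slow) open, exactly like a page fault in the virtual memory analogy above.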
In a nutshell, these two processes represent all of the plumbing that makes data management via SMS possible and transparent to the user. All of the SMS capabilities I will now describe are based on these two basic operations.
Operationally, what can SMS do?
First, let us state that the CDS has rich attributes designed for long-term curation of data. We’ll start with adding an expiration policy. The CDS equivalent of a file system subdirectory is a volumeset. We create volumesets for files that are grouped for curation purposes, meaning they share an (initial) technology type assignment (tape or disk, for example) as well as a type of expiration policy. In the file system SMS context, we’ll only use generational expiration because that is the only one that makes sense in practice.
So, a volumeset can be configured for anything from 1 to 10,000 generations before expiration begins. If I configure a volumeset for, say, 5 generations, then once the 6th generation of a file is written, generation zero is marked as expired. That doesn’t mean it isn’t accessible; it just means that if I run a reclamation on the volumeset, the new replacement volumeset will have skipped over the expired files, and at that point they become unrecoverable. In other words, you can rewind a file back to any state prior to reclamation; after reclamation, rewinding to an expired version will fail.
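The expiration arithmetic is simple enough to sketch directly, assuming the "keep the newest N generations" policy just described (function names here are illustrative, not the real tooling):

```python
def mark_expired(generations, keep):
    """Mark every generation older than the newest `keep` as expired.
    Generations are ordered oldest-first (index 0 is generation zero)."""
    cutoff = len(generations) - keep
    for i, gen in enumerate(generations):
        gen["expired"] = i < cutoff
    return generations

def reclaim(generations):
    """Reclamation sketch: copy only unexpired generations into the
    replacement volumeset; expired versions become unrecoverable after."""
    return [g for g in generations if not g["expired"]]
```

Note the two-step nature: marking a generation expired is reversible in the sense that the data is still physically present; only reclamation actually discards it.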
So, when a file is written, it is archived as generation zero and the CDS handle is written to an extended attribute of the file. Every modification, as defined by a write_close, adds another copy in the CDS and appends another handle to the file’s xattr. Using the “gens” command on the file name reads the xattr and displays the number and timestamp of each CDS version. The “active” version of the file can be changed with the rew(ind) command, along with a signed integer specifying how far back to rewind the file in a generational context. Note that this is a virtual operation in the sense that the file isn’t rewritten to an older state: the file is de-allocated to a stub state and the xattr pointer back into the CDS is changed to set the active version to a prior state. The file doesn’t actually get rewritten until it is opened again. This is an important distinction when we look at mass file recovery operations, like reversing a ransomware attack through point-in-time recovery.
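The rewind mechanics reduce to a pointer change on the file's xattr record, which can be sketched as below. The record layout and the sign convention (negative meaning "go back toward older generations") are assumptions made for illustration.

```python
def rewind(xattr, offset):
    """rew(ind) sketch: move the active-generation pointer by `offset`,
    de-allocate the file to stub state, and return the now-active CDS
    handle. No data is rewritten; that happens lazily on the next open."""
    new_active = xattr["active"] + offset
    if not 0 <= new_active < len(xattr["handles"]):
        raise ValueError("no such generation")
    xattr["active"] = new_active
    xattr["resident"] = False   # stub: metadata stays, data blocks are freed
    return xattr["handles"][new_active]
```

Because nothing but the pointer and residency flag change, rewinding a million files is a metadata-only operation, which is what makes the mass point-in-time recovery case practical.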
This summarizes the first and most important aspect of SMS, which is to make it virtually impossible to lose data: protection is nearly instant and becomes a systemic file system characteristic when SMS management is turned on.
At this point it should be clear the file system is no longer a storage platform per se; it is a virtualized presentation of a much more capable namespace in the form of the CDS. It also acts as a local cache for “working set” data. The CDS in turn enables any type of storage hardware, including technologies you could not use for a native file system, or that would exhibit poor performance in the translation if you did.
So far I’ve only addressed data protection; SMS can do a lot more, and it all rides atop the ability to convert files back and forth between online and stub states automatically to leverage the CDS in practical terms. The next most used SMS capability is probably management of file system free space, so the user will never see a file system reach 100%. DeepSpace has a policy that regularly checks utilization levels and converts online files to stub state when a system comes under free-space pressure, so ENOSPC crashes will no longer happen.
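That policy loop can be sketched as follows. The threshold value and the largest-file-first ordering are assumptions for illustration; a real policy might also weigh file age or access recency. Only files already archived in the CDS are eligible, since stubbing an unarchived file would lose data.

```python
def relieve_space_pressure(files, used, capacity, target_util=0.8):
    """Free-space policy sketch: convert archived, resident files to stub
    state, largest first, until utilization drops to the target.
    Stubbing frees a file's data blocks but keeps all of its metadata."""
    candidates = sorted((f for f in files if f["resident"] and f["archived"]),
                        key=lambda f: f["size"], reverse=True)
    for f in candidates:
        if used / capacity <= target_util:
            break                 # pressure relieved; stop stubbing
        f["resident"] = False     # de-allocate data blocks
        used -= f["size"]
    return used
```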
I touched on the side-band file system database in figure 1, and you might have wondered about its purpose. As depicted, that database is kept synchronized in a lazy fashion at the conclusion of every file archive operation. The process is fully asynchronous but typically fast from a human operator’s perspective. It all depends on the volume of file modifications per second, which determines the archive queue depth, and on the performance of the archive target layer. Disk-based object storage is very fast; small files are usually archived in a second or two. Larger files don’t take much longer, but tape storage adds the latency you would expect for library (or manual tape operator) mounts, plus tape transport seek times.
Once files are archived and updated in the file system database, you can do things like fast file searches. These are typically done from the UI and are obviously orders of magnitude faster than dragging inodes through system memory with the “find” command.
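As an illustration of why this is fast, here is the kind of query involved, using SQLite with a toy schema (the real side-band database schema is not documented here, so the table layout is an assumption):

```python
import sqlite3

# Build a toy side-band database and run the sort of query the UI issues.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE files (path TEXT, size INTEGER, mtime REAL, owner TEXT)")
con.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("/data/a.dat",    1024, 1.0, "alice"),
    ("/data/b.dat",    4096, 2.0, "bob"),
    ("/scratch/c.tmp",  512, 3.0, "alice"),
])

# A metadata search is an indexed database query; no inode walk, no
# directory traversal, no stat() storm across millions of files.
rows = con.execute(
    "SELECT path FROM files WHERE owner = ? AND size > ?", ("alice", 600)
).fetchall()
```

A `find /data -user alice -size +600c` doing the same job has to read every inode under the tree; the database answers from metadata it already holds.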
A more interesting capability is what I refer to as file system synthesis. This is the process of executing a query against the file system database to select files based on user criteria from amongst the entire data catalog, subject to permissions, and “spraying” stubs representing those files wherever they may be needed. The processes we’ve already discussed make those stubs effective copies of all of the files they refer back to in the catalog. The executable that makes this happen is a small utility called “ds_touch”. It takes the database records that were created in the file system RDBMS and does an enhanced touch: it writes a stub with the extended attributes and the file’s original metadata, unless that original metadata is overridden, which is also an option, at least with regard to ownership and permissions.
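A sketch of the per-record work a tool like ds_touch performs is below. The record layout is hypothetical, and a side-car dict stands in for real extended attributes (which on Linux would use os.setxattr) so the sketch runs anywhere.

```python
import os

def ds_touch(record, dest_dir, xattr_store, override=None):
    """Enhanced-touch sketch: write a stub for one database record, a
    zero-length file carrying the original timestamps, plus the CDS handle
    and reported size in an extended attribute. (The real stub reports the
    full size via ls -l; here it lives in the side-car store instead.)"""
    path = os.path.join(dest_dir, os.path.basename(record["path"]))
    with open(path, "w"):
        pass                                          # zero-length stub on disk
    os.utime(path, (record["mtime"], record["mtime"]))  # original timestamps
    meta = {"owner": record["owner"], "mode": record["mode"]}
    if override:
        meta.update(override)          # optional ownership/permissions override
    xattr_store[path] = {"cds_handle": record["handle"],
                         "size": record["size"], **meta}
    return path
```

Driving this from a query result set is then just a loop: one ds_touch call per selected record, with no file data moved at all.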
This particular capability has many operational implications. Consider a tech refresh where a file system storage tier that has grown to an unwieldy size needs to be moved to a new block storage platform. It’s a rather trivial operation to copy the file systems being moved, in stub form, to the new platform and then remount them. Since only metadata pointers are involved, this is a very fast process and can happen while the system is in production and files are being updated.
Another use case involves provisioning virtual machines. With file system synthesis, provisioning a complete set of local file systems using an SQL query is far more flexible than cloning, since each image can be defined as granularly as can be described in the SQL query.
A broader-spectrum view, however, redirects the system architect who is intent on having a massive global namespace: stop trying to scale file systems to an unwieldy and unreliable size, and instead accept the CDS as the scalable global namespace, with file system presentations reflecting only the portion of it that each user actually needs for their own working set of data.