What is a Catalog Data Service?

FAQ/Troubleshooting/General Discussion about the Archive Namespace

Moderators: Steve Cranage, Robert Plaster

Post Reply
User avatar
Steve Cranage
Posts: 11
Joined: Wed Jul 21, 2021 10:56 pm
Location: Colorado Springs, CO

What is a Catalog Data Service?

Post by Steve Cranage »

What is a Catalog Data Service?
In the early days of Enterprise Computing Operating systems like IBM’s MVS were designed to run mission critical applications, and to leave no room for data corruption or loss; When dealing with bank and other financial records data corruption could obviously result is disastrous monetary loss that had to be avoided.

What the operating system developers created as the storage system in those days did not resemble the file system architecture most IT professionals are familiar with today because the requirements of enterprise computing were not designed to support large numbers of interactive users. The model was instead based on running these mission critical applications in a batch processing environment, think of running bi-weekly payroll or customer billing records as examples. Some requirements for a storage subsystem for the enterprise might look like the following:

• Extreme scalability
• Version control
• Off-site vaulting support
• System processes for long term data management

These requirements presented engineering challenges that would take a lot of engineering resources to solve, and those resources were only available due to the data center costs associated with these enterprise budgets, which were typically in the 10’s of million dollars and above. The architecture that resulted was an object storage model, based on a switched channel architecture for the object storage targets, and a shared database for the file metadata. I purposely used modern terms to describe the architecture of the day but should note this terminology did not exist at the time it was created.

The basic concept for running a batch “job” was to call the executable program by name and set its inputs and outputs with DD statements as arguments to the job. Notably, input data set names were just that, the name of the file(s) without any pointer to where the file happened to reside, the lookup was done inside of the catalog service. Think of it as going into a library and asking the librarian for a book by its title, then watching the librarian do a lookup in the Dewey Decimal based card catalog to find where to retrieve the book using its catalog index. You will not remember this example unless you were born before the 90’s but it’s the way we used to find books on any subject before the advent of the internet.

Output files had one more argument passed besides the file or data set name, that was referred to as the esoteric, and that defined a type of storage to direct the output file to, based upon cost and performance characteristics. The object storage targets were some combination of disk and tape volumes, and these were further divided by capacity and performance. It was typical to have performance and capacity classes of both disk and tape storage systems arranged in a hierarchy, with each layer in that hierarchy being identified with a unique esoteric.

Managing the relationship between data sets and storage tiers was done by another system process and it was able to change file residency based on business rules without affecting the daily operations because the catalog served to eliminate the need to ever know where a file was stored, that was immaterial other than in the sense that performance would be as dictated by current residency within the hierarchy.

Version control was implemented by a concept referred to as Generation Data Groups. Without going into too much detail, a simple example would be creating a file gives it a generational attribute of zero. The first update to the file results in another copy being created with a generation of 1, and so on up to 10,000 generations. Accessing files without a generation specified returns the latest generation. The use of an unsigned integer resolves the file being requested with an absolute value, use of a signed integer returns a file with a generation that is relative to the latest generation. So, for example, specifying a file name with a G-1 value gets you back to the last version of the file if you accidently corrupted the current version, or a figurative “undo”.

Lastly, vaulting was another system process that would copy data to exportable tape volumes and generate a pick list for the volume serial numbers that contained the data that needed to be conveyed to the offsite location. Those volumes and the accompanying catalog data would be all that was needed to restore the processing environment at the offsite location. There was no need to “restore” anything in the sense that a modern IT professional with a career in client/server architectures would think. These tape volumes are used for direct program I/O. Files are written individually, with file metadata in the form of labels that wrap the data sets when they are recorded. The catalog service at the remote site does the same device allocation and mount request and label processing that was done to create the data in the first place, the catalog data service just makes it all work automatically.
Steve Cranage
Principal Architect, Co-Founder
DeepSpace Storage

Post Reply