Object lessons

Editorial Type: Feature Date: 2020-06-01 Tags: Storage, Object Libraries, Research, AI, Workflow, HPC, OCF
Martin Ellis, pre-sales engineer at OCF, looks at how an object-based workflow approach can work for research data storage

Today’s “left-over” data can be the basis of tomorrow’s breakthrough. As we keep data for longer and work harder to share and re-use it, it becomes critical that the data is accurately catalogued and easily retrievable. Although long touted as the answer to the scalability problems facing file systems, object storage remains niche outside web-scale deployments.

Parallel file systems such as Spectrum Scale are great at delivering multiple petabytes as a single file system, but for some the volume of data is becoming a human issue rather than a technological one. The directory structure becomes too cumbersome to navigate and a new workflow is needed. This is the space where I see object storage truly thriving.

At OCF we work with many object storage vendors. Many of their products can use erasure coding in place of full replication, which reduces the capacity overhead of protecting data.

When you’re going for site-level resiliency, object stores start to become unbeatable. For a traditional high availability 3-site file solution comprising mirrored active/active replicas and an asynchronous off-site DR copy, your file will take up at least three times its size in storage capacity. With a 3-site object store, some technologies can get this down to 1.5x, though personally I wouldn’t advise going lower than ~1.7x, as you still want some resiliency left over after a whole site has gone down.

The greatest disadvantage here is that at 1.5x, or even 1.7x, erasure coding over three sites, each site holds less than a full copy of the data. Retrieving an object therefore requires an inter-site network transfer and compute overhead to re-assemble the whole object. The result: space-efficient dispersed object stores are inherently slow.
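The arithmetic behind those figures can be sketched in a few lines. This is only an illustration: the 6+3 and 7+5 fragment layouts are assumptions chosen because they give roughly 1.5x and 1.7x overhead when spread evenly over three sites, and are not taken from any particular vendor's product.

```python
from fractions import Fraction


def erasure_profile(data_fragments, parity_fragments, sites=3):
    """Summarise a k+m erasure-coding layout spread evenly over `sites`."""
    total = data_fragments + parity_fragments
    if total % sites != 0:
        raise ValueError("fragments must divide evenly across sites")
    per_site = total // sites
    # Raw capacity consumed per unit of user data.
    overhead = Fraction(total, data_fragments)
    # Fragments still readable after one whole site is lost.
    surviving = total - per_site
    # Further fragment losses that can be absorbed on top of the site loss.
    spare_after_site_loss = surviving - data_fragments
    # Share of the fragments needed for a read that one site holds locally.
    local_share = Fraction(per_site, data_fragments)
    return {
        "overhead": float(overhead),
        "spare_after_site_loss": spare_after_site_loss,
        "local_share": float(local_share),
    }


print(erasure_profile(6, 3))  # illustrative 1.5x layout
print(erasure_profile(7, 5))  # illustrative layout at roughly 1.7x
```

In this sketch the 6+3 layout has nothing in reserve once a site is down, while 7+5 can still lose one more fragment. In both cases a single site holds fewer fragments than a read needs, which is why retrieval has to cross sites.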

Team work

An object store should be a repository or means of sharing data. We shouldn’t try to replace file storage with objects, rather use the strengths of each together to achieve more.

As a specialist in HPC and research data storage, I view object storage with my HPC hat on. In HPC land, many of the largest data sets are generated by scientific instruments like high resolution microscopes, spectrometers and sequencers. In an object-based workflow, files generated by these instruments can immediately be objectised and tagged with appropriate metadata such as researcher, project, instrument settings, what was sampled, and the conditions under which the data should be shared (for example, as required by some funding bodies). The resulting objects are then ingested into an object store for preservation.
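As a rough sketch of what "objectising" an instrument file might look like, the snippet below bundles a payload with descriptive metadata and a checksum. The class, the field names and the key format are all hypothetical, and the actual ingest call is left out because it depends on the object store in use.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InstrumentMetadata:
    # Hypothetical field names, mirroring the tags mentioned in the text.
    researcher: str
    project: str
    instrument: str
    sample: str
    settings: dict = field(default_factory=dict)
    sharing_conditions: str = "restricted"


def objectise(payload: bytes, meta: InstrumentMetadata) -> dict:
    """Bundle raw instrument output with its metadata into one record."""
    digest = hashlib.sha256(payload).hexdigest()
    return {
        "key": f"{meta.project}/{digest}",  # assumed key scheme
        "sha256": digest,
        "size": len(payload),
        "metadata": asdict(meta),
    }


meta = InstrumentMetadata(
    researcher="a.researcher",
    project="cell-imaging",
    instrument="high-resolution microscope",
    sample="yeast culture",
    settings={"magnification": "60x"},
    sharing_conditions="open after 12 months",
)
record = objectise(b"raw detector frames", meta)
print(json.dumps(record, indent=2))
```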

If a researcher needs to re-visit the output they can do so, and they can easily cache whole projects on their local systems. With data catalogued in an object storage solution’s metadata management system, it can be published and shared, giving researchers wanting more data - but without extra funding - the ability to query past projects for similar instruments and samples.
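A query of that kind could look something like the following. It is a minimal sketch over an in-memory list; a real metadata management system would expose its own query interface, and the tag names and the "open" sharing value here are assumptions.

```python
def find_similar(catalogue, **wanted):
    """Return keys of shareable objects whose metadata matches every wanted tag."""
    hits = []
    for entry in catalogue:
        tags = entry["metadata"]
        # Respect the sharing conditions recorded at ingest time.
        if tags.get("sharing_conditions") != "open":
            continue
        if all(tags.get(name) == value for name, value in wanted.items()):
            hits.append(entry["key"])
    return hits


# Hypothetical catalogue entries from three past projects.
catalogue = [
    {"key": "proj-a/001",
     "metadata": {"instrument": "cryo-em", "sample": "yeast",
                  "sharing_conditions": "open"}},
    {"key": "proj-b/007",
     "metadata": {"instrument": "cryo-em", "sample": "yeast",
                  "sharing_conditions": "embargoed"}},
    {"key": "proj-c/042",
     "metadata": {"instrument": "sequencer", "sample": "yeast",
                  "sharing_conditions": "open"}},
]

print(find_similar(catalogue, instrument="cryo-em", sample="yeast"))
```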

Integrating with HPC

Although object stores tend to be too slow to efficiently support HPC resources directly, the programmatic nature of object interfaces allows them to integrate well into HPC workflows.

Depending on the scheduler and cluster management system being used, data can be pre-staged onto the cluster’s fast local scratch file storage using API calls in the job submission script, ready for when the job is allocated CPU time.
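A pre-staging step might be sketched as below. The `fetch` callable stands in for whatever API call the object store provides, and a dictionary plays the part of the store so the example runs on its own; how the step is hooked into a given scheduler is not shown.

```python
import tempfile
from pathlib import Path


def prestage(keys, fetch, scratch_dir):
    """Copy each object to scratch before the job starts; return local paths."""
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    staged = []
    for key in keys:
        # Flatten the object key into a file name on scratch.
        target = scratch / key.replace("/", "_")
        if not target.exists():  # skip objects that are already staged
            target.write_bytes(fetch(key))
        staged.append(target)
    return staged


# Stand-in for the object store: a dictionary instead of a real API client.
fake_store = {"proj-a/001": b"frame data", "proj-a/002": b"more frames"}

with tempfile.TemporaryDirectory() as scratch:
    paths = prestage(sorted(fake_store), fake_store.__getitem__, scratch)
    print([p.name for p in paths])
```

Skipping files that already exist keeps a re-submitted job from pulling the same objects across sites twice.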

Similarly, any output written to the HPC system’s scratch file storage can be assembled into an object and published to the object store. Objects can be tagged not just with date-time but also with any input parameters, and the submission script can be included, allowing researchers to more easily manage and locate outputs from many similar but different iterations.
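One way to picture the publishing side is the sketch below, which wraps a job's output with a timestamp, its input parameters and the submission script, then picks one run out of several similar ones. The tag naming (`param.` prefix) and the `./simulate` command are hypothetical, and the upload itself is omitted.

```python
import hashlib
from datetime import datetime, timezone


def build_output_object(output: bytes, parameters: dict, submission_script: str) -> dict:
    """Describe a job output so it can be found again by its inputs."""
    return {
        "sha256": hashlib.sha256(output).hexdigest(),
        "size": len(output),
        "tags": {
            "created": datetime.now(timezone.utc).isoformat(),
            # One tag per input parameter, stored as strings.
            **{f"param.{name}": str(value) for name, value in parameters.items()},
        },
        "submission_script": submission_script,
    }


script = "#!/bin/bash\n./simulate --temperature $T\n"  # hypothetical job script
runs = [
    build_output_object(b"result", {"temperature": t}, script)
    for t in (280, 300, 320)
]

# Locate one iteration among several similar ones by its input parameter.
match = [r for r in runs if r["tags"]["param.temperature"] == "300"]
print(len(match))
```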

Fuelling AI

A growing number of our life science customers are adopting AI, and an object-based workflow would be great here. For AI, you typically need a lot of data, often more than one practitioner can generate. The ability to pull many thousands of output files, tagged with what was sampled and observed, from potentially hundreds of projects would be a treasure trove for AI researchers wanting to expand their dataset.

In conclusion, although I do not foresee object storage replacing file storage for active research data, it does offer an excellent means to curate and preserve data efficiently in a geographically dispersed solution, with a programmatic interface to support research computing systems.

More info: www.ocf.co.uk