Backup to the future

Editorial Type: Technology Focus Date: 2020-08-01 Tags: Storage, Backup, Tape, Strategy, Deduplication, Management, Exagrid
Bill Andrews, President & CEO of ExaGrid, examines the journey from simple tape backups to tiered disk backups that use adaptive deduplication for fast, reliable and affordable backup and restore solutions

An organisation cannot function without its data. As a result, data is backed up at least five days per week at virtually every company around the world. Data backup guards against short-term operational and external events, and also satisfies legal, financial, and regulatory business requirements:

  • Restore files that were deleted or overwritten, or recover data from before a corruption event
  • Recover from a ransomware attack on primary storage
  • Keep retention/historical data for legal discovery and for financial and regulatory audits
  • Replicate to a second location to guard against disasters at the primary data location, such as earthquake, electrical power grid failure, fire, flood or extreme weather conditions

Due to all of these requirements, backup retention points are kept so that organisations have a copy of the data at various points in time. Most keep a number of weekly, monthly, and yearly backups. As an example, if an organisation keeps 12 weekly copies, 24 monthly copies, and 5 yearly copies, that amounts to about 40 copies of the data. This means that the backup storage capacity required is roughly 40 times the primary storage capacity.
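The arithmetic behind that "about 40 copies" figure can be sketched in a few lines. The retention schedule comes from the article's example; the primary storage size used here is an illustrative assumption:

```python
# Retention schedule from the article's example:
weekly, monthly, yearly = 12, 24, 5
copies = weekly + monthly + yearly   # 41 retention points, i.e. "about 40"

primary_tb = 50                      # assumed primary storage in TB (illustrative)
backup_tb = copies * primary_tb      # full copies, with no deduplication

print(f"{copies} retention points -> {backup_tb} TB of backup storage "
      f"for {primary_tb} TB of primary data")
```

Without any data reduction, every retention point is a full copy, which is why backup capacity dwarfs primary capacity.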

Since backup policies require keeping retention copies, and the storage needed for backup is far greater than the primary storage, the industry has evolved over time to reduce the amount of storage required in order to reduce the cost of backup storage.

Backups were sent to tape for about 50 years because, if organisations were going to keep 30, 40, 50 or 60 copies of the data (retention points), the only cost-effective way to keep those copies was a medium that was very inexpensive per gigabyte. Tape solved the cost problem, but it was also unreliable, being subject to dirt, humidity, heat and wear. Tape also required a lot of management, including storing tapes in cartons and shipping a set of tapes offsite each week to another location or a third-party tape storage facility. Tape backups were great for cost but had many other issues.

Disk solved the problems of tape: it was reliable, and it was secure, since it sat in a data centre rack behind physical and network security. Organisations could encrypt the data and replicate it to a second data centre, with no physical media to ship.

Disk was far too expensive per gigabyte until the year 2000 when enterprise-quality SATA drives were introduced. This dropped the price of backing up to disk dramatically, as SATA was reliable enough for backup storage. However, even at a lower cost, disk was still too expensive when you did the math of keeping dozens of copies.

Backup applications added the ability to write to disk volumes or NAS shares so that disk could be used. Disk served as a staging area in front of tape, but it did not eliminate tape: backup applications would write one or two backups to disk for fast, reliable backups and restores, but still wrote to tape for longer-term retention because of cost.

Although SATA disk was lower in price than any other enterprise storage media, it was still too expensive to keep all the retention on disk. In the 2002-2005 time frame a new technology, data deduplication, entered the market. Data deduplication compared one backup to another and kept only the changes from backup to backup, which typically amount to about a 2% change per week. Backups were no longer kept as full copies; only the unique blocks were stored, greatly reducing the storage required.

Data deduplication did not have much impact if there were only two or three copies; in fact, it was not much different from simply compressing the data. However, at 18 copies the amount of disk used was roughly 1/20th of that used without deduplication: you could store in 1TB of deduplicated form what would normally take 20TB of disk to store without deduplication. The term 20:1 data reduction was used (assuming about 18 copies of retention). If the retention was longer, the data reduction ratio was even greater.
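One simple way to see where a figure like 20:1 comes from is to model one full backup plus 17 sets of unique changes. The 2% weekly change rate is the article's number; the compression factor layered on top is an assumption here, since deduplication alone at 2% change lands closer to 13:1:

```python
# Illustrative model of the 20:1 figure (the 1.5x compression is an assumption).
copies = 18
change_rate = 0.02                              # ~2% unique data per backup

stored_fulls = 1 + (copies - 1) * change_rate   # first full + 17 change sets
dedup_ratio = copies / stored_fulls             # ~13.4:1 from dedup alone
compression = 1.5                               # assumed compression on top
effective = dedup_ratio * compression           # ~20:1 combined reduction

print(f"dedup alone: {dedup_ratio:.1f}:1, with compression: {effective:.1f}:1")
```

The model also shows why the ratio keeps improving with longer retention: each extra copy adds 18 copies' worth of logical data to the numerator but only 2% of a full to the denominator.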

At this point, organisations could eliminate tape, as the amount of disk required was greatly reduced, bringing the cost of backup storage close to that of tape. However, while these appliances added data deduplication to reduce storage, they did not factor in the trade-off in compute. These "deduplication appliances" performed the deduplication inline, meaning the data is deduplicated between the backup application and the disk, as it is being written. Data deduplication compares billions of blocks and is therefore extremely compute-intensive.

This compute-intensive inline deduplication process actually slows backups down to about one third the speed of writing directly to disk. And because deduplication happens inline, all the data sits on disk in deduplicated form, so each time you restore, the data has to be put back together, a process called rehydration. Rehydration is slow and can take up to 20 times longer than restoring un-deduplicated data from disk.

These deduplication appliances used block-level deduplication, which creates a very large hash tracking table that must be kept in a single front-end controller. As a result, as data grows, only storage is added behind the controller. If the data doubles, triples or quadruples, the amount of deduplication work increases accordingly, but the front-end controller's resources (CPU, memory, network ports) are fixed, so the same resources must handle four times the data that they once handled.
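Block-level deduplication and the hash tracking table it depends on can be sketched in a toy form. This is not any vendor's actual implementation, and real appliances use more sophisticated chunking, but it shows why the index must cover every unique block ever written and why restores must reassemble (rehydrate) data:

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 KB fixed blocks, typical of block-level dedup

def dedup_store(data: bytes, index: dict) -> list:
    """Split data into fixed-size blocks and store each unique block once.

    `index` maps block hash -> block bytes; it plays the role of the
    appliance's hash tracking table, growing with every unique block.
    Returns the list of hashes needed to rebuild the backup.
    """
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        index.setdefault(h, block)      # only new (unique) blocks are stored
        recipe.append(h)
    return recipe

def rehydrate(recipe: list, index: dict) -> bytes:
    """Reassemble the original data from its block hashes (a restore)."""
    return b"".join(index[h] for h in recipe)

index = {}
backup1 = b"A" * BLOCK_SIZE * 3                        # Monday's backup
backup2 = b"A" * BLOCK_SIZE * 2 + b"B" * BLOCK_SIZE    # Tuesday's: one changed block
r1 = dedup_store(backup1, index)
r2 = dedup_store(backup2, index)
print(f"6 logical blocks stored as {len(index)} unique blocks")
```

Because the index here is one in-memory dictionary, it also illustrates the scale-up problem the article describes: the table lives with a single controller, so growing data grows the lookup work against fixed resources.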

As a result, the backup window grows and grows until you are forced to buy a bigger, more powerful front-end controller (a "forklift upgrade"), which adds cost over time. The front-end controller approach relies on fixed resources and fails to keep up with data growth, so the controllers are continually made obsolete in order to add more resources.

Even though inline scale-up appliances (a front-end controller with disk shelves) reduce the amount of storage, and therefore storage costs, they greatly slow down backups due to inline deduplication, slow down restores because they keep only deduplicated data (the rehydration process), and don't scale, forcing future forklift upgrades and product obsolescence that add long-term costs. The net result is that they fix the storage cost problem but introduce backup and restore performance issues, and they are not architected for data growth (scalability).

Customers used, and still use, data deduplication appliances. However, backup applications went through a phase where they tried to eliminate the deduplication appliance by integrating the deduplication process into the backup media servers. The idea was to buy low-cost disk and get deduplication as a feature of the backup application. This created many challenges.

The first challenge is that data deduplication is compute-intensive, and the media server already has the task of taking all the backups and writing them to the media, so its compute resources are already fully used. Adding deduplication to a media server crushes the CPU, and backup jobs slow to a crawl. To work around this, backup applications increase the deduplication block size, so that less comparison work, and less CPU, is needed. Instead of using block sizes of 8KB they use, for example, 128KB. Instead of achieving the 20:1 deduplication ratio of a deduplication appliance, they achieve about 5:1 or 6:1. The media server is still slowed down, and all data is stored deduplicated on disk, so restores are still slow.
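The block-size trade-off can be demonstrated with a small simulation. The figures below are synthetic (a 1 MB "backup" with a single changed byte), not measurements from any product, but they show the mechanism: a small change dirties one block either way, yet the large-block system must re-store 128 KB for it versus 8 KB:

```python
import hashlib
import os

def unique_blocks(versions, block_size):
    """Count unique fixed-size blocks across all backup versions."""
    seen = set()
    for data in versions:
        for i in range(0, len(data), block_size):
            seen.add(hashlib.sha256(data[i:i + block_size]).digest())
    return len(seen)

base = os.urandom(1024 * 1024)                      # 1 MB "backup"
# Second backup differs from the first by a single flipped byte.
changed = base[:500_000] + bytes([base[500_000] ^ 0xFF]) + base[500_001:]

small = unique_blocks([base, changed], 8 * 1024)    # 8 KB blocks
large = unique_blocks([base, changed], 128 * 1024)  # 128 KB blocks

logical = 2 * len(base)                             # total logical data
print(f"8 KB blocks:   {logical / (small * 8 * 1024):.2f}:1 reduction")
print(f"128 KB blocks: {logical / (large * 128 * 1024):.2f}:1 reduction")
```

With only two versions the gap is small, but across dozens of retention copies, each carrying scattered small changes, the penalty compounds, which is how a 128KB block size ends up in the 5:1 range while 8KB blocks can reach 20:1.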

Lastly, the same scaling issues remain. Some backup application vendors packaged the media server, deduplication software, a server and disk into a turnkey appliance, but the challenges still exist: slow backups, slow restores and scalability issues, plus a higher cost, since the lower deduplication ratio from the larger block size means far more disk is needed than with a deduplication appliance.

There is no doubt that disk is the right medium. It is reliable and lives in a data centre rack with both physical and network security, both onsite and offsite. If data is backed up to disk without data deduplication the backup and restore performance is great, however the cost is high due to the sheer amount of disk required.

Using an inline deduplication appliance, you can reduce the high cost of storage thanks to the 20:1 deduplication ratio. However, these appliances are slow for backups due to inline deduplication processing, slow for restores because they keep only deduplicated data that must be rehydrated on every request, and they don't scale as data grows, which lengthens the backup window over time and forces costly forklift upgrades and product obsolescence.

If deduplication is done in the backup application, performance is even slower than with a deduplication appliance, as the CPU is shared between the deduplication process and media server functionality. Backup applications can improve this by taking incremental backups only, but that brings other trade-offs. In addition, far more disk is required, as the deduplication ratio is in the range of 5:1 to 10:1 rather than 20:1.

There is no free lunch here; the different storage methods just push the problem around. Why is that? Because unless you build a solution that includes deduplication and also solves the backup performance, restore performance, storage efficiency and scalability issues, then no matter where the deduplication lives, the solution will still be broken. The answer is a solution architected to use disk in the appropriate way for fast backups and restores, to use data deduplication for long-term retention, and to scale out all resources as data grows.

Tiered backup storage offers the best of both worlds: disk without deduplication for fast backups and restores, and deduplication to lower overall storage costs. The first tier is a disk cache (the "Landing Zone") where backups are written to standard disk in their native format, with no deduplication to slow them down.

This allows for fast backups and fast restores, as there is no deduplication process between the backup application and the disk, and the most recent backups are stored in un-deduplicated form. As backups are written to disk, and in parallel with backups coming in, the data is deduplicated into a second tier for longer-term retention storage. This is called Adaptive Deduplication (it is neither inline nor post-process). The system is composed of individual appliances that each have CPU, memory, networking and storage; as data grows, all resources are added, which keeps the backup window fixed in length and eliminates both forklift upgrades and product obsolescence.
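The two-tier flow described above can be sketched as a toy model. This is not ExaGrid's actual implementation (in a real system the deduplication runs concurrently with incoming backups; here it runs sequentially for simplicity), but it captures the key property: recent restores never touch the deduplicated tier:

```python
import hashlib

BLOCK = 8 * 1024

class TieredBackupStore:
    """Toy two-tier backup store: a native-format landing zone plus a
    deduplicated long-term retention tier (illustrative only)."""

    def __init__(self, landing_capacity=2):
        self.landing = {}               # name -> raw bytes (recent backups)
        self.landing_order = []
        self.landing_capacity = landing_capacity
        self.blocks = {}                # hash -> block (deduplicated tier)
        self.recipes = {}               # name -> list of block hashes

    def backup(self, name, data):
        # Tier 1: land the backup whole, in native format (fast path).
        self.landing[name] = data
        self.landing_order.append(name)
        if len(self.landing_order) > self.landing_capacity:
            evicted = self.landing_order.pop(0)
            del self.landing[evicted]   # older backups live only in tier 2
        # Tier 2: deduplicate into the long-term retention repository.
        recipe = []
        for i in range(0, len(data), BLOCK):
            b = data[i:i + BLOCK]
            h = hashlib.sha256(b).hexdigest()
            self.blocks.setdefault(h, b)
            recipe.append(h)
        self.recipes[name] = recipe

    def restore(self, name):
        if name in self.landing:        # recent backup: no rehydration needed
            return self.landing[name]
        # Older backup: rehydrate from the deduplicated tier.
        return b"".join(self.blocks[h] for h in self.recipes[name])

store = TieredBackupStore()
b1 = b"A" * BLOCK * 3
b2 = b"A" * BLOCK * 2 + b"B" * BLOCK
b3 = b"A" * BLOCK + b"B" * BLOCK * 2
store.backup("mon", b1)
store.backup("tue", b2)
store.backup("wed", b3)   # "mon" is evicted from the landing zone
```

Restoring "wed" is a straight read from the landing zone, while restoring "mon" rehydrates from block storage; meanwhile the deduplicated tier holds nine logical blocks as just two unique ones.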

The net is:

  • Backups are as fast as writing to disk as there is no deduplication process in the way
  • Restores are fast as there is no data rehydration process, because the most recent backups are in a non-deduplicated form
  • Cost is low upfront because all long-term retention data is deduplicated in the long-term repository tier
  • Backup window stays fixed in length as data grows as the architecture is scale-out, adding all resources and not just disk as data grows
  • Long-term costs are low as the scale-out architectural approach eliminates forklift upgrades and product obsolescence

In summary then, backup storage has taken a long journey and has arrived with tiered backup storage that provides fast and reliable backups and restores, with a low cost up front and over time.
