Backup at Scale – Part 2 – MapReduce for Backup

Stretch the analogy

So mentioning MapReduce in connection with backup will probably get lots of funky agile programmers rolling their eyes at me, but hey! I am a simple guy and saw an analogy that might work… let’s see.

So previously we found that in order to back up at scale we need to automate the living daylights out of our backup processes. This can be done by using off-the-shelf products like EMC Avamar integrated with vCloud Director, or by building a bespoke backup environment yourself (this is really only for a few huge Google-scale environments).

Distribute your load

So onwards with the shoehorned MapReduce analogy. My simple-minded view of the MapReduce process is as follows:

[Figure: My Simple View]

In order to back up at scale we really need to do the same type of distribution of the workload and then collection of the results. So a backup system built around the MapReduce architecture would exhibit this type of workflow:

[Figure: Backup MapReduce]
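
For the programmers still rolling their eyes, here is a minimal Python sketch of that workflow. It is only an illustration of the analogy: `run_native_backup`, the server names and the rule that fakes a failure are all made up for the example, and a real map step would call whatever backup function the application supplies.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_native_backup(server: str) -> dict:
    """Map step: hypothetical stand-in for triggering a server's own backup."""
    # Assumption for the demo: names ending in "-bad" pretend to fail.
    status = "failed" if server.endswith("-bad") else "ok"
    return {"server": server, "status": status}


def summarise(results) -> Counter:
    """Reduce step: collapse per-server results into a count per status."""
    return Counter(result["status"] for result in results)


def backup_all(servers):
    """Fan the backups out across the servers, then collect the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in the same order as the input servers.
        results = list(pool.map(run_native_backup, servers))
    return results, summarise(results)


if __name__ == "__main__":
    results, summary = backup_all(["db01", "files01", "mail-bad"])
    print(summary)
```

The point of the sketch is that the coordinating process only hands out work and counts outcomes; it never touches the backup data itself.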

In traditional backup architectures you would have to roll out backup clients to all these application or file servers in order to get them to do a backup.  This locks the backup servers into doing a load of IO and encapsulating all the backups into a proprietary backup format which is not massively scalable (big, but not huge). 

However, the more modern, scalable approach is to integrate with the backup function supplied by the application, get this function to write the data to some protection storage (a deduplication appliance, for instance) and then report back to a central catalog that the backup is done. This way you can more easily scale your backup catalog, because that server isn’t bogged down with the workload of actually moving the data around. So schematically the architecture would look like this:

[Figure: MapReduce backup architecture]

Summary

To back up at scale, take the IO workload away from the backup server and distribute it throughout the enterprise, using the resources on the application servers. Send the backups directly to the protection storage in the application’s native format to make recoveries simple. Create a central backup authority for maintaining a backup catalog, enforcing the backup policies, collecting alerts and providing operational and chargeback reports.
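
To make the central backup authority a little more concrete, here is a rough Python sketch of a catalog that only records what the application servers report and checks it against a policy. The `BackupCatalog` class, its method names, the 24-hour policy and the storage path are all assumptions for illustration, not how any particular product works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class BackupCatalog:
    """Hypothetical central catalog: stores metadata only, never backup data."""

    max_age: timedelta = timedelta(hours=24)  # assumed policy for the example
    last_success: dict = field(default_factory=dict)

    def report(self, server: str, finished_at: datetime, location: str) -> None:
        """Called by an application server once its native backup has landed."""
        self.last_success[server] = (finished_at, location)

    def out_of_policy(self, servers, now: datetime) -> list:
        """Return servers with no backup on record, or one older than max_age."""
        return [
            server
            for server in servers
            if server not in self.last_success
            or now - self.last_success[server][0] > self.max_age
        ]


if __name__ == "__main__":
    catalog = BackupCatalog()
    now = datetime(2024, 1, 2, 12, 0)
    catalog.report("db01", now - timedelta(hours=2), "dedupe01:/db01")
    catalog.report("files01", now - timedelta(hours=30), "dedupe01:/files01")
    print(catalog.out_of_policy(["db01", "files01", "mail01"], now))
```

Because the catalog holds a timestamp and a location per server rather than the data, its workload grows with the number of backups, not with the number of terabytes moved.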

Summary of the two articles on how to back up at scale – Automate and Distribute, simples…

And if this looks a bit like the EMC data protection vision it is completely coincidental! … honest 😉


Backup at Scale – Part 1 – Linear is badness

In a few recent technologies we see that, by design, performance grows linearly as building blocks are added. In clustered systems a building block will include CPU, memory and disk, resulting in linear growth of compute performance and capacity. In the backup world linear just doesn’t cut the mustard.

Who cuts mustard anyway?

Don’t get sidetracked with silly questions like that, use Google!  What I am trying to say is that for backup systems there is a requirement for the “work done to achieve backups” to grow significantly slower than the growth of data to protect.

Imagine a world where 1TB of protected data requires 10% of a building block of “work done”, where “work done” is a combination of admin time, compute, backup storage etc. If our backup processes and technologies required linear growth of work done, then much badness occurs. Diagrammatically…


No one would ever get to the situation described in the diagram above as they would soon realise that “this just ain’t workin’” and rethink their systems.  However the question is what should the “work done” growth look like?  It needs to be a shallower growth curve than that of the data protected and needs to slow as the capacities increase.  So we can imagine that we would want to achieve something like this:

[Figure: slow growth]
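
To put some numbers on the two curves, here is a small Python sketch. The linear case uses the 10%-of-a-building-block-per-TB figure from above. The slow-growth case uses a square-root curve, which is purely an assumption picked to illustrate "shallower and flattening"; the real shape will depend on your tooling.

```python
def linear_work(tb: float, blocks_per_tb: float = 0.1) -> float:
    """Work done grows in step with data: 10% of a building block per TB."""
    return tb * blocks_per_tb


def sublinear_work(tb: float, blocks_per_tb: float = 0.1, exponent: float = 0.5) -> float:
    """Assumed slow-growth curve: same cost at 1TB, flattening as data grows."""
    return blocks_per_tb * tb ** exponent


if __name__ == "__main__":
    for tb in (1, 100, 10_000):
        print(f"{tb:>6} TB  linear={linear_work(tb):8.1f}  slow={sublinear_work(tb):6.1f}")
```

Under these assumptions, 10,000TB costs about 1,000 building blocks of work on the linear curve but only about 10 on the slow-growth one, which is the gap the rest of this post is trying to open up.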

But how… How… HOW!?!

A number of methodologies can be employed to work towards this goal.  The first and most obvious step is to A-U-T-O-M-A-T-E (sounds better if you say it in a robotty way).

Phase 1 – Take the drudge processes (and believe me there are plenty) and automate them:

  1. Checking backup logs for failures
  2. Restarting backups that have failed
  3. Generating reports
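
As a rough idea of how little code the three Phase 1 chores need, here is a Python sketch. The log format (`<client> FAILED <reason>`), the `restart` callback and the retry limit are all assumptions for the example; your backup product will have its own log layout and its own way of restarting a job.

```python
import re

# Assumed log format: "<client> FAILED <reason>" or "<client> OK".
FAILURE_PATTERN = re.compile(r"^(?P<client>\S+)\s+FAILED\b")


def failed_clients(log_lines):
    """Chore 1: scan the backup log and list each failed client once."""
    failed = []
    for line in log_lines:
        match = FAILURE_PATTERN.match(line)
        if match and match.group("client") not in failed:
            failed.append(match.group("client"))
    return failed


def restart_failed(log_lines, restart, max_attempts=3):
    """Chore 2: retry each failed client until it succeeds or attempts run out.

    `restart` is a hypothetical callback returning True when the rerun works.
    """
    outcome = {}
    for client in failed_clients(log_lines):
        # any() stops calling restart() as soon as one attempt succeeds.
        outcome[client] = any(restart(client) for _ in range(max_attempts))
    return outcome


def report(outcome):
    """Chore 3: turn the outcome into a plain-text report, one client per line."""
    return "\n".join(
        f"{client}: {'recovered' if ok else 'still failing'}"
        for client, ok in sorted(outcome.items())
    )


if __name__ == "__main__":
    log = ["db01 OK", "mail01 FAILED timeout", "files01 FAILED disk full"]
    print(report(restart_failed(log, restart=lambda client: client == "mail01")))
```

Only the clients that are still failing after the retries need a human to look at them, which is exactly the kind of work reduction we are after.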

Phase 2 – Take some of the more difficult but boring jobs and automate them too!

  1. Restore testing
  2. New backup client requests
  3. Restore requests

If your environment is at Google scale you may want to automate crazy things like purchasing, receipt and labelling of new backup media.  This is an extreme case but you get the principle, break down the tasks done in the backup process and see what you can get machines to do better and more accurately than humans.

There are plenty of people that have already done all this and many products to look at for help. Start Googling…

Is that it? – No, we will return with other methods to help back up at scale


Beginners Guide to Data Protection Strategy – Collectors Edition



Last summer, while recovering from a knee op, I unburdened myself in a series of blog posts about the basics of a data protection strategy. Lucky for you, the box set has just been released! Here it is!


Beginners Guide to Data Protection Strategy – Part 1

Beginners Guide to Data Protection Strategy – Part 2

Beginners Guide to Data Protection Strategy – Part 3

Beginners Guide to Data Protection Strategy – Part 4

Beginners Guide to Data Protection Strategy – Part 5

Beginners Guide to Data Protection Strategy – The End

When I say box set, it is basically a lazy blog post; stop judging me…