Backing up a Google Account on AWS Glacier Deep Archive

Several years ago, I had a real computer with a real hard drive, and used it as the primary storage of my photos and documents. I supplemented this storage with a cloud backup.

Now, my primary storage is the cloud, and I rely on Google to back up files. But there are a few scenarios where I would still want a backup:

  • I lose access to the Google account (this could happen if the account is mistakenly flagged as abusive or fraudulent).
  • I accidentally delete data from my Google Drive.
  • Malware gains access to my Google account and deletes the files.
  • A software bug at Google accidentally deletes my data (this has happened before).

Backup on AWS Glacier Deep Archive

About once per year, I back up all of my Gmail, Drive, and Photos data to AWS Glacier Deep Archive. To do this, I:

  1. Initiate a Google Takeout request.
  2. Wait a few hours for the takeout files to become available.
  3. Spin up a huge EC2 instance on AWS in a region with cheap storage, like US East (Ohio).
  4. From my laptop, log into Google with Firefox, click to download each of the Takeout files, and then cancel all of the downloads.
  5. For each cancelled download, right-click it and choose “Copy Download Link”.
  6. Paste each link into a shell on the EC2 instance, and use curl or wget to download the file to local disk.
  7. Copy all the files to an S3 bucket in the same region as the EC2 instance.
  8. Set a lifecycle management policy on the S3 bucket to automatically transition all objects older than one day to the Glacier Deep Archive storage class.
  9. Destroy the huge AWS EC2 instance.
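The download-and-upload loop in steps 6 and 7 can be sketched as a small shell script. The bucket name, links file, and destination directory here are placeholders for illustration, not the exact names I use:

```shell
#!/usr/bin/env bash
# Sketch of steps 6-7: download each copied Takeout link on the EC2
# instance, then copy everything to S3. All names are hypothetical.
set -euo pipefail

BUCKET="s3://my-takeout-backup"   # assumption: a bucket in the same region as the instance
LINKS_FILE="takeout-links.txt"    # one "Copy Download Link" URL per line
DEST_DIR="/data/takeout"

download_all() {
  mkdir -p "$DEST_DIR"
  local n=0
  while IFS= read -r url; do
    n=$((n + 1))
    # -L follows redirects; numbered names keep the archive parts in order
    curl -L --retry 3 -o "$DEST_DIR/takeout-part-$n.tgz" "$url"
  done < "$LINKS_FILE"
}

upload_all() {
  # Same-region EC2 -> S3 copy, so no bandwidth charge (see notes below)
  aws s3 cp "$DEST_DIR" "$BUCKET/$(date +%Y)/" --recursive
}
```

Dating the S3 prefix by year keeps each annual backup separate, which makes it easy to delete old ones later.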

Notes:

  • Previously, I would install X and a browser on the VM, then log into my Google account and download the files using the browser. Then I saw the “copy download link” trick from Joseph DeFazio. Thanks!
  • On EC2, network and EBS throughput scale with the instance size, and disk throughput also scales with the size of the volume, so be sure to create a really big instance with a really big disk.
  • If you do this right, there is no bandwidth cost: downloads from Google are free, ingress into the EC2 instance is free, and transfers from an EC2 instance to an S3 bucket in the same region are free.
  • You could skip the EC2 instance entirely: download the Takeout files to your laptop, then push them to the S3 bucket. I don’t do this because my download and upload bandwidth are small enough that the backup would take days, and my laptop doesn’t have enough storage capacity to hold all the files at once.
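The lifecycle policy from step 8 can also be set from the AWS CLI rather than the console. This is a sketch with a hypothetical bucket name, not my exact configuration:

```shell
# Sketch of step 8: transition every object older than one day to the
# Glacier Deep Archive storage class. The bucket name is a placeholder.
BUCKET_NAME="my-takeout-backup"

LIFECYCLE_JSON='{
  "Rules": [
    {
      "ID": "to-deep-archive",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 1, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}'

apply_lifecycle() {
  aws s3api put-bucket-lifecycle-configuration \
    --bucket "$BUCKET_NAME" \
    --lifecycle-configuration "$LIFECYCLE_JSON"
}
```

An empty Filter applies the rule to every object in the bucket, which is what you want here since the bucket exists only for these backups.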

Why Glacier Deep Archive?

  • It’s cheap (< $1/TB/month)
  • It’s not Google, so I may still have access to the AWS account even if I lose the Google account.
  • It’s not Google, so a software bug at Google won’t delete my data in S3.
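To put a number on the “cheap” claim: at the published Deep Archive rate of $0.00099 per GB-month (an assumption based on US East pricing at the time of writing), a terabyte works out to just under a dollar a month:

```shell
# Worked check of the "< $1/TB/month" claim. The per-GB price is an
# assumption (us-east Deep Archive rate when this was written).
price_per_gb=0.00099
monthly_per_tb=$(awk -v p="$price_per_gb" 'BEGIN { printf "%.2f", p * 1000 }')
echo "$monthly_per_tb"   # 0.99
```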