Archiving on cold storage

As an amateur photographer, I deal with terabytes of data that I would love to archive securely. This post presents a Python script that I wrote to help me with that.

About archives

But what are archives and how are they different from classical backups?

Their purpose is not the same.

A backup is intended to provide a quick means of recovering data that is currently in use or was recently used.
An archive is intended to store a final set of data for a very long period of time. Typically, archives contain data that is no longer actively modified.

About cold storage

Cloud providers around the world offer many storage solutions. The cheapest offers, such as Amazon Glacier or Google Coldline, are often described as cold. But what does that mean?

Cloud providers separate data into different tiers, depending on how often it needs to be accessed. The warmer the data, the faster we can access it, so the provider hosts it on fast and expensive hardware, such as SSDs and RAM caches.

Conversely, cold data won’t be accessed very often, so it can be stored on slow (and cheap) hardware that can even be offline! According to Backblaze:

“That includes data that is no longer in active use and might not be needed for months, years, decades, or maybe ever.”

Perfect for archiving purposes!

But there is a catch…
Accessing cold data can take an eternity, at least by computing standards. Retrieving a file can be a matter of minutes or even hours! Thus, most software used to back up data à la rsync won’t work.

Enter pyArchiver

Archiving on cold storage can seem cumbersome, but it comes with one advantage: it is usually very cheap, which is not negligible when you’ve got many terabytes to store.

I wrote a Python script, pyArchiver, which handles the required “stores” and “restores”. As it cannot query the distant server for the state of the files, it keeps track locally of the files that were archived and their state. It is then easy to add new files to the archived directory. You can even modify some of them afterwards, although that should not happen frequently in an archiving workflow.

Of course, the files can be encrypted, because who trusts a cloud provider nowadays?

To make this work, pyArchiver indexes all the files and tracks their state in a local database. This is the single most important file: it gives you access to the distant trove should you ever need it.

pyArchiver monitors a directory in which you put the files you intend to archive. It works incrementally, meaning that it only uploads new or modified files. However, it does not support file versioning: it restores your archive directory in its latest state.
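
To illustrate the idea of incremental archiving against a local index, here is a minimal sketch. It is not pyArchiver’s actual code: the SQLite table layout, the use of SHA-256 digests to detect changes, and the function names are all assumptions made for this example.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def open_index(db_path) -> sqlite3.Connection:
    """Open (or create) the local index. The table layout is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def files_to_upload(conn: sqlite3.Connection, root: Path) -> list:
    """List relative paths that are new or modified since the last run."""
    known = dict(conn.execute("SELECT path, digest FROM files"))
    pending = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            rel = p.relative_to(root).as_posix()
            # Unknown path or different digest: the file must be (re)sent.
            if known.get(rel) != file_digest(p):
                pending.append(rel)
    return pending


def mark_uploaded(conn: sqlite3.Connection, root: Path, rel: str) -> None:
    """Record a file as archived once its upload has succeeded."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, digest) VALUES (?, ?)",
        (rel, file_digest(root / rel)),
    )
    conn.commit()
```

Because the index only keeps the latest digest per path, a modified file simply replaces its previous entry, which matches the absence of versioning described above.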

In the current release, pyArchiver can target local storage (useful for testing) and OVH’s Cloud Archive via its very limited SFTP frontend, since this is the service I subscribed to. It can also target any server offering SFTP.
pyArchiver uses independent classes to send and retrieve files according to the desired protocol, so it should not prove too difficult to write new ones in order to access Amazon Glacier or Google’s cloud storage.
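
As a rough sketch of what “one class per protocol” can look like, here is a small abstract interface with a local-storage implementation. The class and method names (StorageBackend, LocalBackend, send, retrieve) are hypothetical and chosen for this example; pyArchiver’s real classes may be organised differently.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    """Hypothetical interface that every protocol class would implement."""

    @abstractmethod
    def send(self, local_path: Path, remote_name: str) -> None:
        """Copy a local file to the storage under remote_name."""

    @abstractmethod
    def retrieve(self, remote_name: str, local_path: Path) -> None:
        """Copy a stored file back to local_path."""


class LocalBackend(StorageBackend):
    """Stores files in a local directory, which is handy for testing."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def send(self, local_path, remote_name):
        target = self.root / remote_name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, target)

    def retrieve(self, remote_name, local_path):
        local_path = Path(local_path)
        local_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(self.root / remote_name, local_path)
```

With such a layout, supporting another provider would mean writing one more subclass with the same two methods, while the rest of the program stays unchanged.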

pyArchiver is released under the GPL and is hosted on GitHub.

How to use pyArchiver

First you describe your archive with an ini file.
Then you start the archiving process, which will send the files to the cloud and create a .archive file.
If the transfer was interrupted or if you want to push new files later, you resume the .archive.
And of course you can restore the archive.

List of supported commands

pyArchiver <command> [<options>]

For help on any command, you can type pyArchiver.py <command> --help

init

Initialises an empty ini file to configure your archive.

pyArchiver init <ini_file>

All the relevant instructions are written in the ini_file itself.

start

Starts a new archive, following the instructions from an ini file. It sends your archive to the configured server and produces a .archive file.

pyArchiver start <ini_file>

The .archive file will contain all the information necessary to resume or restore the archive afterwards. The .ini file can be discarded.

PLEASE NOTE that the .archive file contains the password used to cipher the files, plus all the information necessary to connect to your cloud provider. Do not store this file in an unprotected location.
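
One modest precaution, on a POSIX system, is to make the .archive file readable and writable by its owner only. The sketch below is a suggestion rather than something pyArchiver does for you, and the file name in the usage comment is hypothetical. It does not replace keeping the file on an encrypted or otherwise protected volume.

```python
import os
import stat


def restrict_to_owner(path) -> int:
    """Set the file mode to 0o600 and return the resulting permission bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Usage (hypothetical file name):
# restrict_to_owner("photos.archive")
```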

resume

Resumes an archiving process. If your connection was interrupted, it will gracefully send any missing file. It will also send any new file or file that was modified since the last execution of a start or resume command.

pyArchiver resume <archive_file>

restore

Restores an archive. The files will be in the same state they had the last time start or resume was run. Any deleted files will also be restored.

pyArchiver restore <archive_file> <destination>

decipher

Can decipher the encrypted archived files even if the .archive file is not available. Those files have to be manually downloaded from your storage provider before being deciphered.

pyArchiver decrypt <archive_file> <destination> <password>

delete

Deletes the files on the distant storage. Caution, this command cannot be undone.

pyArchiver delete <archive_file>

update

Updates the archive file if it was generated by an older version of pyArchiver.

pyArchiver update <archive_file>

about

Some basic information about the program.

About OVH

OVH is the biggest European provider of cloud services and web hosting. It is very cheap, but its customer service is notoriously bad. Although it has proven reliable for me, if a problem arises you are (mostly) on your own.

As you may have understood, I am not affiliated with OVH in any way 😉