Build your own “Mini Wayback Machine”

The “Wayback Machine” is one of the more important services in the history of the Internet (and happens to be named after a great gag on the old Rocky & Bullwinkle show). From about 2001 through 2005 it could be counted on to give you a reasonable snapshot of many of the home pages that existed on the Web, as archived by the old Alexa web crawler going back as early as 1996. Likely due to legal complaints or expensive maintenance costs, the regular snapshots petered out around 2005.

I do enjoy going back and revisiting projects I used to work on (and goofy hacks!) like:

  • the first corporate website I built and managed in 1996 for Liberty Check Printers in St. Paul, Minnesota
  • the website I managed while living at Sanctuary Arts, where I would automatically insert an animated snowstorm as the background image during a winter weather warning
  • my circa-2005 excitement that I might soon boot Windows directly into Firefox as my window manager and local desktop shell

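Because I wasn’t able to find another service that replicates the functionality of the Wayback Machine, I decided to write my own routine to create daily snapshots. Here’s a simple shell script that I run nightly by placing it in the /etc/cron.daily directory on my Debian-based Linux distro. One caveat: Debian’s run-parts skips file names that contain a dot, so the copy in /etc/cron.daily needs to be installed without the .sh extension and marked executable: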

wget-figital.sh

#!/bin/sh
# Save tonight's snapshot in a folder named after today's date (YYYY-MM-DD).
wget --directory-prefix="/home/sfitchet/archive/figital.com/$(date +%Y-%m-%d)" \
     -E -H -k -K -p -nd http://www.figital.com

# don't forget to change the --directory-prefix to a location on your
# own system and obviously the URL you'd like to archive ;)
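
Since scripts in /etc/cron.daily are launched by run-parts, which by default only accepts file names made of ASCII letters, digits, underscores, and hyphens, a quick name check can save a silent failure. The following is a minimal sketch of that rule; runparts_name_ok is a hypothetical helper written for this post, not something shipped with cron or run-parts.

```shell
#!/bin/sh
# Sketch: test a file name against run-parts' default naming rule
# (ASCII letters, digits, underscores, and hyphens only).
# runparts_name_ok is a hypothetical helper, not part of cron or run-parts.

runparts_name_ok() {
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;  # empty, or contains a disallowed character
        *) return 0 ;;
    esac
}

for name in wget-figital.sh wget-figital; do
    if runparts_name_ok "$name"; then
        echo "$name: name accepted"
    else
        echo "$name: name rejected"
    fi
done
```

On a Debian-based system, running run-parts --test /etc/cron.daily should list the scripts that would actually be executed, which is a more direct way to confirm the installed copy gets picked up.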

Here’s a quick rundown of what’s happening:

  • every evening wget retrieves a copy of the HTML page at http://www.figital.com
  • inside my home directory a new folder named after today’s date is created to hold all of the files needed to render the archived page in the future, with the help of the following command-line flags:

  Flag  Alias               Description
  -E    --adjust-extension  if the requested file appears to be an HTML document but does not end with an HTML extension (for example, “.asp”), save it with “.html” appended to the name
  -H    --span-hosts        allow downloads from other domains if necessary
  -k    --convert-links     convert links to enable local (offline) viewing
  -K    --backup-converted  save original copies of any edited files using a .orig extension
  -p    --page-requisites   download any supplemental files needed to render the document
  -nd   --no-directories    store all the files in a single directory instead of creating a new directory for each unique hostname
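
For readability, the same command can be spelled out with the long option names from the table. This is only a sketch: ARCHIVE_ROOT, the WGET override, and the archive_page function are placeholders I have introduced for illustration, not part of the original script.

```shell
#!/bin/sh
# Sketch: the nightly snapshot written with wget's long option names.
# ARCHIVE_ROOT, WGET, and archive_page are illustrative placeholders.

ARCHIVE_ROOT="${ARCHIVE_ROOT:-$HOME/archive}"
WGET="${WGET:-wget}"   # set WGET=echo to preview the command without downloading

# archive_page URL SITE: save URL under $ARCHIVE_ROOT/SITE/<today's date>
archive_page() {
    url="$1"
    site="$2"
    "$WGET" \
        --directory-prefix="$ARCHIVE_ROOT/$site/$(date +%Y-%m-%d)" \
        --adjust-extension \
        --span-hosts \
        --convert-links \
        --backup-converted \
        --page-requisites \
        --no-directories \
        "$url"
}

# Example call (commented out so that sourcing this file does not touch the network):
# archive_page http://www.figital.com figital.com
```

Running it with WGET=echo prints the full command line instead of fetching anything, which is a cheap way to check the destination path before the first real run.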

It’s that easy. Be careful not to over-ping the servers you are archiving, and don’t fill up your hard drive through careless use of wget’s recursion flags.
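
One possible way to act on both warnings is to throttle wget and to skip the run once the archive passes a size budget. The sketch below assumes a budget of about 1 GiB, a 2-second wait, and a 200k rate cap; those numbers, along with ARCHIVE_ROOT, MAX_KB, and the function names, are my own placeholders rather than anything from the original script.

```shell
#!/bin/sh
# Sketch: throttle wget and skip the run when the archive is already too large.
# ARCHIVE_ROOT, MAX_KB, and the function names are illustrative placeholders.

ARCHIVE_ROOT="${ARCHIVE_ROOT:-$HOME/archive}"
MAX_KB="${MAX_KB:-1048576}"   # assumed budget of about 1 GiB, in kilobytes
WGET="${WGET:-wget}"          # set WGET=echo to preview the command

# Print the size of a directory in kilobytes (0 if it does not exist yet).
archive_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{print $1}'
    else
        echo 0
    fi
}

# Succeed only while the directory is below the budget.
under_budget() {
    [ "$(archive_size_kb "$1")" -lt "$MAX_KB" ]
}

# polite_snapshot URL DEST: fetch one page with pauses and a bandwidth cap.
polite_snapshot() {
    url="$1"
    dest="$2"
    if ! under_budget "$ARCHIVE_ROOT"; then
        echo "archive over budget, skipping $url" >&2
        return 1
    fi
    "$WGET" --wait=2 --random-wait --limit-rate=200k \
        --directory-prefix="$dest" -E -H -k -K -p -nd "$url"
}
```

The --wait, --random-wait, and --limit-rate options space out requests and cap bandwidth, while the du check keeps a runaway crawl from quietly consuming the disk night after night.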
