
Archiving a site using httrack

Recently someone asked me to help them archive a site. It held lots of personal stories, pictures, scanned copies of kids' school work, and so on that they wanted to preserve, but not necessarily on an active web site. Basically they wanted an electronic scrapbook they could keep for the future. The site was running Drupal, and almost all of it was protected by passwords using a Drupal-style POSTed login form. The site was also not using URL rewrites, so pages looked like index.php?q=node/125 and index.php?q=logout.

My first thought was to just wget the site using the mirror and preserve cookies options.  The main problem with that approach is that one of the early links that wget followed was the Logout link on the main menu, so I’d get three or four pages into the site and then just get a bunch of “403 Forbidden” messages.  wget’s exclusion arguments didn’t help, because they aren’t usable on the query string, and that’s where the logout part of the link was located.
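
For reference, the sort of wget command I mean looks roughly like this (a sketch rather than my exact invocation; cookies.txt here is whatever cookie file wget had saved after logging in):

wget --mirror --convert-links --page-requisites --load-cookies cookies.txt --keep-session-cookies "http://example.com/"

The crawl would start fine, but as soon as wget followed the logout link the session died, and everything after that came back 403.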

Fortunately, httrack did the trick. At first I had decided not to use httrack because the help says that to work with sites that have logins you need to go through a process that involves changing your browser's proxy settings, and I found that interface annoying. Neither complaint is a huge deal, but I didn't really want to bother. As it worked out, I didn't need to.

I got what I needed using the command-line version of httrack, and I didn't use the proxy workaround at all. I logged into the web site in Firefox, copied the session ID from the cookie Firefox had stored, and put it in the cookies.txt file for httrack to use. The cookies.txt file is documented here: http://httrack.kauler.com/help/Cookies
It's the same layout wget uses for its cookie file, so I was actually able to reuse the file wget had created during my earlier attempt, and all I had to do was change the session ID.

The line in the cookies.txt file looked like this:

example.com        FALSE   /       FALSE   1999999999      PHPSESSID       1f85edbfc2db8e20af20489f7fb7b417

Obviously the session ID for each session is going to be different.   Although I used the file I already had, it would be easy to create this file by hand.
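
If you don't have a wget-generated file lying around, a one-liner like this builds an equivalent cookies.txt (a sketch; the fields are tab-separated, and the PHPSESSID value is whatever Firefox shows for your logged-in session):

printf 'example.com\tFALSE\t/\tFALSE\t1999999999\tPHPSESSID\t1f85edbfc2db8e20af20489f7fb7b417\n' > cookies.txt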

Then I ran httrack, omitting some of the links, and it grabbed the site nicely.

httrack "http://example.com" -O "./example.com" -v '-*logout*' '-*edit*' '-*user*'

That tells httrack to fetch http://example.com, place the resulting files in ./example.com, be verbose in its output, and omit any URLs that include logout, edit, or user. What httrack does better than wget here is that it omits any URL that matches the exclusions, even when the match is in the query string; wget only lets you exclude based on the directory, domain, or file name.
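
To make that concrete, with those filters in place URLs like the following are skipped (the paths are hypothetical, but they follow the same index.php?q=... pattern the site used):

http://example.com/index.php?q=logout
http://example.com/index.php?q=node/125/edit
http://example.com/index.php?q=user/1

while a plain content page like http://example.com/index.php?q=node/125 is still fetched, because nothing in the filter list matches it.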

Overall, I found httrack did a very good job. The naming of the resulting files was cleaner than what wget produces, at least for my purposes. My only complaint is that although there's plenty of documentation (if you count the information on httrack.kauler.com), it was hard to find what I needed.
