tools and resources | David's Tech Blog

Archiving a site using httrack

November 1, 2009 David 2 comments

Recently someone asked me to help them archive a site. They had lots of personal stories, pictures, scanned copies of kids’ schoolwork, and so on that they wanted to preserve, but not necessarily on an active web site. Basically they wanted an electronic scrapbook they could keep for the future. The site was using Drupal, and almost all of it was protected by passwords using a Drupal style posted form for logins. The site was also not using rewrites on the URLs, so pages looked like index.php?q=node/125 and index.php?q=logout.

My first thought was to just wget the site using the mirror and preserve cookies options. The main problem with that approach is that one of the early links that wget followed was the Logout link on the main menu, so I’d get three or four pages into the site and then just get a bunch of “403 Forbidden” messages. wget’s exclusion arguments didn’t help, because they aren’t usable on the query string, and that’s where the logout part of the link was located.

Fortunately, httrack did the trick. At first I had decided not to use httrack because the help says that to work with sites that have logins you need to go through a process that involves setting your browser’s proxy settings, and I found the interface annoying. Those complaints are not a huge deal, but I didn’t really want to bother. As it worked out, I didn’t need to.

I got what I needed using the command line version of httrack, and I didn’t use the proxy workaround at all. I logged into the web site in Firefox, copied the session ID from the cookie in Firefox, and put it in the cookies.txt file for httrack to use. The cookies.txt file is documented here: http://httrack.kauler.com/help/Cookies
It’s the same layout as wget uses for the same file, so I was actually able to use the file wget had created when I tried using it, and all I had to do was change the session ID.

The line in the cookies.txt file looked like this:

example.com FALSE / FALSE 1999999999 PHPSESSID 1f85edbfc2db8e20af20489f7fb7b417

Obviously the session ID for each session is going to be different. Although I used the file I already had, it would be easy to create this file by hand.
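
As a rough sketch of creating the file by hand, the following writes a single-entry cookies.txt with the fields separated by tabs, which is what the Netscape cookie layout expects. The session ID is just the placeholder from the line above, and writing the file to the current directory is an assumption; put it wherever httrack looks for it in your setup.

```shell
#!/bin/sh
# Placeholder session ID; copy the real PHPSESSID value from your browser's cookie.
SESSION_ID="1f85edbfc2db8e20af20489f7fb7b417"

# Fields, in order: domain, include-subdomains flag, path, secure flag,
# expiry (Unix time), cookie name, cookie value.
# printf '\t' emits real tab characters between the fields.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "example.com" "FALSE" "/" "FALSE" "1999999999" "PHPSESSID" "$SESSION_ID" \
    > cookies.txt
```

The expiry of 1999999999 is simply a date far enough in the future that the file does not treat the cookie as expired; the server still decides when the session itself ends.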

Then I ran httrack, omitting some of the links, and it grabbed the site nicely.

httrack "http://example.com" -O "./example.com" -v '-*logout*' '-*edit*' '-*user*'

That tells httrack to fetch http://example.com, place the resulting files at ./example.com, be verbose in its output, and omit any URLs that include logout, edit, or user in them. What this does better than wget is that it will omit any URL that matches the exclusions, even if the match is in the query string. wget only lets you exclude based on the directory, domain, or file name.
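
To make the effect of those three exclusions concrete, here is a small illustration that uses the shell's own glob matching as a stand-in for httrack's filter matching. It is not httrack's actual matcher, and the node/125/edit and user/3 URLs are hypothetical examples of what a Drupal site without URL rewrites might link to.

```shell
#!/bin/sh
# Illustration only: shell glob patterns approximate the '-*logout*' style filters.
is_excluded() {
    case "$1" in
        *logout*|*edit*|*user*) return 0 ;;  # matches anywhere, query string included
        *) return 1 ;;
    esac
}

for url in \
    "http://example.com/index.php?q=node/125" \
    "http://example.com/index.php?q=logout" \
    "http://example.com/index.php?q=node/125/edit" \
    "http://example.com/index.php?q=user/3"
do
    if is_excluded "$url"; then
        echo "skip  $url"
    else
        echo "fetch $url"
    fi
done
```

Only the first URL would be fetched in this sketch. Note that patterns this broad also drop any content page whose URL happens to contain one of those words, which was acceptable for my purposes.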

Overall, I found httrack did a very good job. The naming of the resulting files was cleaner than what wget produces, at least for my purposes. My only complaint about httrack was that although there’s plenty of documentation (if you count the information on httrack.kauler.com), it was hard to find what I needed.


Categories: tools and resources

reCAPTCHA

February 10, 2008 David Leave a comment

If you are looking for a way to prevent post/comment spam, account request spam, or to obscure email addresses, one of the better options at the moment appears to be reCAPTCHA.

There are several reasons to choose reCAPTCHA, some of which are outlined on the web site. Aside from having this handled by people who know what they are doing and do it pretty well, one of the key benefits I see of using this particular system is that the text is fairly easy for a human to read. Recently, on another site that used some other captcha system, it took me three or four tries every time I had to solve the “puzzle.” reCAPTCHA doesn’t have that particularly annoying problem. You want to keep the spammers out, but you also want to avoid being so annoying to legitimate posters that they find some easier place to post.

Categories: tools and resources
