
Mirroring Websites with wget, curl and/or tar

From time to time, it is a good thing to back up your entire site onto a different computer/server, even if your entire site is database-driven.

Take my situation two weeks ago, when my former host provider shut down the RBC site because of a false spam report. Even though no spam had actually been sent from redlandbaptist.org, my former host’s upstream provider demanded action. In response, my former host shut down the site with no notification, despite terms of use that call for a warning on the first offense, suspension on the second, and so on. After I threatened legal action, the host provider and I found a middle ground: he agreed to put the site back online, and I agreed to move my sites elsewhere over the course of a weekend.

Fortunately, I was already moving sites off this host, which I personally found troublesome. The Redland site was the last to go because there are legacy portions of it that are not entirely data-driven. That is, with my other sites, I merely dumped the MySQL database off the old machine, installed a fresh version of MovableType on the new site, piped in the data, hit rebuild and voilà!
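
For the database half of that dance, the commands boil down to something like this; a minimal sketch, with the user names, database name and file name all hypothetical:

# on the old host: dump the weblog’s database to a file
mysqldump -uolduser -p movabletype_db > mt_backup.sql
# on the new host: create an empty database, then pipe the dump back in
mysql -unewuser -p movabletype_db < mt_backup.sql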

While there are several PC-based options for mirroring a site, I wanted something that would take files from one Linux server and move them to another. I had three choices. First, there is wget, a nice little GNU tool for offline reading and site mirroring. As Jim Roberts writes in his article “Mirroring Websites with wget“, the syntax is insanely simple:

wget --mirror -w 3 -p -P c:\sokkit\site\rbc ftp://username:password@ftp.redlandbaptist.org
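
For the curious, the GNU wget manual says --mirror is currently shorthand for recursion plus time-stamping at infinite depth, so the same command can be spelled out long-hand; -w 3 politely waits three seconds between requests, -p grabs page requisites such as images, and -P sets the local download directory:

wget -r -N -l inf --no-remove-listing -w 3 -p -P c:\sokkit\site\rbc ftp://username:password@ftp.redlandbaptist.org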

My only problem with this approach is that much of the legacy stuff at RBC was image-related and/or offline because it is seasonal. And though I employed the “-p” option, not all images made the cut, nor did any of our offline archives, for obvious reasons. So another solution, at least where images are concerned, would be to use a Perl program that employs another command-line download tool known simply as curl:

perl curlmirror.pl -p -s 800 -o rbc -t /home/backupsite/rbc http://www.redlandbaptist.org

Unfortunately, since I have several files with a .php extension, I get the same filename mangling that I would with wget’s offline reader syntax:

wget --mirror -w 2 -p --html-extension --convert-links -P c:\sokkit\site\rbc http://www.redlandbaptist.org

The above syntax converts URLs like http://www.redlandbaptist.org/index.php?sid=123 into http://www.redlandbaptist.org/index.phpsid_123.html. Great for viewing, not so great as a working backup.

There was another way, one that ensured I got all my files, all the correct paths, and all the correct file ownerships and permissions. Unfortunately, this method required shell access, and though my former host provider was kind enough to put the site back online, I seriously doubt he would have honored any request to restore ssh access. So I cheated: I downloaded a ‘modified copy’ of the Gamma Web Shell.

‘Modified copy?’ The Gamma Web Shell allows an individual to execute shell commands directly from their browser, so you can imagine the security implications of installing such a program. So on my local PC, I first changed the password to something huge and random, limited the commands allowed, and then changed the code so the file could be renamed from WebShell.cgi to something hard to guess.
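
If you want a feel for that prep work, here is a minimal sketch; the openssl call and the renamed file name are my own illustrations, not anything prescribed by Gamma Web Shell:

# generate a huge random password to paste into the script’s configuration
openssl rand -base64 32
# rename the script to something unguessable before uploading
mv WebShell.cgi s8kq27xj.cgi

Once modified, I ftp’d it to the old Redland site, and then entered the following commands: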

mysqldump -uUSRNAME -pPASSWRD --opt mydatabase > databasedump.sql

tar -zcvf shebang.tar.gz *
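
One caveat with the wildcard: a bare * skips dot-files such as .htaccess. A safer sketch, assuming the site lives in a directory named public_html, archives from one level up:

cd .. && tar -zcvf shebang.tar.gz public_html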

Please be warned: if you go this route, you are putting your site at great risk. Don’t blame me if you get hacked. You have been warned. In fact, I only did it because my back was against the wall. That said, immediately after executing the above backup commands, I deleted the shell program, then uploaded a text file full of Lorem Ipsum under the same name to make sure it couldn’t be reconstituted from the trash.

After these security precautions, I FTP’d the backup file to the new host, invoked the command “tar -zxvf shebang.tar.gz” and was back in business almost instantly. I also FTP’d the tar.gz backup file to my home PC and ‘burninated‘ a CD as a ‘suspenders and belt‘ precaution.
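
For completeness, the restore on the new host was essentially the mirror image of the backup; a sketch, again with the user name, password and database name as placeholders:

tar -zxvf shebang.tar.gz
mysql -uUSRNAME -pPASSWRD mydatabase < databasedump.sql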

So how about you? What’s your method of mirroring and/or moving sites? Remember what I said this past August: “if you fail to plan, then you’re planning to fail.”
