Mirror a site for offline use

More than once in my life, I’ve found myself in an airport or train station, or on a plane, boat, or long car ride, with copious amounts of spare time and no internet access.

Last month, while integrating a payment processor’s API, strapped to my seat thirty thousand feet over Turkmenistan, I realized that the solution I’d discovered four hours earlier was worth stashing away as an easy reference.

All that offline downtime could have been better spent on several languishing pet projects, if only I’d remembered to download the relevant API documentation beforehand.

And not just a page or two, but every page under a URI subdirectory, fetched recursively, with images and links converted for offline viewing, so there’s no terrible mid-flight discovery that you’re blocked because you forgot to download one section.

The best solution I’ve found for this problem is wget’s mirroring mode. I’m happy to report that it made working on a long flight across the world much more productive, giving me time on the ground to spend on a long-awaited reunion with friends instead of having to huddle near a WiFi hotspot to wrap up a project.

The example below mirrors a subsection of a site for offline viewing.

How to use

# First, instruct wget to ignore the robots.txt instructions
# on the domain(s) we're mirroring.
#
# This violates the publisher's preferences, so please ensure
# that you're not overloading a server or otherwise abusing
# the publisher's work.
echo "robots = off" > ~/.wgetrc
# Download an entire URI subsection for offline viewing.
# In this case, this means everything under /docs/ruby
# that is recursively navigable from this starting point.
wget -m -k -np -p -E https://www.braintreepayments.com/docs/ruby
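
Since turning robots off sidesteps the publisher’s stated preferences, it’s worth being gentle with their server. wget’s --wait and --limit-rate options throttle the crawl; the values below are arbitrary examples, not recommendations:

# Pause one second between requests and cap bandwidth at 200 KB/s so the
# mirror doesn't hammer the server.
wget -m -k -np -p -E --wait=1 --limit-rate=200k https://www.braintreepayments.com/docs/ruby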

Options

-m    # Mirror the site: turn on recursive scraping with infinite depth and timestamping
-k    # Convert links in downloaded pages to point at the local copies for offline viewing
-np   # Never ascend to the parent directory, scoping the crawl to this URI and below
-p    # Also download page requisites like images, JavaScript, and stylesheets
-E    # Append an .html extension to files served with a text/html MIME type
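
When the mirror finishes, everything lands under a directory named after the host, and the converted links mean the pages open directly in a browser with no web server required. The exact entry-point filename depends on how the server named its pages, so list the directory to find it:

# The mirror lands under a directory named after the host. With -E the
# starting page here will typically be saved as ruby.html; open that
# file in any browser.
ls www.braintreepayments.com/docs/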