Manu Raghavan

Docs on a Plane!

More than once in my life, I’ve found myself at an airport or train station, or on a plane, boat, or car, with copious amounts of spare time and no internet access.

Last month, while hacking out an API integration with a new payment processor, strapped to my seat thirty thousand feet over Turkmenistan, I realized that the solution I’d discovered four hours earlier was worth sharing with the wider world.

All this offline downtime could have been better spent if I’d had reference material on hand for several languishing pet projects. If only I’d remembered to download the documentation for a particular API beforehand.

And not just a page or two, but every page under a URI subdirectory, fetched recursively, with images and links converted for offline viewing, so you never make the terrible discovery that you’re blocked for hours because you forgot to download one section.

The best solution I’ve found for this problem is wget’s mirroring utility. I’m happy to report that this made working on a long flight across the world much more productive, giving me time on the ground to spend on a long-awaited reunion with friends instead of having to huddle near a WiFi hotspot to wrap up a project.

The example below scrapes the documentation for a payment processor API that I wanted offline reference material for.


Usage

# First, instruct wget to ignore the robots.txt instructions
# on the domain(s) you're scraping from.
# This is generally evil, expressly violating content rights
# in some cases, so please ensure that you're not overloading
# a server or stealing/redistributing paid content.
echo "robots = off" > ~/.wgetrc
# Download an entire URI subsection for offline viewing.
# In this case, this means everything under /docs/ruby
# that is recursively navigable from this starting point.
wget -m -k -np -p -E https://www.braintreepayments.com/docs/ruby
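
By default, wget writes the mirror into a directory named after the host under your current working directory. A quick way to sanity-check the result (the exact file names are a guess; they depend on how the site structures its URLs):

# The mirror lands under a directory named after the host; list what was fetched.
ls -R www.braintreepayments.com/docs/
# Open the entry page locally (file name assumed; with -E, pages end in .html).
open www.braintreepayments.com/docs/ruby.html   # macOS; use xdg-open on Linux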

Options explained:

-m   # Mirror the site: turns on recursion, timestamping, and infinite depth
-k   # Convert links in downloaded pages to point at local files for offline viewing
-np  # Don't ascend to the parent directory; scope the scrape to this URI and lower
-p   # Download page requisites: dependent assets like images, stylesheets, and JavaScript
-E   # Adjust extensions, appending .html where the MIME type is text/html
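
For readability, here is the same invocation spelled out with wget’s long option names (on older wget releases, --adjust-extension was called --html-extension):

# Equivalent command using long-form flags.
wget --mirror --convert-links --no-parent --page-requisites --adjust-extension \
     https://www.braintreepayments.com/docs/ruby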
