Luna McNulty

Make a Personal Offline Subset of Wikipedia

February 4, 2023

Today’s personal computing devices have at least dozens of gigabytes of storage, and usually much more. When you think about it, that’s pretty amazing: You could probably store your nearest library’s entire book collection on your laptop’s hard drive, maybe even your phone.

But do you actually carry a library collection in your pocket? Probably not. Most people nowadays carry devices that provide an endless supply of information – until they go offline, at which point they become not much more than glorified digital cameras.

Even if you’re rarely offline, you have to admit that this state of affairs is pretty lame. You have a library in your pocket, but there’s nothing to read on the shelves.

How can you fix that?

Enter Kiwix. It’s a program for viewing Zim files, which are compressed collections of web pages that can be accessed offline. You can get the full Project Gutenberg library of public domain books in 75GB. You can get a text-only copy of English Wikipedia in 50GB.

But even if you’re enthusiastic about having offline reading material, 50 or more GB of it might be too much. You can instead get the top 50,000 most important articles in a few hundred megabytes text-only, or a couple GB with images. That should be enough for most people.

But you’re not most people, are you?

I found that the top 50,000 articles were good enough for general topics, but that the depth wasn’t sufficient in particular areas of interest. So I did what anyone would do and researched how to make my own Zim files. What I wanted was the top N Wikipedia articles, plus a large selection of articles in my areas of interest.

I found mwoffliner, which takes a list of pages on a MediaWiki site and produces a Zim file containing those pages. E.g. for a text-only collection:

mwoffliner \
  --mwUrl=https://en.wikipedia.org/ \
  --adminEmail=foo@bar.net \
  --articleList articleList \
  --format nopic

Where articleList is a text file listing Wikipedia articles by the part of the url that comes after https://en.wikipedia.org/wiki/. So for instance:

Wikipedia
Kiwix
Wikimedia_Foundation
…

Note that the pages listed in this file can’t contain url-encodings, so it needs to be Germaine_de_Staël and not Germaine_de_Sta%C3%ABl.
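If a list you’ve pulled from somewhere else does have percent-encoded names in it, urllib.parse.unquote will decode them. A minimal sketch, assuming the encoded names live in a file called encodedList (one per line):

from urllib.parse import unquote

# Decode percent-escapes, e.g. Germaine_de_Sta%C3%ABl -> Germaine_de_Staël
with open('encodedList', encoding='utf-8') as f:
    for line in f:
        print(unquote(line.strip()))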

That’s cool, but how do you get a list of articles you want? I wrote a quick script that queries the Wikipedia API for several pages containing a large number of links I’d be interested in, and outputs those links as a list.

#!/usr/bin/env python3

# get-articles-list.py
# Print the title of every article linked from the Wikipedia pages given as
# command-line arguments, skipping links to non-article namespaces.

from urllib.parse import unquote
from bs4 import BeautifulSoup
import requests
from sys import argv

pages = argv[1:]

# Namespace prefixes we don't want in the article list
SKIP = (
    '/wiki/Category:',
    '/wiki/Wikipedia:',
    '/wiki/File:',
    '/wiki/Template:',
    '/wiki/Special:',
    '/wiki/Portal:',
    '/wiki/User:',
)

encountered = set()

for page in pages:
    try:
        # Fetch the rendered HTML of the page from the Wikipedia API
        response = requests.get(
            'https://en.wikipedia.org/w/api.php',
            params={
                'action': 'parse',
                'prop': 'text',
                'format': 'json',
                'formatversion': '2',
                'page': page,
            },
        )
        text = response.json()['parse']['text']
        soup = BeautifulSoup(text, 'html.parser')
        for a in soup.find_all('a'):
            href = a.get('href')
            if (
                href
                and href.startswith('/wiki/')
                and not href.startswith(SKIP)
                and href not in encountered
            ):
                encountered.add(href)
                # Strip the /wiki/ prefix and decode %-escapes for mwoffliner
                print(unquote(href[len('/wiki/'):]))
    except Exception:
        # Skip pages that fail to download or parse
        pass

There are a lot of Wikipedia pages that are mostly links to other pages. To find ones for your interests, try searching “outline,” “portal,” “history of,” “index of,” and “list of”. For general topics, there are five levels of “vital articles”, with a larger number of articles at each level. I took all of the level 4 articles plus extras in my areas of interest.
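You can also hunt for these link-heavy pages programmatically with the MediaWiki search API. A rough sketch; the intitle: filter and the linguistics query are just example assumptions, swap in your own terms:

# Find link-heavy "Outline of" pages about a topic via the search API.
import requests

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'list': 'search',
        'srsearch': 'intitle:"Outline of" linguistics',  # example query
        'srlimit': '20',
        'format': 'json',
    },
)
for result in response.json()['query']['search']:
    print(result['title'])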

Example usage of get-articles-list.py:

./get-articles-list.py >articleList \
  History_of_computer_science \
  History_of_philosophy \
  Index_of_computing_articles \
  Index_of_language_articles \
  List_of_philosophical_concepts \
  List_of_language_families \
  Outline_of_chess \
  Outline_of_computer_science \
  Outline_of_computing \
  Outline_of_linguistics \
  Portal:Free_and_open-source_software \
  Portal:Visual_arts \
  wikipedia:Vital_articles/Level/4/Arts \
  wikipedia:Vital_articles/Level/4/Biology_and_health_sciences \
  wikipedia:Vital_articles/Level/4/Everyday_life \
  wikipedia:Vital_articles/Level/4/Geography \
  wikipedia:Vital_articles/Level/4/Mathematics \
  wikipedia:Vital_articles/Level/4/People \
  wikipedia:Vital_articles/Level/4/History \
  wikipedia:Vital_articles/Level/4/Philosophy_and_religion \
  wikipedia:Vital_articles/Level/4/Society_and_social_sciences \
  wikipedia:Vital_articles/Level/4/Technology

This will run pretty fast, since it only needs to query the pages you specify. I edited the resulting articleList file to add every Wikipedia article in my browser history/bookmarks and remove some that seemed unlikely to be useful.
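If you keep those extra articles in a separate file, a few lines of Python can do the merge. A small sketch, assuming the extra titles live in a hypothetical file called extras (one per line):

# Merge articleList with a hand-made extras file, dropping duplicates
# while keeping the original order.
seen = set()
merged = []
for path in ('articleList', 'extras'):
    with open(path, encoding='utf-8') as f:
        for line in f:
            title = line.strip()
            if title and title not in seen:
                seen.add(title)
                merged.append(title)

with open('articleList', 'w', encoding='utf-8') as f:
    f.write('\n'.join(merged) + '\n')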

The last step is to run mwoffliner. This can take hours depending on the size of your list.

mwoffliner \
  --mwUrl=https://en.wikipedia.org/ \
  --adminEmail=foo@bar.net \
  --articleList articleList \
  --format nopic

The resulting Zim file will appear in a directory called out. You can load it into Kiwix and finally have, if not a library, a fantastic collection of information available offline.

A bonus I found is that the “random article” button for my offline Wikipedia subset is much more useful than the one on full Wikipedia. The latter usually takes you to something like “Lands administrative divisions of New South Wales”, whereas the former generally results in something of interest to me.