Automatically Generating Anki Vocabulary from Books
The tl;dr of it all is this is harder than you’d think, not for the reasons you would think, and you’re better off just hiring a tutor to read alongside.
This living article documents some tools and methods and my own experiences. The idea that inspired this was that I have all these books in some of my weaker (natural) languages that I’m trying to read but oh, the vocab, it slows me so! If only there were a better way… What if I scanned the pages, ripped the text from the scanned images, tokenized and standardized the text to remove conjugations and declensions, and finally created Anki cards with translations? That would be awesome right, and I bet super easy! Well, just you wait and see.
Set up
You don’t need to use Python. In fact, I highly discourage it. But as you’ll see, it does have all the software you need, whereas some other high level languages I considered were occasionally wanting.
Several of the requirements depend on what data sources you're working with, and some may be unnecessary depending on what you're doing. Really, if you have your data and want to produce cards, all you need is Python and genanki.
If you need to work with web data, for example for fetching translations or even whole texts from Project Gutenberg, then you will want requests.
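As a minimal sketch, here's roughly what fetching a plain-text book from Project Gutenberg looks like; the URL is illustrative (it points at what I believe is the Spanish Don Quijote used later), so check the site for the actual file you want.

import requests

# Illustrative URL; Project Gutenberg hosts plain-text editions like this one.
response = requests.get('https://www.gutenberg.org/files/2000/2000-0.txt')
response.raise_for_status()
response.encoding = 'utf-8'
text_of_don_quijote = response.text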
If you are going to be working with images, either from screenshots, photos, or scanned documents, you’ll want Tesseract, and having the Python libraries makes it easier to get it all into NLTK, but you can just use the CLI tool and read the resulting text documents.
If you’re planning on working with audio recordings from podcasts, songs or movies, then you’ll want SpeechRecognition, and depending on what service you plan to use and how often you plan to use it, you may very well like to get an API key. You also may need some additional software for converting between audio formats as SpeechRecognition is partial to the FLAC format.
Converting audio clips
This may be one of the easiest parts.
import speech_recognition

recognizer = speech_recognition.Recognizer()
with speech_recognition.AudioFile(path_to_audio_file) as source:
    audio = recognizer.record(source)
# recognize_google expects BCP-47 style codes, e.g. 'es-US'
working_text = recognizer.recognize_google(audio, language='es-US')
There are other recognizers besides recognize_google that can be used. They're also pretty good even without being told the language, but it's better to give the hint. My first test with this was a clip of myself saying "todo lo que quiero" which, when told to expect Spanish, was transcribed exactly, and when not, came out as "totally quiero".
Some of these also require API keys; if you plan to use one regularly, you should definitely get your own key instead of relying on the library defaults.
Converting screenshots/scans/photographs
This is only slightly more difficult than audio clips. The code itself is actually very simple, but collecting all the dependencies can be annoying.
from PIL import Image
import pytesseract
# tesseract uses some unusual language names so best to check first
pytesseract.get_languages(config='')
working_text = pytesseract.image_to_string(Image.open(path_to_image), lang=lang_name)
I highly recommend setting the language. Tesseract doesn’t seem quite so smart as SpeechRecognition. It will also help if your images are straight, especially when dealing with vertical text. By rotating an image 6 degrees in GIMP, I went from gibberish to a 95% accurate conversion.
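If you'd rather not round-trip through GIMP, PIL can do the straightening too, assuming you already know (or can guess) the skew angle:

from PIL import Image

# positive angles rotate counter-clockwise in PIL; expand keeps the corners,
# and the white fill avoids handing tesseract black wedges to chew on
image = Image.open(path_to_image).rotate(6, expand=True, fillcolor='white')
working_text = pytesseract.image_to_string(image, lang=lang_name)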
Creating Anki cards
This is what we will use genanki for, and you can see more documentation at their site. I’m going to use the JP data from the “Caveats” section below to generate a set of Kanji cards.
import genanki

note_model = genanki.Model(
    some_guid,
    'Kanji with readings and meanings',
    fields=[
        {'name': 'Kanji'},
        {'name': 'On readings'},
        {'name': 'Kun readings'},
        {'name': 'Meanings'},
    ],
    templates=[
        {
            'name': 'Kanji card',
            'qfmt': '{{Kanji}}',
            'afmt': '{{FrontSide}}<hr id="answer" />{{On readings}}<br />{{Kun readings}}<br />{{Meanings}}',
        },
    ])

deck = genanki.Deck(another_guid, 'Kanji from Bach biography')

# results comes from the kanjiapi.dev requests in the "Japanese consolation prize" section
for kanji in [r.json() for r in results if r.status_code == 200]:
    deck.add_note(
        genanki.Note(
            model=note_model,
            fields=[kanji['kanji'],
                    ", ".join(kanji['on_readings']),
                    ", ".join(kanji['kun_readings']),
                    ", ".join(kanji['meanings'])],
        ))

genanki.Package(deck).write_to_file('bach_kanji.apkg')
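A note on some_guid and another_guid: genanki wants unique integer IDs that stay stable between runs, so the approach its documentation suggests (as far as I recall) is to generate them once and hard-code the results:

import random

# run once, then paste the printed numbers in as your model and deck IDs so that
# re-running the script updates the same deck instead of creating a new one
print(random.randrange(1 << 30, 1 << 31))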
Transforming text into something useful for studying
It's good to use nltk.text.Text for exploration; it can help you identify what can be done with the text and what would be worth doing. Concordances help with context and example sentences. Collocations are great for identifying word pairs, which I've found to be especially useful for metonyms and compound kanji.
import nltk
import functools

# set up
stopwords = nltk.corpus.stopwords.words(language_in_question)
token_tests = [lambda t: t.isalpha(), lambda t: t not in stopwords]
check_token = lambda t: functools.reduce(lambda accumulator, test_case: accumulator and test_case(t), token_tests, True)
tokens_processor = lambda text: [t for t in nltk.word_tokenize(text.lower()) if check_token(t)]

# processing
tokens = tokens_processor(working_text)
# NLTK stemmers work a word at a time, e.g. nltk.stem.SnowballStemmer('spanish').stem(...)
stemmed_tokens = [some_initialized_stemmer.stem(t) for t in tokens]
text = nltk.text.Text(tokens)
stemmed_text = nltk.text.Text(stemmed_tokens)
# collocations() only prints; collocation_list() actually returns the pairs
collocations = text.collocation_list()
distribution = text.vocab()
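From there you can poke at the results interactively; a couple of things worth looking at (the word below is just a placeholder):

# the most frequent surviving tokens, i.e. candidate vocabulary
print(distribution.most_common(50))

# example sentences for a word you're considering making a card for
text.concordance('some_word', width=100, lines=10)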
Now you just need to decide what words you want. This is where the actual hard part comes in; as you may have noticed, getting images or audio to text is very easy these days. Even generating the cards, as described above, is quite simple, albeit tedious: lots of code and setup, including some design work, but not really much thinking. Here, with nltk, is where you decide what you want, how you want to structure it, and finally how to produce it.
I still haven’t figured it out entirely for my own endeavors, but I do hope to keep fleshing this out and offer some suggestions.
Don Quijote de la Mancha
For working with the Project Gutenberg text of the Spanish version of Don Quixote, I split the book into chapters to prepare flashcards for my upcoming readings:
import regex
import spacy

nlp = spacy.load('es_core_news_md')

# split the book into chapters; the prologue and preamble end up in chapters[0]
split_chapters = regex.compile(r'Capítulo \w+\. ')
chapters = split_chapters.split(text_of_don_quijote)

# work with chapter 1 data
chapter1text = nltk.text.Text(nltk.word_tokenize(chapters[1].lower()))
chapter1tags = nlp(chapters[1])

# get all adjectives and pair them with their concordance lists - a bit lossily
adjectives = [t for t in chapter1tags if t.pos_ == 'ADJ']
adjective_concordances = {t.lemma_: [c.line for c in chapter1text.concordance_list(t.norm_)] for t in adjectives}
# this data can be looped through like the kanji above to create cards
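For completeness, a sketch of that loop, reusing the genanki pattern from the Kanji deck above; the model and deck here (lemma on the front, concordance lines on the back) are placeholders I made up, IDs included:

adjective_model = genanki.Model(
    yet_another_guid,
    'Adjective with examples',
    fields=[{'name': 'Lemma'}, {'name': 'Examples'}],
    templates=[{
        'name': 'Adjective card',
        'qfmt': '{{Lemma}}',
        'afmt': '{{FrontSide}}<hr id="answer" />{{Examples}}',
    }])
adjective_deck = genanki.Deck(and_another_guid, 'Adjectives from Don Quijote, chapter 1')
for lemma, lines in adjective_concordances.items():
    adjective_deck.add_note(genanki.Note(
        model=adjective_model,
        fields=[lemma, '<br />'.join(lines)]))
genanki.Package(adjective_deck).write_to_file('quijote_ch1_adjectives.apkg')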
Still working on an optimal method for automatically generating notes for my efforts, but right now my workflow is to print out lists of certain parts of speech to the terminal and pick out some unfamiliar ones to manually make flash cards for.
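As a rough illustration of that workflow, something like this prints a frequency-ranked list of verb lemmas from the spaCy doc above to skim through (the part of speech and the cutoff are arbitrary choices):

from collections import Counter

verb_lemmas = Counter(t.lemma_ for t in chapter1tags if t.pos_ == 'VERB' and t.is_alpha)
# skim the top of the list and note anything unfamiliar
for lemma, count in verb_lemmas.most_common(40):
    print(f'{count:4d}  {lemma}')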
Caveats
Japanese is real hard. It turns out tokenization technology is based heavily on whitespace, of which there is none in Japanese. You'll have to investigate a different tokenizer, say MeCab, perhaps via fugashi, but if you dare try this on Apple Silicon, at least as of the most recent update to this article, you will fail.
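For what it's worth, the fugashi usage itself is pleasantly small once MeCab and a dictionary are available; a sketch, with some_japanese_text standing in for whatever you pulled out of tesseract:

from fugashi import Tagger

# Tagger needs a MeCab dictionary installed, e.g. the pip-installable unidic-lite
tagger = Tagger()
# each word object also carries feature data (lemma, part of speech, readings)
jp_tokens = [word.surface for word in tagger(some_japanese_text)]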
Japanese consolation prize
If your minimal aim is to study Kanji important in the target text, one option is to use a regex and just split all the characters, like so:
import regex

a_jp_string = "1685年3月21日,ドイツ中部の町、アイゼナハの聖ゲオルク教会でひとりの男の子が洗礼を受けた。"

# strip everything that isn't a Han character, then collect what's left
non_kanji = regex.compile(r'[^\p{Han}]+')
unique_kanji = set(non_kanji.sub("", a_jp_string))

# access an API for Kanji information:
import requests
results = [requests.get(f'https://kanjiapi.dev/v1/kanji/{k}') for k in unique_kanji]
Check out the documentation for this API to see how you might work with it. It's a nice API and you could, for example, use it to filter by grade (so you only study Kanji of a particular level), or search the text for compounds returned by the API to effect a sort of poor man's tokenization.
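For instance, a sketch of the grade filter, assuming the results list of responses collected above; the cutoff is arbitrary, and the exact meaning of the grade field is spelled out in the API's documentation:

kanji_data = [r.json() for r in results if r.status_code == 200]

# keep only kanji the API assigns to an early school grade; ungraded kanji have no grade value
early_grade_kanji = [k for k in kanji_data if k.get('grade') and k['grade'] <= 4]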
If importing Japanese text via pytesseract, it will be necessary to strip whitespace:
cleaned_text = "".join(imported_text.split())
Some missing parts
In addition to “what words are worth creating cards for,” another question is, unfortunately, how to edit decks rather than continually making new ones. Right now, genanki only supports writing, not reading/modifying, but there may be a fork or impending PR that addresses that.
I haven't dug into translation APIs thus far either, so having a German word and fetching a translation or dictionary entry for it is still something I have to research, but I don't think this is a hard problem; I'm absolutely confident the resources are out there.
LLM
I've been working on getting ChatGPT to generate vocabulary. Presenting it with a topic, e.g. "Give me some good don quijote vocab for reading it in spanish", generally results in a facile set that is predominantly character names. Providing it with a small quote results in better translations than I've been able to find elsewhere (better in the sense of more helpful, parsed less greedily because of context, etc.), but in general ChatGPT has spat out all the words as vocab in these cases. I'm currently formulating further experiments.