In this article I will go over the details of a recent project involving extracting data from a PDF Swahili-English dictionary.
Before we get started, it is important to understand that the PDF format
differs significantly from ePUB and HTML. The PDF format is designed to
enable the exact reproduction of the source material's content and layout
regardless of the final output medium (screen or paper). Unlike HTML files
where layout is left largely to whichever program is displaying the file, PDF
files describe exactly where on the screen (or page) all text and images
should be rendered.
There are of course good reasons for this, but it can make extracting data
harder depending on choices made by the PDF's creator. We can't conveniently
look for some tag as in HTML. It's not a hopeless situation though as there
are some techniques we can make use of. Primarily we'll be relying on
identifying content and layout patterns shared by the targets of our
extraction.
The first step, unsurprisingly, is having a look at the PDF file. You can open it in your viewer of choice. I favour Zathura but have recently been trying out Sioyek.
Two useful tools for gaining an insight into the underlying structure of a PDF file are the pdfquery Python library and the Java GUI application pdfbox. pdfquery allows you to convert the contents of a PDF file into XML format, making human inspection easier. Additionally, the library allows you to extract specific parts of the PDF file using a JQuery-like interface.
pdfbox, on the other hand, is a general tool for working with PDF files. Some of its functionality includes:
One other useful utility is mutool, which is part of mupdf. It shares a lot of overlap with pdfbox but is CLI only. An example is using it to extract a single page from a PDF:
mutool convert -o test_p100.pdf swahili-english.pdf 100
We can open this file with pdfbox debug:
pdfbox debug test_p100.pdf
The pdfbox feature I've found most useful is being able to find the approximate (x, y) coordinates of various elements. The bottom left of the GUI window shows the position of the mouse cursor using the PDF file's dimensions as a frame of reference.
On the other hand, with pdfquery we can output the XML representation of our test page.
import pdfquery
pdf = pdfquery.PDFQuery('test_p100.pdf')
pdf.load()
pdf.tree.write('test_p100.xml')
We can view the resulting XML as-is, but I use tidy to improve its formatting before piping the results to bat for syntax highlighting.
tidy -xml -indent test_p100.xml | bat -l xml
At this point, if you're lucky, you should be able to identify some attribute of a PDF object that can be used to extract the relevant parts of the PDF. An example of this can be found in this article.
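For illustration, here's a minimal sketch of what such a selector-based extraction could look like with pdfquery; the bounding box coordinates below are made up rather than taken from this dictionary:

import pdfquery

# Hypothetical example: select the text lines that fall inside a bounding box
# on the page. The coordinates are purely illustrative.
pdf = pdfquery.PDFQuery("test_p100.pdf")
pdf.load()
selection = pdf.pq('LTTextLineHorizontal:in_bbox("50, 500, 300, 600")')
print(selection.text())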
Our case requires more effort. We use another library, pypdf, which allows us to extract text using a custom function that controls which parts of a page we want to process.
Our function checks that the text in an element is not empty, builds a dictionary with the text as well as some other useful attributes, and appends it to a list that keeps track of all the text we've found in the PDF file.
from typing import Any

from pypdf import PdfReader

def visitor_body(
    text: str,
    cm: list[float],
    tm: list[float],
    font_dict: dict[str, Any],
    font_size: float,
    page_no: int,
    parts: list[Any],
):
    # skip empty strings; keep the text, its font, its position (the text
    # matrix) and the page it was found on
    if cleaned_text := str.strip(text):
        parts.append(
            {
                "text": cleaned_text,
                "font": font_dict["/BaseFont"],
                "dimensions": tm,
                "page": page_no,
            }
        )
The first 5 arguments are all that are required by the visitor supplied to extract_text, but we use 2 additional arguments: page_no and parts. The former is for keeping track of which page each piece of text is found on, and the latter is the list used to collect all the text pieces. We dump parts into a JSON Lines file to facilitate the later stages of extraction.
import json
from functools import partial

def process_pdf():
    reader = PdfReader("swahili-english.pdf")
    parts = []
    for i, page in enumerate(reader.pages):
        page.extract_text(
            visitor_text=partial(visitor_body, page_no=i, parts=parts)
        )
    with open("data/pdf_parts.jsonl", "w") as fp:
        for part in parts:
            json.dump(part, fp)
            fp.write("\n")
    return parts
Here's an example of how each part appears:
shuf data/pdf_parts.jsonl | head -1 | jq .
{
  "text": "bombo",
  "font": "/Times New Roman,Bold",
  "dimensions": [1.0, 0.0, 0.0, 1.0, 56.64, 595.78],
  "page": 27
}
With the text parts extracted from the PDF file, we can re-open the file in a viewer to take a closer look.
From the above screenshot we can see that some of the data we are interested in extracting includes:
There is more information we'll be trying to extract but the above are the most important.
Based on the formatting of the entries in the file, we can only rely on a few attributes to differentiate between them:
We use the first 2 of the above attributes to group the text parts into potential entries. With the pdf_parts.jsonl we saved earlier, we try to find some commonality that can be used to identify the start of each entry. Using pdfbox we find that each entry begins at an x coordinate roughly between 55 and 60. Loading the parts into DuckDB, we can confirm this with a simple analysis.
create table pdf_parts as
select *
from read_json_auto('data/pdf_parts.jsonl', format = 'newline_delimited');
select dimensions[5] as x_coord, count(*)
from pdf_parts
group by x_coord
order by 2 desc
limit 3;
| x_coord | count_star() |
|---------|--------------|
| 0.0     | 22784        |
| 56.64   | 19033        |
| 60.6    | 1828         |
It's safe to assume that parts with dimensions[5] = 56.64 correspond to individual entries. We use another query to number each part in order to produce indices. We write the indices belonging to the potential entries to a file for use later.
select idx as word_index
from (
select
dimensions, font, row_number() over () as idx
from pdf_parts
)
where dimensions[5] = 56.64 and font like '%Bold';
| word_index |
|------------|
| 1 |
| 6 |
| 14 |
| 20 |
| 33 |
| 37 |
| 41 |
| 48 |
| 54 |
| ... |
Saving the indices can be done easily by changing the mode to csv in DuckDB and setting the output to a file, or equivalently with a COPY statement:
COPY (/* query above */) TO 'data/word_indices.txt' (DELIMITER ',', HEADER false);
These indices are used to group the text parts into entries, on the assumption that all the parts between consecutive indices belong together. So, for example, the text parts from index 1 up to but not including index 6 belong to a single entry.
from operator import sub
from functools import cache

from toolz import curry, flip, keyfilter, pipe

@cache
def read_parts_as_entries():
    with open("data/word_indices.txt", "r") as fp:
        indices = fp.readlines()
    # convert the saved indices to 0-based ints
    indices = pipe(
        indices,
        curry(map, str.strip),
        curry(map, int),
        curry(map, flip(sub, 1)),
        list,
    )
    with open("data/pdf_parts.jsonl", "r") as fp:
        lines = [json.loads(line) for line in fp.readlines()]
    entries = []
    for i, k in enumerate(indices):
        if i == len(indices) - 1:
            entry = lines[k:]
        else:
            entry = lines[k : indices[i + 1]]
        # keep only the "text" and "font" fields of each part
        entries.append(
            pipe(
                entry,
                curry(map, curry(keyfilter, lambda x: x in ["text", "font"])),
                list,
            )
        )
    return entries
Other than reading the indices and pdf parts, we do some processing on the contents of the two files. For the indices, we convert them to int and subtract 1 so they correspond to 0-based list indices. For the parts, we select only the text and font fields. functools.cache is used to prevent the files from being opened and read multiple times while working in the REPL.
To parse the different parts of each entry we'll be relying mostly on the relative order of the elements. My initial approach was to write a bunch of functions for processing each section, calling each as necessary, which turned out to be close to a recursive descent parser. Since each entry has a regular structure, our task can indeed be thought of as writing a parser.
Although this initial approach worked, it quickly became messy because it needed to handle malformed dictionary entries as well as entries with formats that differ from what we've seen above. The main culprit was the need to track state outside each individual function.
Before we move on to the fix, I have to mention that my tendency to favour the functional style of programming ended up being detrimental in this particular case. The requirement of writing pure functions meant that each function had to take additional arguments that were unnecessary for some but still required so they could be passed along in calls between functions. I'm sure there is a clean way to write the code in a functional style, but it was not immediately apparent to me.
The silver bullet ended up being a state machine to control the transitions between the parsing functions. A state machine allows the conditions governing the parsing functions to be separated from the actual parsing. Additionally, it becomes simpler to parse a variety of entry formats, malformed or otherwise.
The methodology I used to design the state machine was to start with a simple entry to determine the relationships between the parsing functions. Afterwards I hooked the parsing functions to the states representing each section and attempted to parse randomly sampled entries, progressively refining the state machine and the relevant parsing functions to their final forms.
The resultant state machine is as follows:
For implementing the state machine I used the python-statemachine library, which is pretty comprehensive. In particular it lets us separate the state machine description from other associated logic.
We define 3 classes:
- DictEntry, which is a dataclass that holds the results after parsing.
- DictEntryMachine, which is the actual state machine.
- DictEntryModel, which contains all the parsing logic, i.e. the code that extracts the data we eventually put in DictEntry.
The DictEntry class is simple:
from typing import Optional
from dataclasses import asdict, dataclass, field

@dataclass
class DictEntry:
    swahili: str = ""  # swahili word
    swahili_alt: Optional[str] = None  # alternate form of the swahili word with same meaning
    part_of_speech: str = ""  # part of speech tag
    plural: Optional[str] = None  # plural form of swahili noun word
    ngeli: Optional[str] = None  # ngeli of noun word if one exists
    english: list[str] = field(default_factory=list)  # english meaning(s) of swahili word
    examples: Optional[list[tuple[str, str]]] = None  # example usages in form (swahili text, english text)
    alternates: Optional[list[tuple[str, str]]] = None  # alternate forms of swahili verb word
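To make the shape of the results concrete, the entry for warsha (which appears again in the final output further down) would look roughly like this:

warsha = DictEntry(
    swahili="warsha",
    part_of_speech="nomino (noun)",
    ngeli="i-zi-",
    english=["workshop."],
    examples=[("warsha ya uandishi", "writers’ workshop")],
)
asdict(warsha)  # ready to be serialised as JSON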
The state machine implementation encodes the design we formulated previously, with the addition that all states except start and stop have a function that is executed when the state is entered (the enter argument). These linked functions are implemented in DictEntryModel and each parses a single part of the entry. Transitions between states sometimes have boolean functions (the cond argument) that serve as guards. The transitions from a single state are tried in order until one whose conditions all pass, or one without any guards, is reached.
from statemachine import State, StateMachine

class DictEntryMachine(StateMachine):
    start = State(initial=True)
    swahili = State(enter="extract_swahili")
    extra = State(enter="extract_extra")
    pos = State(enter="extract_pos")
    ngeli = State(enter="extract_ngeli")
    plural = State(enter="extract_plural")
    english = State(enter="extract_english")
    examples = State(enter="extract_example")
    alternates = State(enter="extract_alternates")
    stop = State(final=True)

    parse = (
        start.to(swahili, cond="head_bold")
        | swahili.to(extra, cond=["has_extra"])
        | swahili.to(pos)
        | extra.to(pos)
        | pos.to(plural, cond=["is_noun", "has_plural"])
        | pos.to(ngeli, cond="is_noun")
        | plural.to(ngeli)
        | pos.to(english)
        | ngeli.to(english)
        | english.to(examples, cond="has_example")
        | english.to(alternates, cond="has_alternates")
        | english.to(english, cond="has_multiple_defs")
        | english.to(stop)
        | examples.to(english, cond="has_multiple_defs")
        | examples.to(alternates, cond="has_alternates")
        | examples.to(stop)
        | alternates.to(stop)
    )
Finally the DictEntryModel class consists primarily of two types of methods:
- Parsing functions (extract_<state>) that are called when the state machine reaches <state>. These functions extract the relevant section of the entry and set the fields of a DictEntry instance.
- Boolean functions (has_<.+>) that serve as guards for transitions between states.
Instead of going over all the methods, I'll highlight an example of each type.
There's an important abstraction/model I chose to use while implementing the parsing functions. You may have already guessed it when we grouped text parts into potential entries. Each entry is treated as a stream of text pieces. This means that traversal is almost always done linearly, and random access is seldom used outside a linear context, e.g. to access the next element based on the index of the current element in a loop.
The reasons behind this are generality and simplicity. Given that the parsing functions are dependent on each other, it is easier to implement each when they all have a unified way of interacting with the common entry. The stream interface is also easier to debug since we know that each function operates starting from the head of the stream. This was essential with the approach I chose of evolving both the parsing functions and state machine based on random samples of the input. The abstraction itself consists of some types and sticking to convention1.
The following are those types as well as some useful utility functions:
from typing import TypeAlias, TypedDict

class TextPiece(TypedDict):
    font: str
    text: str

Stream: TypeAlias = list[TextPiece]

def is_italic(piece: TextPiece) -> bool:
    return piece["font"].endswith("Italic")

def is_bold(piece: TextPiece) -> bool:
    return piece["font"].endswith("Bold")

def is_normal(piece: TextPiece) -> bool:
    return not (is_italic(piece) or is_bold(piece))

def texts(stream: Stream) -> list[str]:
    return [x["text"] for x in stream]

def as_text(stream: Stream) -> str:
    return " ".join(texts(stream))
The parsing functions follow the template below:
1. Make a copy of the stream.
2. Skip over any text pieces that are not relevant to the section being parsed.
3. Check that the assumptions made about the section's structure hold.
4. Collect the text pieces that make up the section.
5. Clean up the collected text and set the corresponding field of the DictEntry object.
6. Advance the stream past the consumed pieces.
I'll be using extract_ngeli to illustrate the specific way each of the above parts is implemented. The numbered comments correspond to the template above.
import re
from copy import deepcopy
from itertools import dropwhile, takewhile

class DictEntryModel:
    def __init__(self, stream: Stream):
        self.stream = stream
        self.text = as_text(stream)
        self.entry: DictEntry = DictEntry()
        self.parse_ok = True
        self.state = None

    ...

    def extract_ngeli(self):
        # 1.
        stream = deepcopy(self.stream)
        # 2.
        stream = list(
            dropwhile(lambda x: re.match(r"\s*-\*s", x["text"]), stream)
        )
        w1 = stream[0]
        # 3.
        if "[" in w1["text"]:
            # 4.
            ngeli_toks = list(takewhile(lambda x: "]" not in x, texts(stream)))
        elif is_italic(w1):
            ngeli_toks = texts(list(takewhile(lambda x: is_italic(x), stream)))
        else:
            self.parse_ok = False
            return
        # 5.
        ngeli = pipe(
            "".join(ngeli_toks),
            curry(re.sub, r"\s+", ""),
            curry(re.sub, r"^[)\]]", ""),
        )
        patt = re.compile(r"\[?[a-z]{1,2}-(?:/[a-z]{2}-)?")
        if m := re.match(patt, ngeli):
            # 5.
            self.entry.ngeli = re.sub(r"\[|/", "", m.string)
        # 6.
        stream = stream[len(ngeli_toks) :]
        self.stream = stream

    ...
The specific parts of each function depend entirely on the structure of the text so the details are not important.
The boolean functions serve to enforce the assumptions we make use of in the parsing functions. If said assumptions do not hold, then running a parsing function is likely to fail, either outright or at a later point in execution.
For example, the swahili word section of an entry is always expected to use bold font and be at the start of the stream, so if that's not the case there's no need to try the associated parsing function.
class DictEntryModel:
    ...

    def head_bold(self):
        return is_bold(self.stream[0])

    ...
The function to parse a single entry involves instantiating the right classes and running the state machine until it stops or an error occurs (indicated by the parsing functions setting self.parse_ok = False in DictEntryModel).
def process(entry):
    model = DictEntryModel(entry)
    machine = DictEntryMachine(model)
    while (
        machine.current_state not in machine.final_states
        and machine.model.parse_ok
    ):
        machine.parse()
    return model
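Used on one of the grouped entries it looks something like this (whether a given entry parses cleanly of course depends on its structure):

entries = read_parts_as_entries()
model = process(entries[0])
if model.parse_ok:
    print(asdict(model.entry))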
As mentioned before, I evolved the state machine and parsing functions using sample entries. I used a custom REPL to try parsing each entry in the sample and inspect the results. A custom listener was attached to the state machine to indicate when transitions occur. The latter is a trick shown in the python-statemachine docs. The REPL requires a slightly modified process function.
import code
import random
from textwrap import dedent

class LogListener(object):
    def __init__(self, name):
        self.name = name

    def after_transition(self, event, source, target):
        print(f"{self.name} after: {source.id}--({event})-->{target.id}")

    def on_enter_state(self, target, event):
        print(f"{self.name} enter: {target.id} from {event}")

def process_debug(entry):
    model = DictEntryModel(entry)
    machine = DictEntryMachine(model, listeners=[LogListener("debug:")])
    while (
        machine.current_state not in machine.final_states
        and machine.model.parse_ok
    ):
        machine.parse()
    return model, machine

def run(sample_size=10):
    print(
        dedent("""
        r - regenerate sample
        n - process next entry
        i - inspect current entry
        w - write current entry to data/test_cases.jsonl
        """)
    )
    entries = read_parts_as_entries()
    sample = random.sample(entries, sample_size)
    next_ = None
    model = None
    sm = None
    while command := input("> "):
        print(f"{len(sample)} of {sample_size} left\n")
        match command[0].lower():
            case "r":
                sample = random.sample(entries, sample_size)
                continue
            case "n":
                if not sample:
                    print("sample done. Use [r] to generate new sample")
                    continue
                next_ = sample.pop()
                print(f"processing:\n\t{as_text(next_)}")
                model, sm = process_debug(next_)
            case "i":
                code.interact(
                    banner="process inspect REPL.\n try:\n\twat / model\n\twat / sm",
                    local=locals() | globals(),
                )
            case "w":
                print("writing example to test_cases\n")
                with open("data/test_cases.jsonl", "a") as fp:
                    s = json.dumps(next_) + "\n"
                    fp.write(s)
    return
The actual 'main' function runs process on each entry, collects some statistics and saves the parsed data, all with a nice progress bar:
from tqdm import tqdm
from rich.pretty import pprint

def main():
    entries = read_parts_as_entries()
    stats = {"succeeded": 0, "failed": 0}
    fp = open("data/swahili-english-dict.jsonl", "w")
    fp2 = open("data/failed_stream.jsonl", "w")
    id_ = 1
    for entry in tqdm(entries):
        try:
            de = process(entry)
            if de.parse_ok and len(de.stream) < 3:
                d = asdict(de.entry)
                d["id"] = id_
                s = json.dumps(d) + "\n"
                fp.write(s)
                stats["succeeded"] += 1
                id_ += 1
            else:
                s = json.dumps(entry) + "\n"
                fp2.write(s)
                stats["failed"] += 1
        except Exception:
            s = json.dumps(entry) + "\n"
            fp2.write(s)
            stats["failed"] += 1
    fp.close()
    fp2.close()
    pprint(stats)
We can see a sample result with:
shuf data/swahili-english-dict.jsonl | head -1 | jq
{
  "swahili": "warsha",
  "swahili_alt": null,
  "part_of_speech": "nomino (noun)",
  "plural": null,
  "ngeli": "i-zi-",
  "english": ["workshop."],
  "examples": [["warsha ya uandishi", "writers’ workshop"]],
  "alternates": null,
  "id": 13023
}
An additional transformation is carried out on the results to clean up the entries for presentation. I'll omit that code here since it's some string munging. It can be seen in the kamusi.py file in this project's repo.
With the entries finally extracted from the PDF file, we can use them through a more convenient interface. One way to do this is to make use of DuckDB's Full-Text Search functionality. We load the dataset, then build an index on the swahili and english fields by running the following SQL statements.
create or replace table kamusi as
select *
from
read_json_auto('data/swahili-english-dict.jsonl', format = 'newline_delimited');
pragma create_fts_index('kamusi', 'id', 'swahili', 'english');
We can search the index as follows:
SELECT
sq.id,
sq.swahili,
ka.swahili_alt,
ka.plural,
ka.part_of_speech,
ka.ngeli,
sq.english as english,
ka.examples as examples,
ka.alternates as alternates,
score
FROM (
SELECT *, fts_main_kamusi.match_bm25(
id,
'workshop',
fields := 'english'
) AS score
FROM kamusi
) sq
INNER JOIN kamusi ka ON ka.id = sq.id
WHERE score IS NOT NULL
ORDER BY score DESC;
The results with .mode line set in DuckDB are:
id = 3991
swahili = karakana
swahili_alt =
plural =
part_of_speech = nomino (noun)
ngeli = i-zi-
english = [workshop.]
examples =
alternates =
score = 4.412508423086409
id = 13023
swahili = warsha
swahili_alt =
plural =
part_of_speech = nomino (noun)
ngeli = i-zi-
english = [workshop.]
examples = [[warsha ya uandishi, writers’ workshop]]
alternates =
score = 4.412508423086409
id = 5293
swahili = kiwanda
swahili_alt =
plural = vi
part_of_speech = nomino (noun)
ngeli = ki-vi-
english = [factory, industry, workshop.]
examples =
alternates =
score = 3.4721397900793924
We re-use this query alongside rich to make a basic terminal UI. I use subprocess to execute the query through the DuckDB CLI rather than the Python library, mainly due to laziness.
Different DuckDB versions require the extensions to be installed separately, so rather than having to re-install extensions each time the DuckDB library is updated within an environment, I call the DuckDB CLI, which is guaranteed to have everything set up correctly even after updates, auto-magically.2
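To give an idea of the approach, here is a minimal sketch rather than the actual CLI code. It assumes the earlier search query is saved in a search.sql file with a $term placeholder, that the data lives in a kamusi.db database file, and that a recent DuckDB CLI (one that supports the -json and -c flags) is on the PATH.

import json
import subprocess

from rich.console import Console
from rich.table import Table

def search(term: str) -> list[dict]:
    # run the FTS query through the DuckDB CLI and read the rows back as JSON;
    # naive string substitution of the term keeps the sketch short
    query = open("search.sql").read().replace("$term", term)
    result = subprocess.run(
        ["duckdb", "-json", "kamusi.db", "-c", query],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout or "[]")

def show(rows: list[dict]) -> None:
    columns = ["swahili", "part_of_speech", "ngeli", "english"]
    table = Table(title="kamusi")
    for col in columns:
        table.add_column(col)
    for row in rows:
        table.add_row(*(str(row.get(col, "")) for col in columns))
    Console().print(table)

show(search("workshop"))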
I'll skip going over the code for the CLI since it's still subject to change. Here are the results instead:
1. I use types in Python as additional documentation and seldom do type checking beyond that of the language server attached to the editor.
2. The magic is a pacman hook that runs a script that installs the extensions when DuckDB is updated.