Extracting text from EPUB files in Python

09 March 2023

Street scene showing crowd gathered around an open carriage in which a dentist performs a tooth extraction on a patient. Next to the patient a man is banging on a large drum. — Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the SANE (Secure ANalysis Environment) project. The work on this project includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.

So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.

Evaluated tools

I evaluated the following tools:

Tika-python. This is a Python wrapper for Apache Tika (which itself is a Java application). Apache Tika is a toolkit for text and metadata extraction from a wide range of file formats, including EPUB.
Textract. This offers text extraction functionality that is similar to Tika, but unlike Tika, Textract is natively written in Python.
EbookLib. This is a Python library for reading and writing E-books in various formats, including EPUB (both EPUB 2 en EPUB 3). EbookLib is also the E-book library that is used by Textract.
PyMuPDF. This is a Python binding for MuPDF. MuPDF is primarily a PDF library, but it also supports EPUB.

The following table shows the versions of these tools that I used in my tests:

Software	Version
Tika-python	2.6.0
Textract	1.6.5
EbookLib	0.18
PyMuPDF	1.24.11

Test environment and data

For all of my tests I used a simple desktop PC running Linux Mint 20.1 (Ulyssa), MATE edition, with Python 3.8.10.

I used two data sets:

A selection of 15 files in EPUB 2.0.1 format from the KB’s DBNL (Digital Library for Dutch Literature) collection.
A selection of 10 files in EPUB 3.2 format from Standard Ebooks.

All files in both data sets are structurally valid EPUB (2.0.1 / 3.2): validation with EPUBCheck 4.2.6 didn’t result in any reported errors or warnings¹.

Tika-python

After installing Tika-python, as a first test I tried to write a minimal code snippet that extracts the text from one single EPUB, and then writes the result as UTF-8 encoded text to a file. Following Tika-python’s README (the example under “Parser Interface”), I started out with with this:

#! /usr/bin/env python3

import tika
from tika import parser

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

parsed = parser.from_file(fileIn)
content = parsed["content"]

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)

Metadata strings in text output

Inspection of the resulting output file showed a succession of text strings with the names of embedded fonts towards the end of the file. As an example:

Charis SIL Bold Italic

::
::

Charis SIL Small Caps

When I ran Tika (the Java application) directly without using the Tika-python wrapper, results were as expected. A closer inspection of the Tika-python source code showed that Tika-python’s parsing of the Tika output doesn’t quite work the way it should, with the result that extracted metadata is erroneously included in the text output.

Workaround: set service to text

Fortunately there’s a simple workaround for this. In the parser function call, just add the “service” parameter and set its value to “text”, as shown here:

#! /usr/bin/env python3

import tika
from tika import parser

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

parsed = parser.from_file(fileIn, service='text')
content = parsed["content"]

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)

With this change, the font-related text strings were no longer reported.

Image tags and alt-text strings

Unfortunately, setting the “service” parameter in this way has the unexpected side-effect that the text output now includes tags with alt-text descriptions for any images in the file. For example:

[image: cover]

Aster Berkhof

Veel geluk, professor!

[image: DBNL]

Different behaviour between Tika app and TikaServer

I initially thought this was also a bug in Tika-python, but it turns out this isn’t the case. Using the Tika Java application directly:

java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub  > berk011veel01_01-app.txt

This resulted in an output file with no alt-text strings. However, Tika-python doesn’t wrap around Tika-app, but instead around TikaServer. After starting TikaServer, I used the command below to processes the same EPUB:

curl -T berk011veel01_01.epub  http://localhost:9998/tika  --header "Accept: text/plain" > berk011veel01_01-server.txt

The resulting file also included the offending image tags and alt-text strings. So, the Tika application and TikaServer behave differently. After reporting an issue for this, I received a confirmation from Tika’s lead developer:

There’s a subtle difference in the handlers used in tika-app and tika-server. We’re using the “RichTextContentHandler” in server but not in app. I think I’ve known about this for a while, but we’ll be breaking behaviour for whichever one we fix.

I also created a separate issue at Tika-python for the inclusion of metadata in the text output. Unfortunately this issue is closely related to (and partly the result of) the upstream issue in TikaServer. So until that upstream issue is fixed, the current (slightly confusing) situation will most likely persist.

OCR if Tesseract is installed

By default, Tika applies optical character recognition (OCR) to any images in an EPUB if the Tesseract software is installed, and includes the OCR output in the extracted text. In many cases (at least for ours!) this might not be the desired behaviour. I only found out about this weeks after doing the original tests that are described in this post. Re-running some of the tests suddenly resulted in slightly larger output files, with text output that wasn’t originally there. It turns out that the root cause was that I had installed some software that installs Tesseract as a dependency (but I wasn’t aware of this). It’s possible to disable OCR in the Java application and TikaServer using a command-line option that points to a configuration file. I haven’t found a way to do this in Tika-python. The safest option might be to make sure that Teseract is not installed, or to rename Tesseract’s installation folder.

Textract

As with Tika-python, as a first test I again created a minimal code snippet for processing one EPUB file:

#! /usr/bin/env python3

import textract

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

content = textract.process(fileIn, encoding='utf-8').decode()

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)

For the very first EPUB file (from the DBNL collection) this resulted in an empty output file. Results were similar for most other DBNL EPUBs, and Textract only managed to extract a handful of words at most. Results were considerably better for the “Standard Ebooks” files, with output that was similar to Tika-python in most cases. I reported this issue with the developers.

EbookLib

I mainly included EbookLib, because Textract uses it “under the hood” for EPUB, and I was curious if using it directly would give me similar results as Textract. Based on its documentation I created the following minimal code snippet:

#! /usr/bin/env python3

from html.parser import HTMLParser
import ebooklib
from ebooklib import epub

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

book = epub.read_epub(fileIn)
content = ""

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        bodyContent = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(bodyContent)
        content += f.text

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)

Compared to Tika-python and Textract, the EbookLib script is a bit more involved, as EbookLib doesn’t provide any high-level text extraction functions. Instead, the user must iterate over all document items, extract the (X)HTML, and then convert that to unformatted text. At first glance, tests with the DBNL and Standard Ebooks EPUBs didn’t result in any issues, and the results were similar to Tika-python.

PyMuPDF

For PyMuPDF, I created the following minimal code snippet:

#! /usr/bin/env python3
import pymupdf

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

with pymupdf.open(fileIn) as doc:
    content = ""
    noChapters = doc.chapter_count
    # Iterate over chapters
    for i in range(noChapters):
        chapter_page_count = doc.chapter_page_count(i)
        chapter_text = ""
        # Iterate over pages in chapter
        for j in range(chapter_page_count):
            page = doc[(i, j)]
            chapter_text += page.get_text()
        content += chapter_text
        # Add linebreak to mark end of chapter
        content += "\n"

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)

There are a couple of things to note here.

Chapters and pages

First, like EbookLib, we need to explicitly iterate over all chapters in the EPUB. Second, PyMuPDF’s document model is built on pages, which probably reflects its origins as a PDF library. However, the EPUB format doesn’t really have any notion of “pages” at all. Nevertheless, since PyMuPDF’s text extraction function only works at the page level, we still need to iterate over the “pages” of each chapter, even though it’s not entirely clear to me how PyMuPDF defines them. Since by default the chapter texts are simply concatenated, I explicitly added a linebreak to more clearly delineate the end of each chapter.

Wrapping and linebreaks

Unlike the other tools tested here, PyMuPDF wraps the extracted text to a fixed page width, and inserts linebreaks at the wrapping boundaries. As an example, look at the following sentence in the source XHTML, which is encoded as one single line:

<div class="plat">Het was wonderbaar. Pierre ademde diep, en de lucht was zo dun en zo ijl, dat zijn hoofd er duizelig van werd.</div>

In the corresponding PyMuPDF output, the text is split across two separate lines:

Het was wonderbaar. Pierre ademde diep, en de lucht was zo dun
en zo ijl, dat zijn hoofd er duizelig van werd.

Whether this is actually a problem will probably depend on the use case, but it’s good to be aware that this happens.

Finally, for the EPUBs in the KB’s DBNL dataset, PyMuPDF reported multiple instances of the following error:

MuPDF error: syntax error: css syntax error: unexpected token (OEBPS/template.css:1) (   >@<font-face {font-family: "sc...)

The error is related to a style sheet, and doesn’t appear to affect text extraction.

Demonstration scripts

Based on the above minimal code snippets, I created four simple demonstration scripts for Python-tika, Textract, Ebooklib and PyMuPDF. Each of these scripts extracts the text of each EPUB file in a user-defined input directory. The extracted text is then written to a user-defined output directory. Each script also writes a file with word counts for the extraction results, which is useful for a rough comparison of the different tools.

I ran each script twice, using the DBNL and Standard Ebooks data sets as input, respectively.

Word counts

The table below shows the resulting word counts for the books in the DBNL data set:

File name	Words (Tika)	Words (Textract)	Words (EbookLib)	Words (PyMuPDF)
eern001lief01_01.epub	25450	1	25446	25451
spro002mure01_01.epub	50553	0	50549	50554
berk011veel01_01.epub	67978	0	67974	67978
sche034drie01_01.epub	203853	3	203352	203864
jous010supe01_01.epub	202495	0	202491	202494
dele035wegv01_01.epub	76536	0	76530	76530
verv017eerl01_01.epub	33844	0	33840	33855
dhae007euro01_01.epub	394455	2	394400	394879
gomm002uurw01_01.epub	43754	0	43731	43748
gang009lalb01_01.epub	28453	4	28381	28390
geel005bloe01_01.epub	76316	0	76312	76313
hart008droo02_01.epub	77283	0	77279	77282
eede003vand04_01.epub	120481	6	120310	120553
meij031tuss02_01.epub	145678	4	145665	145692
maas013blau01_01.epub	55099	0	55093	55108

Note the extreme (near zero) word counts for Textract. The results for Tika, EbookLib and PyMuPDF are roughly the same.

Running the scripts on the Standard Ebooks EPUBs gave the following result:

File name	Words (Tika)	Words (Textract)	Words (EbookLib)	Words (PyMuPDF)
william-shakespeare_king-lear.epub	28442	18621	28430	28357
david-garnett_lady-into-fox.epub	25240	25223	25228	25208
joseph-conrad_heart-of-darkness.epub	38717	38698	38705	38735
anthony-trollope_the-dukes-children.epub	223014	222995	223002	222712
agatha-christie_the-mysterious-affair-at-styles.epub	57401	57229	57271	57227
edgar-allan-poe_the-narrative-of-arthur-gordon-pym-of-nantucket.epub	71931	71837	71863	71844
p-g-wodehouse_short-fiction.epub	212224	212182	212212	212186
robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub	26370	26345	26358	26291
h-g-wells_the-time-machine.epub	33044	33024	33032	33018
thorstein-veblen_the-theory-of-the-leisure-class.epub	106537	106515	106525	106512

In this case, all four tools resulted in similar word counts. The exception here is the “King Lear” EPUB, which for Textract gave a word count that was about 10 thousand lower than for the other tools. I haven’t looked in detail where this difference is coming from exactly, but it confirms that in its current state, Textract isn’t a suitable tool for our purposes.

Depending on the structure of the source EPUB, the extraction result may or may not contain a table of contents. In EPUB 2, the table of contents is implemented as an XML-formatted “Navigation Control File” (NCX). The NCX was replaced by the “Navigation Document” (which is an XHTML file) in EPUB 3. Neither Tika nor EbookLib extract NCX resources, but both do extract Navigation Documents. Consequently, in most cases the extraction result only includes a table of contents for EPUB 3 files. Both Textract and PyMuPDF extract neither the NCX nor the Navigation Document.

Conclusions

Based on these tests, Tika-python, EbookLib and PyMuPDF all look like potentially suitable Python-based tools for extracting unformatted text from EPUB files. Out of these, Tika-python provides the most straightforward interface. Tika also supports a wide range of other file formats, so any code based on Tika’s text extraction can be easily extended to other formats later.

The inclusion of tags and alt-text descriptions for images in Tika’s output may be a problem though. As an example, imagine a researcher who uses Tika-python to analyse the emergence of certain words or phrases through time using EPUB versions of 19th century books. Any alt-text descriptions in such materials would most likely be contemporary, and as such they would “pollute” the original “signal” (19th century text) with modern language. So, prospective users of Tika-python should carefully review whether this behaviour is acceptable for their use case. The inclusion of optical character recognition output from embedded images in the extraction result can also result in some unexpected surprises, so it’s important that users are aware of Tika’s default behaviour in this regard.

EbookLib doesn’t have these drawbacks, but the absence of a high-level text extraction interface does require some more work on the user’s side. Also, since EbookLib only supports a limited number of Ebook formats, extending any code based on it to other file formats will be less straightforward.

Although PyMuPDF generally looks useful for EPUB text extraction, its built-in text wrapping with the addition of linebreaks might be unwanted for some use cases. PyMuPDF’s page-based document model also makes this library somewhat more involved to use, compared against the other tested tools.

In its current form, Textract is not suitable for our use case.

Limitations

It’s important to highlight the limitations of this analysis. First, it is based on only two small, homogeneous data sets, both of which only contain structurally valid EPUB files. It’s unclear how well these results translate to more heterogeneous collections (which often contain files that violate the format specifications in various ways). Second, the main objective here was to obtain a broad impression of the behaviour of the tested tools. The scope didn’t include an in-depth analysis of the accuracy and completeness of the extraction results. Finally, I didn’t look into the computational performance of the tested tools. As the SANE use case will only involve processing a limited number of files, performance isn’t important here.

Link to demo scripts

EPUB text extraction demo:

https://github.com/KBNLresearch/textExtractDemo

Revision history

21 January 2025: added PyMuPDF analysis.

For convenience I actually used the EPUBCheck Python wrapper: https://github.com/titusz/epubcheck/. ↩

Comments

Post a comment by replying to this post using your ActivityPub (e.g. Mastodon) account.

Extracting text from EPUB files in Python

Evaluated tools

Test environment and data

Tika-python

Metadata strings in text output

Workaround: set service to text

Image tags and alt-text strings

Different behaviour between Tika app and TikaServer

OCR if Tesseract is installed

Textract

EbookLib

PyMuPDF

Chapters and pages

Wrapping and linebreaks

Demonstration scripts

Word counts

Table of Contents

Conclusions

Limitations

Link to demo scripts

Revision history

Comments

About

Search

Tags

Archive

June

May

April

December

November

October

March

June

May

March

February

January

November

June

April

March

September

February

September

June

April

March

February

September

April

March

January

July

April

July

June

April

January

December

April

March

December

November

October

July

April

March

January

December

November

October

September

August

January

October

September

August

July

May