Extracting text from EPUB files in Python
This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the SANE (Secure ANalysis Environment) project, which includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for advice on how to implement the text extraction component, preferably as a Python-based solution.
So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.
I evaluated the following tools:
Tika-python. This is a Python wrapper for Apache Tika (which itself is a Java application). Apache Tika is a toolkit for text and metadata extraction from a wide range of file formats, including EPUB.
Textract. This offers text extraction functionality that is similar to Tika, but unlike Tika, Textract is natively written in Python.
EbookLib. This is a Python library for reading and writing E-books in various formats, including EPUB (both EPUB 2 and EPUB 3). EbookLib is also the E-book library that is used by Textract.
The following table shows the versions of these tools that I used in my tests:
Test environment and data
For all of my tests I used a simple desktop PC running Linux Mint 20.1 (Ulyssa), MATE edition, with Python 3.8.10.
I used two data sets:
- A selection of 15 files in EPUB 2.0.1 format from the KB’s DBNL (Digital Library for Dutch Literature) collection.
- A selection of 10 files in EPUB 3.2 format from Standard Ebooks.
All files in both data sets are structurally valid EPUB (2.0.1 / 3.2): validation with EPUBCheck 4.2.6 didn’t result in any reported errors or warnings¹.
Tika-python

After installing Tika-python, as a first test I tried to write a minimal code snippet that extracts the text from one single EPUB, and then writes the result as UTF-8 encoded text to a file. Following Tika-python’s README (the example under “Parser Interface”), I started out with this:
```python
#! /usr/bin/env python3

import tika
from tika import parser

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

parsed = parser.from_file(fileIn)
content = parsed["content"]

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)
```
Metadata strings in text output
Inspection of the resulting output file showed a succession of text strings with the names of embedded fonts towards the end of the file. As an example:
Charis SIL Bold Italic :: :: Charis SIL Small Caps
When I ran Tika (the Java application) directly without using the Tika-python wrapper, results were as expected. A closer inspection of the Tika-python source code showed that Tika-python’s parsing of the Tika output doesn’t quite work the way it should, with the result that extracted metadata is erroneously included in the text output.
Workaround: set service to text
Fortunately there’s a simple workaround for this. In the parser function call, just add the “service” parameter and set its value to “text”, as shown here:
```python
#! /usr/bin/env python3

import tika
from tika import parser

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

parsed = parser.from_file(fileIn, service='text')
content = parsed["content"]

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)
```
With this change, the font-related text strings were no longer reported.
Image tags and alt-text strings
Unfortunately, setting the “service” parameter in this way has the unexpected side-effect that the text output now includes tags with alt-text descriptions for any images in the file. For example:
[image: cover] Aster Berkhof Veel geluk, professor! [image: DBNL]
Different behaviour between Tika app and TikaServer
I initially thought this was also a bug in Tika-python, but it turns out this isn’t the case. Using the Tika Java application directly:
java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub > berk011veel01_01-app.txt
This resulted in an output file with no alt-text strings. However, Tika-python doesn’t wrap around Tika-app, but around TikaServer. After starting TikaServer, I used the command below to process the same EPUB:
curl -T berk011veel01_01.epub http://localhost:9998/tika --header "Accept: text/plain" > berk011veel01_01-server.txt
The resulting file also included the offending image tags and alt-text strings. So, the Tika application and TikaServer behave differently. After reporting an issue for this, I received a confirmation from Tika’s lead developer:
There’s a subtle difference in the handlers used in tika-app and tika-server. We’re using the “RichTextContentHandler” in server but not in app. I think I’ve known about this for a while, but we’ll be breaking behaviour for whichever one we fix.
I also created a separate issue at Tika-python for the inclusion of metadata in the text output. Unfortunately this issue is closely related to (and partly the result of) the upstream issue in TikaServer. So until that upstream issue is fixed, the current (slightly confusing) situation will most likely persist.
OCR if Tesseract is installed
By default, Tika applies optical character recognition (OCR) to any images in an EPUB if the Tesseract software is installed, and includes the OCR output in the extracted text. In many cases (at least for ours!) this might not be the desired behaviour. I only found out about this weeks after doing the original tests described in this post, when re-running some of them suddenly resulted in slightly larger output files, containing text that wasn’t there originally. The root cause turned out to be that I had unknowingly installed some software that pulls in Tesseract as a dependency. It’s possible to disable OCR in the Java application and TikaServer using a command-line option that points to a configuration file, but I haven’t found a way to do this in Tika-python. The safest option might be to make sure that Tesseract is not installed, or to rename Tesseract’s installation folder.
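Since Tika silently enables OCR whenever Tesseract is present, it can be worth checking for the binary explicitly before a processing run. A minimal sketch (the function name is my own) using the standard library’s `shutil.which`:

```python
import shutil

def tesseract_on_path():
    """Return the path to the Tesseract binary if it is on the PATH, else None."""
    return shutil.which("tesseract")

if __name__ == "__main__":
    path = tesseract_on_path()
    if path:
        # Tika may silently add OCR output to the extracted text in this case
        print(f"Warning: Tesseract found at {path}")
    else:
        print("Tesseract not found; no OCR output expected")
```

Running such a check at the start of a batch job at least makes the OCR behaviour visible, even if it can’t be disabled from Tika-python itself.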
Textract

As with Tika-python, as a first test I created a minimal code snippet for processing one EPUB file:
```python
#! /usr/bin/env python3

import textract

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

content = textract.process(fileIn, encoding='utf-8').decode()

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)
```
For the very first EPUB file (from the DBNL collection) this resulted in an empty output file. Results were similar for most other DBNL EPUBs, from which Textract only managed to extract a handful of words at most. Results were considerably better for the Standard Ebooks files, with output that was similar to Tika-python in most cases. I reported this issue to the developers.
EbookLib

I mainly included EbookLib because Textract uses it “under the hood” for EPUB, and I was curious whether using it directly would give similar results to Textract. Based on its documentation I created the following minimal code snippet:
```python
#! /usr/bin/env python3

from html.parser import HTMLParser

import ebooklib
from ebooklib import epub

class HTMLFilter(HTMLParser):
    # Simple filter that collects all character data, discarding the markup
    text = ""
    def handle_data(self, data):
        self.text += data

fileIn = "berk011veel01_01.epub"
fileOut = "berk011veel01_01.txt"

book = epub.read_epub(fileIn)

content = ""

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        bodyContent = item.get_body_content().decode()
        f = HTMLFilter()
        f.feed(bodyContent)
        content += f.text

with open(fileOut, 'w', encoding='utf-8') as fout:
    fout.write(content)
```
Compared to Tika-python and Textract, the EbookLib script is a bit more involved, as EbookLib doesn’t provide any high-level text extraction functions. Instead, the user must iterate over all document items, extract the (X)HTML, and then convert that to unformatted text. At first glance, tests with the DBNL and Standard Ebooks EPUBs didn’t result in any issues, and the results were similar to Tika-python.
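The (X)HTML-to-text step can be exercised in isolation with just the standard library. A minimal sketch, using an `HTMLParser` subclass that keeps only character data, applied to a small inline XHTML fragment:

```python
from html.parser import HTMLParser

class HTMLFilter(HTMLParser):
    """Collects all character data from the input, discarding tags and attributes."""
    def __init__(self):
        super().__init__()
        self.text = ""

    def handle_data(self, data):
        self.text += data

xhtml = "<html><body><h1>Chapter 1</h1><p>It was a <em>dark</em> night.</p></body></html>"

f = HTMLFilter()
f.feed(xhtml)
print(f.text)  # → Chapter 1It was a dark night.
```

Note that this naive approach simply concatenates text nodes, so words from adjacent block elements can run together; for a real workflow it may be worth inserting whitespace or newlines between blocks.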
Demo scripts

Based on the above minimal code snippets, I created three simple demonstration scripts for Tika-python, Textract and EbookLib. Each of these scripts extracts the text of each EPUB file in a user-defined input directory. The extracted text is then written to a user-defined output directory. Each script also writes a file with word counts for the extraction results, which is useful for a rough comparison of the different tools.
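The overall structure shared by these scripts can be sketched as below. Here `extract_text` is a placeholder of my own naming for whichever tool-specific call is used (Tika-python, Textract or EbookLib), and word counts are approximated with a simple whitespace split:

```python
import os

def extract_text(epub_path):
    # Placeholder: substitute the Tika-python, Textract or EbookLib call here
    raise NotImplementedError

def process_directory(dir_in, dir_out, extract=extract_text):
    """Extract text from each EPUB in dir_in; write one .txt per file plus a
    CSV with word counts, and return the counts as a dictionary."""
    counts = {}
    os.makedirs(dir_out, exist_ok=True)
    for name in sorted(os.listdir(dir_in)):
        if not name.lower().endswith(".epub"):
            continue
        content = extract(os.path.join(dir_in, name)) or ""
        base = os.path.splitext(name)[0]
        with open(os.path.join(dir_out, base + ".txt"), "w", encoding="utf-8") as fout:
            fout.write(content)
        # Rough word count: split on whitespace
        counts[name] = len(content.split())
    with open(os.path.join(dir_out, "wordcounts.csv"), "w", encoding="utf-8") as fcsv:
        for name, count in counts.items():
            fcsv.write(f"{name},{count}\n")
    return counts
```

Because the extraction call is passed in as a function, the same driver can be reused for all three tools, which keeps the word-count comparison consistent.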
I ran each script twice, using the DBNL and Standard Ebooks data sets as input, respectively.
The table below shows the resulting word counts for the books in the DBNL data set:
| File name | Words (Tika) | Words (Textract) | Words (EbookLib) |
|-----------|--------------|------------------|------------------|
Note the extremely low (near-zero) word counts for Textract. The results for Tika and EbookLib are roughly the same.
Running the scripts on the Standard Ebooks EPUBs gave the following result:
| File name | Words (Tika) | Words (Textract) | Words (EbookLib) |
|-----------|--------------|------------------|------------------|
In this case, all three tools produced similar word counts. The exception is the “King Lear” EPUB, for which Textract gave a word count that was about 10,000 lower than for the other tools. I haven’t looked in detail into where exactly this difference comes from, but it confirms that in its current state, Textract isn’t a suitable tool for our purposes.
Table of Contents
Depending on the structure of the source EPUB, the extraction result may or may not contain a table of contents. In EPUB 2, the table of contents is implemented as an XML-formatted “Navigation Control File” (NCX). The NCX was replaced by the “Navigation Document” (an XHTML file) in EPUB 3. Neither Tika nor EbookLib extracts NCX resources, but both extract Navigation Documents. Consequently, in most cases the extraction result only includes a table of contents for EPUB 3 files. Textract extracts neither the NCX nor the Navigation Document.
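If the EPUB 2 table of contents is needed after all, the NCX can be parsed separately. A sketch (function name my own) that pulls the navLabel texts out of an NCX document with the standard `xml.etree.ElementTree`, shown here on a minimal inline fragment:

```python
import xml.etree.ElementTree as ET

# Namespace used by the EPUB 2 Navigation Control File (NCX)
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

def toc_entries(ncx_xml):
    """Return the navLabel texts of all navPoints in an NCX document."""
    root = ET.fromstring(ncx_xml)
    return [el.text for el in root.findall(".//ncx:navPoint/ncx:navLabel/ncx:text", NCX_NS)]

ncx = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="ch1"><navLabel><text>Chapter 1</text></navLabel>
      <content src="ch1.xhtml"/></navPoint>
    <navPoint id="ch2"><navLabel><text>Chapter 2</text></navLabel>
      <content src="ch2.xhtml"/></navPoint>
  </navMap>
</ncx>"""

print(toc_entries(ncx))  # → ['Chapter 1', 'Chapter 2']
```

In a real EPUB the NCX would first have to be located via the package file’s manifest and read from the ZIP container, which is left out of this sketch.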
Conclusions

Based on these tests, both Tika-python and EbookLib look like potentially suitable Python-based tools for extracting unformatted text from EPUB files. Of these, Tika-python provides the most straightforward interface. Tika also supports a wide range of other file formats, so any code based on Tika’s text extraction can be easily extended to other formats later.
The inclusion of tags and alt-text descriptions for images in Tika’s output may be a problem though. As an example, imagine a researcher who uses Tika-python to analyse the emergence of certain words or phrases through time using EPUB versions of 19th century books. Any alt-text descriptions in such materials would most likely be contemporary, and as such they would “pollute” the original “signal” (19th century text) with modern language. So, prospective users of Tika-python should carefully review whether this behaviour is acceptable for their use case. The inclusion of optical character recognition output from embedded images in the extraction result can also result in some unexpected surprises, so it’s important that users are aware of Tika’s default behaviour in this regard.
EbookLib doesn’t have these drawbacks, but the absence of a high-level text extraction interface does require some more work on the user’s side. Also, since EbookLib only supports a limited number of Ebook formats, extending any code based on it to other file formats will be less straightforward.
In its current form, Textract is not suitable for our use case.
It’s important to highlight the limitations of this analysis. First, it is based on only two small, homogeneous data sets, both of which only contain structurally valid EPUB files. It’s unclear how well these results translate to more heterogeneous collections (which often contain files that violate the format specifications in various ways). Second, the main objective here was to obtain a broad impression of the behaviour of the tested tools. The scope didn’t include an in-depth analysis of the accuracy and completeness of the extraction results. Finally, I didn’t look into the computational performance of the tested tools. As the SANE use case will only involve processing a limited number of files, performance isn’t important here.
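Should throughput ever become a concern for larger collections, a first impression is easy to obtain by timing the extraction call. A minimal sketch using `time.perf_counter`, with a stand-in workload in place of an actual extraction function:

```python
import time

def timed(func, *args, **kwargs):
    """Run func with the given arguments; return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; substitute e.g. a call to parser.from_file here
result, elapsed = timed(sum, range(1000000))
print(f"{elapsed:.3f} s")
```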
Link to demo scripts
EPUB text extraction demo:
¹ For convenience I actually used the EPUBCheck Python wrapper: https://github.com/titusz/epubcheck/.