Last year (2012) the KB released a report on the suitability of the EPUB format for archival preservation. A substantial number of EPUB-related developments have happened since then, and as a result some of the report’s findings and conclusions have become outdated. This applies in particular to the observations on EPUB 3, and the support of EPUB by characterisation tools. This blog post provides an update to those findings. It addresses the following topics in particular:
- Use of EPUB in scholarly publishing
- Adoption and use of EPUB 3
- EPUB 3 reader support
- Support of EPUB by characterisation tools
In the following sections I will briefly summarise the main developments in each of these areas, after which I will wrap up things in a concluding section.
About a year ago, work started on packaging SCAPE tools. Jpylyzer was the first SCAPE tool that was turned into a Debian package. Some time later, the OPF set up a couple of machine images at Amazon Web Services, which can be used to create packages repeatedly using a virtual machine. Even though I’ve used the Amazon service a couple of times myself, I really know next to nothing about Debian packages, and it’s safe to say that the underlying build process has been more or less a complete mystery to me.
To get a better understanding of the process for building Debian packages, I had a try at packaging jpylyzer on my local machine (which runs on Linux Mint 14). Some time ago Dave Tarrant and Rui Castro wrote a nice step-by-step guide on building Debian packages on the OPF Wiki, so I tried to follow the instructions there. While working on this, I made some notes, mainly to remind myself of what I was doing. Then I realised that some of this might be useful to others as well, so I decided to turn it into a blog post.
The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-2 and PDF/A-1, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated on-line discussions. But what do we actually mean by embedded files? As it turns out, the answer to this question isn’t as straightforward as you might think. One of the reasons for this is that in colloquial use we often talk about “embedded files” to describe the inclusion of any “non-text” element in a PDF (e.g. an image, a video or a file attachment). On the other hand, the word “embedded files” in the PDF standards (including PDF/A) refers to something much more specific, which is closely tied to PDF’s internal structure.
The PDF format contains various features that may make it difficult to
access content that is stored in this format in the long term. Examples
include (but are not limited to):
- Encryption features, which may either restrict some functionality
(copying, printing) or make files inaccessible altogether.
- Multimedia features (embedded multimedia objects may be subject to
- Reliance on external features (e.g. non-embedded fonts, or
references to external documents)
I’ve already written a number of blog posts on format validation of JP2 files. Format validation is only a one aspect of a quality assessment workflow. Digitisation guidelines typically impose various constraints on the technical characteristics of preservation and access images. For example, they may state that a preservation master must be losslessly compressed, and that its progression order must be RPCL. A format profile is a set of such technical constraints. The process that compares the technical characteristics of a file against a format profile is sometimes called Policy Driven Validation. This corresponds to what JHOVE2 refers to as Assessment (which I think is a better description).
This blog post describes a simple method for doing a rule-based assessment of JP2 images. It uses Schematron, which is a rule-based validation language, to ‘validate’ the output of jpylyzer against a profile. Before getting into any technical details, let’s first have a look at an example of a format profile.