The PDF format contains various features that may make it difficult to
access content that is stored in this format in the long term. Examples
include (but are not limited to):
- Encryption features, which may either restrict some functionality
(copying, printing) or make files inaccessible altogether.
- Multimedia features (embedded multimedia objects may be subject to
- Reliance on external features (e.g. non-embedded fonts, or
references to external documents)
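As a rough illustration of how such features might be flagged, here is a minimal sketch that searches the raw bytes of a PDF for a handful of name tokens. The marker lists and the function name `scan_pdf_bytes` are my own assumptions for this example, not a complete or authoritative list. It is a crude heuristic: it will miss anything stored inside compressed object streams, and a hit only means the token occurs somewhere in the file.

```python
# Heuristic sketch: look for PDF name tokens in the raw bytes of a file.
# The choice of markers below is an assumption made for illustration only.
RISK_MARKERS = {
    "encryption": [b"/Encrypt"],
    "multimedia": [b"/RichMedia", b"/Movie", b"/Sound"],
    "external references": [b"/GoToR", b"/Launch"],
}


def scan_pdf_bytes(data: bytes) -> dict:
    """Return, for each risk category, the markers found in the raw bytes."""
    return {
        category: [m.decode("ascii") for m in markers if m in data]
        for category, markers in RISK_MARKERS.items()
    }


# Tiny made-up fragment, just enough to exercise the function.
sample = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
print(scan_pdf_bytes(sample))
```

A proper assessment would parse the document structure with a real PDF library rather than rely on a byte search.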
I’ve already written a number of blog posts on format validation of JP2 files. Format validation is only one aspect of a quality assessment workflow. Digitisation guidelines typically impose various constraints on the technical characteristics of preservation and access images. For example, they may state that a preservation master must be losslessly compressed, and that its progression order must be RPCL. A format profile is a set of such technical constraints. The process that compares the technical characteristics of a file against a format profile is sometimes called Policy Driven Validation. This corresponds to what JHOVE2 refers to as Assessment (which I think is a better description).
This blog post describes a simple method for doing a rule-based assessment of JP2 images. It uses Schematron, which is a rule-based validation language, to ‘validate’ the output of jpylyzer against a profile. Before getting into any technical details, let’s first have a look at an example of a format profile.
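To make the idea of rule-based assessment a bit more concrete, here is a minimal Python sketch. It is not the Schematron approach itself, just the same principle in a few lines. The XML fragment is a simplified, made-up stand-in for jpylyzer output, and the `PROFILE` dictionary and `assess` function are hypothetical names used only here. I'm also assuming that "losslessly compressed" can be approximated by checking for the reversible 5-3 wavelet transformation.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up stand-in for the properties reported by jpylyzer.
SAMPLE = """
<properties>
  <transformation>5-3 reversible</transformation>
  <order>RPCL</order>
</properties>
"""

# Hypothetical profile: element name -> required value.
PROFILE = {
    "transformation": "5-3 reversible",  # assumed proxy for lossless compression
    "order": "RPCL",                     # required progression order
}


def assess(xml_text, profile):
    """Return a list of messages, one for each rule that the file fails."""
    root = ET.fromstring(xml_text)
    failures = []
    for name, expected in profile.items():
        actual = root.findtext(f".//{name}")
        if actual != expected:
            failures.append(f"{name}: expected {expected!r}, found {actual!r}")
    return failures


print(assess(SAMPLE, PROFILE))  # an empty list means all rules are met
```

Schematron expresses the same kind of rule declaratively, as assertions on the XML, which keeps the profile separate from any program code.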
The purpose of this post is to give a brief introduction to creating, editing and submitting format signatures (or ‘magic’ entries) for the well-known File tool. The occasion for this was some work I did last week on improving File’s identification of the JPEG 2000 formats. I had some difficulty finding any easy-to-follow documentation that describes how to do this. The information is all out there, but it’s pretty fragmented. So, I wrote this brief tutorial, which is intended as an accessible introduction to magic editing. It only covers the very basics, but hopefully this is enough to overcome some initial stumbling blocks.
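For readers unfamiliar with the concept, a format signature is essentially a byte pattern expected at a given offset in a file. The sketch below shows that idea in Python for the 12-byte signature box that starts a JP2 file. This is not File's magic syntax, and the function name `looks_like_jp2` is made up for the example.

```python
# The 12-byte JPEG 2000 signature box found at the start of a JP2 file.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(data: bytes) -> bool:
    """Return True if the data starts with the JP2 signature box."""
    return data.startswith(JP2_SIGNATURE)


print(looks_like_jp2(JP2_SIGNATURE + b"rest of the file"))
print(looks_like_jp2(b"%PDF-1.4"))
```

A magic entry encodes the same information (offset, data type, expected value) plus a description to print on a match.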
In this blog post I’ll be dusting off some old stuff for a change. The occasion for this is the following question, posted by Paul Wheatley on the Libraries and Information Science Stack Exchange website a few days ago:
What preservation risks are associated with the PDF file format?
Over the last few years, the EPUB format has gained widespread popularity in the consumer market. The KB has been approached by a number of publishers that wish to use EPUB for delivering some of their electronic publications. Surprisingly little information is available on the format’s suitability for archival preservation, apart from the Library of Congress’ Sustainability of Digital Formats web pages, which contain entries on EPUB 2 and EPUB 3.
So, the KB’s Departments of Collection and Collection Care requested a more detailed investigation of EPUB’s preservation credentials. More specifically, answers were needed to the following questions:
What are the main characteristics of EPUB?
What functionality does EPUB provide, and is this sufficient for representing e.g. content with sophisticated layout and typography requirements?
How well is EPUB supported by software tools that are used in (pre-)ingest workflows?
How suitable is EPUB for archival preservation? What are the main risks?