My previous blog post, Assessing file format risks: searching for Bigfoot?, resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to that directly using the comment fields. However, my reply turned out to be rather lengthier than I intended, so I decided to turn it into a separate blog entry.
Last week someone drew my attention to a recent iPres paper by Roman Graf and Sergiu Gordea titled “A Risk Analysis of File Formats for Preservation Planning”. The authors propose a methodology for assessing preservation risks for file formats using information from publicly available sources. In short, their approach involves two stages:
- Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia.
- Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format’s complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
This has resulted in the “File Format Metadata Aggregator” (FFMA), which is an expert system aimed at establishing a “well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts”.
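To make the second stage more concrete, here is a minimal sketch of how a weighted average of per-factor scores could be computed. It is not FFMA's actual implementation: the function name, the factor names, the weights and the assumption that every score is normalised to the range 0 (low risk) to 1 (high risk) are all made up for illustration.

```python
from typing import Dict


def overall_risk(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-factor risk scores.

    Assumption (not from the paper): each score lies between 0 (low risk)
    and 1 (high risk), and each factor has a non-negative weight.
    """
    missing = set(scores) - set(weights)
    if missing:
        raise ValueError(f"no weight defined for: {sorted(missing)}")

    # Only the weights of factors that actually have a score count.
    total_weight = sum(weights[factor] for factor in scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")

    weighted_sum = sum(scores[factor] * weights[factor] for factor in scores)
    return weighted_sum / total_weight


# Hypothetical risk factors and values for one file format.
example_scores = {"software_support": 0.2, "complexity": 0.7, "popularity": 0.4}
example_weights = {"software_support": 2.0, "complexity": 1.0, "popularity": 1.0}

print(overall_risk(example_scores, example_weights))
```

Note how strongly the outcome depends on the chosen weights: doubling the weight of one factor shifts the overall score towards that factor's value, even though the underlying information about the format is unchanged.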
Like many other organisations that are using JPEG 2000, the KB produces two representations of most of its digitised content (newspapers, books, periodicals):
- a high-quality, losslessly compressed JP2 that is the archival master;
- a lesser-quality, lossily compressed JP2 that is used as an access image (this is used for e.g. our newspapers website).
The majority of our digitisation work is contracted out to external suppliers, and both master and access images are typically derived from a parent (TIFF) image, which is converted to JP2 using the settings for master and access images, respectively. This means that we’re not currently using the archival masters for producing derived images. However, there may be a need for this at some point in the future. For instance, we may need higher quality access images, or access images that give better performance in our access environment. Because of this, I was asked to take a further look into ways to derive access JP2s directly from our archival masters.
In this blog post I’ll be sharing some preliminary findings of this work, which may be of interest to other JPEG 2000 practitioners as well. All images and test results that I’ll be showing along the way are available from this GitHub repository, so you can have a go at these data yourself, if you’re so inclined.
Last winter I started a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that “Preflight is thorough and unforgiving (as it should be)”. But what evidence do we have to support such claims? The only evidence that I’m aware of is the set of results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight’s ability to detect ‘risky’ features before relying on this tool in any operational setting.
It’s been more than two years now since I wrote my D-Lib paper JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format. From time to time people ask me about the status of the issues that are mentioned in that paper, so here’s a long overdue update.