The KB has quite a large collection of offline optical media, such as CD-ROMs, DVDs and audio CDs. We’re currently investigating how to stabilise the contents of these materials using disk imaging. During the initial phase of this work I did a number of tests with various open-source tools. It’s doubtful whether we’ll end up using these same tools in our actual workflows. The main reason for this is the sheer size of the collection, which we estimated at some 15,000 physical carriers; possibly even more. At those volumes we will need a solution that involves the use of a disk robot, and these often require dedicated software (we still need to investigate this more in-depth).
Nevertheless, throughout the initial testing phase I was surprised at the number of useful tools that are available in the open source domain. Since this will probably be of interest to others as well, I decided to polish a selection from my rough working notes into a somewhat more digestible form (or so I hope!). I edited my original notes down to the following topics:
- How to figure out the device path of the CD drive
- How to create an ISO image from a CD-ROM or DVD
- How to check the integrity of the created ISO image
- How to extract audio from an audio CD
In addition there’s a final section that covers my attempts at imaging a multisession / mixed mode CD. The result of this particular exercise wasn’t all that successful, but I included it anyway, as some may find it useful. All software mentioned here are open-source tools that are available for any modern Linux distribution (I’m using Linux Mint myself). Some can be used under Windows as well using Cygwin.
Today my attention was caught by this report of an “Expert round table” on JPEG2000 and Digitisation, which was published on the TownsWeb Archiving blog. Although the report as a whole is quite balanced, it’s unfortunate that it provides fuel to some long-running myths about JPEG 2000 not supporting fully lossless compression. Since I wasn’t able to leave a comment on the Townweb blog itself, I turned my response into this small blog post.
This is the second and final instalment of a 2-part blog on the use of PDF/A validators for identifying preservation risks in PDF. You can read the first part here. In Part 1 I showed how PDF/A validators can be used to identify preservation risks in a PDF. I illustrated this with an example that uses the PDF/A validator component of Adobe Acrobat’s Preflight tool. Needless to say, Acrobat is not scalabe to situations where you need to verify large volumes of PDFs. Luckily, several stand-alone PDF/A validators exist that are designed especially to do just that.
This is the first instalment of a 2-part blog. It was prompted by the upcoming Digital Preservation Coalition briefing When is a PDF not a PDF?, for which I was asked to prepare a presentation. My initial idea was to give an overview of the work we did on PDF preservation risk assessment using a PDF/A validator in the SCAPE project. Most of this has already been covered by a series of earlier blog posts. Those blogs very much represent different stages of a work in progress, and I think this makes them somewhat challenging for readers who are new to the subject.
The current version of the KB’s digital repository system (e-Depot) doesn’t include any tools for automated file format identification yet. Our previous DIAS system didn’t have identification functionality either. As a result, information on file formats in digital our collections is largely based on publisher metadata and file extensions. Neither are necessarily correct. Moreover, previous analyses revealed a number of prevalent file extensions that could not be easily linked to a specific format. One result of this situation was that we couldn’t even reliably tell to what extent patrons were able to view e-Depot content on the PCs in our reading rooms (the obviously common formats aside).
To get a better view of the formats in our collection, we did an analysis of the “top 50” most prevalent file extensions in our e-Depot: what are the corresponding formats, can these formats be automatically identified, and can we render them in our reading rooms? This blog post summarises the main findings of this work.