Some four years ago I wrote a blog post that demonstrated how Apache Preflight (the PDF/A validator tool that is part of Apache PDFBox) can be used to detect features in a PDF that are potential preservation risks. A follow-up blog applied Schematron rules to the Preflight output in an attempt at doing policy-based assessments. The results of that work were quite promising, but dealing with Preflight’s multitude of (especially font-related) validation errors proved to be a challenge.
The idea of using a PDF/A validor for policy-based assessments of “regular” PDF files (i.e. PDFs that are not necessarily PDF/A) was explicitly addressed as a use case for veraPDF. With VeraPDF now having entered its “final testing phase”, I thought this was a good time for a small test-drive of veraPDF’s capabilities in this area. All test results are based on VeraPDF 1.4.7.
The development work on an imaging/ripping workflow for optical media is shaping up steadily, and you can expect a write-up with more information about our software and hardware setup here in the near future (you can get a sneak peek here). However, this blog is about a very specific problem that we ran into while testing the workflow with a selection of discs from our collection. This selection included a few discs that follow the Blue Book standard (also known as CD-Extra). This standard defines a method for combining audio and data tracks on one disc. A CD-Extra disc contains two sessions, where the first session holds all audio tracks, and the second session holds a data track. Blue Book was (and still is) widely used for audio CD’s with bonus videos or software.
In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce Isolyzer, a new tool that might be a useful addition to the existing methods.
At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.
To get a better idea of what software tool(s) are best suitable for this task, I created a small dataset of audio files which I deliberately damaged. I subsequently ran each of these files through a set of candidate tools, and then looked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (at the moment we haven’t decided on which format to use yet).
Earlier this week the National Archives of the Netherlands (NANeth) published a report on preferred file formats. It gives an overview of NANeth’s ‘preferred’ and ‘acceptable’ formats for 9 content categories, and also explains the reasoning behind the selected formats. Even though in Dutch language only, the report is well worth a look. However, I found a few of the choices a little surprising, especially the ‘spreadsheet’ category for which it lists the following ‘preferred’ and ‘acceptable’ formats:
|ODS, CSV, PDF/A
The report gives the following explanation on the ‘preferred’ formats (translated from Dutch):
- ODS - ODS is part of the OpenDocument standard (ODF, NEN-ISO/IEC
26300:2007), which is listed as the standard for office documents on the ‘act or explain’ list of ‘Forum Standaardisatie’
- CSV - for the storage of non-interactive information in cells, a comma-delimited (.csv) text file can be used instead of a spreadsheet
- PDF/A - PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list of ‘Forum Standaardisatie’. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A
In the remainder of this blog post I will pinpoint some problems of the choice of PDF/A and its justification.