Detecting broken ISO images: introducing Isolyzer

13 January 2017

In my previous blog post I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce Isolyzer, a new tool that might be a useful addition to the existing methods.

More ...

Breaking WAVEs (and some FLACs too)

04 January 2017

At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.

To get a better idea of what software tool(s) are best suitable for this task, I created a small dataset of audio files which I deliberately damaged. I subsequently ran each of these files through a set of candidate tools, and then looked which tools were able to detect the faulty files. The first half of this blog post focuses on the WAVE format; the second half covers the FLAC format (at the moment we haven’t decided on which format to use yet).

More ...

PDF/A as a preferred, sustainable format for spreadsheets?

09 December 2016

Earlier this week the National Archives of the Netherlands (NANeth) published a report on preferred file formats. It gives an overview of NANeth’s ‘preferred’ and ‘acceptable’ formats for 9 content categories, and also explains the reasoning behind the selected formats. Even though in Dutch language only, the report is well worth a look. However, I found a few of the choices a little surprising, especially the ‘spreadsheet’ category for which it lists the following ‘preferred’ and ‘acceptable’ formats:

Preferred Acceptable

The report gives the following explanation on the ‘preferred’ formats (translated from Dutch):

  • ODS - ODS is part of the OpenDocument standard (ODF, NEN-ISO/IEC 26300:2007), which is listed as the standard for office documents on the ‘act or explain’ list of ‘Forum Standaardisatie’1
  • CSV - for the storage of non-interactive information in cells, a comma-delimited (.csv) text file can be used instead of a spreadsheet
  • PDF/A - PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list of ‘Forum Standaardisatie’. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A

In the remainder of this blog post I will pinpoint some problems of the choice of PDF/A and its justification.

More ...

Valid, but not accessible: crazy fixed EPUB layouts

04 April 2016

EpubCheck is an invaluable tool for assessing the quality of EPUB files. Still, it is possible that EPUBs that are valid according to the format specification (and thus EpubCheck) are nevertheless inaccessible to some users. Some weeks ago a colleague sent me an EPUB 2 file that produced some really strange behaviour across a number of viewer applications. For a start, the text wouldn’t reflow properly after re-sizing the viewer window, and increasing the font size resulted in garbled text. Running the file through EpubCheck did return some validation errors, but none of these were related to the behaviour I was getting. Closer inspection revealed some very peculiar stylesheet and HTML use.

More ...

The future of EPUB? A first look at the EPUB 3.1 Editor’s draft

10 March 2016

About a month ago the International Digital Publishing Forum, the standards body behind the EPUB format, published an Editor’s Draft of EPUB 3.1. This is meant to be the successor of the current 3.0.1 version. IDPC has set up a community review, which allows interested parties to comment on the draft. The proposed changes relative to EPUB 3.0.1 are summarised in this document. A note at the top states (emphasis added by me):

The EPUB working group has opted for a radical change approach to the addition and deletion of features in the 3.1 revision to move the standard aggressively forward with the overarching goals of alignment with the Open Web Platform and simplification of the core specifications.

As Gary McGath pointed out earlier, this is a pretty bold statement for what is essentially a minor version. The authors of the draft also mention that they expect it “will provoke strong reactions both for and against”, and that changes that raise “strong negative reactions” from the community “will be reviewed for future drafts”.

This blog post is an attempt to identify the main implications of the current draft for libraries and archives: to what degree would the proposed changes affect (long-term) accessibility? Since the current draft is particularly notable for its aggressive removal of various existing EPUB features, I will focus on these. These observations are all based on the 30 January 2016 draft of the changes document.

More ...