This is the first instalment of a two-part blog post. It was prompted by the upcoming Digital Preservation Coalition briefing When is a PDF not a PDF?, for which I was asked to prepare a presentation. My initial idea was to give an overview of the work we did in the SCAPE project on PDF preservation risk assessment using a PDF/A validator. Most of this has already been covered by a series of earlier blog posts, but those posts very much represent different stages of a work in progress, which I think makes them somewhat challenging for readers who are new to the subject.
The current version of the KB’s digital repository system (e-Depot) doesn’t include any tools for automated file format identification yet. Our previous DIAS system didn’t have identification functionality either. As a result, information on file formats in our digital collections is largely based on publisher metadata and file extensions. Neither is necessarily correct. Moreover, previous analyses revealed a number of prevalent file extensions that could not be easily linked to a specific format. One result of this situation was that we couldn’t even reliably tell to what extent patrons were able to view e-Depot content on the PCs in our reading rooms (the obviously common formats aside).
To get a better view of the formats in our collection, we did an analysis of the “top 50” most prevalent file extensions in our e-Depot: what are the corresponding formats, can these formats be automatically identified, and can we render them in our reading rooms? This blog post summarises the main findings of this work.
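The first step of such an analysis, ranking the most prevalent file extensions across a collection, is straightforward to sketch. The snippet below is a minimal illustration of the general idea, not the actual analysis code used at the KB; the file listing is a made-up example:

```python
from collections import Counter
from pathlib import Path

def top_extensions(paths, n=50):
    """Count case-normalised file extensions and return the n most common."""
    counts = Counter(
        Path(p).suffix.lower().lstrip(".") or "(none)" for p in paths
    )
    return counts.most_common(n)

# Hypothetical file listing, for illustration only:
listing = ["article.pdf", "scan.TIF", "issue.xml", "report.pdf", "README"]
print(top_extensions(listing, n=3))  # → [('pdf', 2), ('tif', 1), ('xml', 1)]
```

The remaining (and harder) steps, mapping each extension to an actual format, checking whether it can be identified automatically, and testing renderability in the reading rooms, require tool runs and manual inspection rather than a one-liner.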
Back in 2012 the KB conducted a first investigation of the suitability of the EPUB format for long-term preservation. The KB will soon start receiving publications in this format, and in anticipation of this, our Collection Care department has formulated a policy on the minimum requirements an EPUB must meet to ensure long-term accessibility. The policy largely follows the recommendations from the 2012 report. This blog post explores to what extent it is possible to automatically assess the EPUBs that we receive against our policy, using a combination of the EpubCheck tool and Schematron rules.
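The general idea is to run EpubCheck on each file (for instance `java -jar epubcheck.jar book.epub --out report.xml`, which writes an XML report) and then assess that report against machine-readable policy rules. The sketch below illustrates the second step in Python rather than Schematron, using a deliberately simplified, hypothetical report structure that does not match EpubCheck's actual output schema; the policy rule shown ("no validation errors, EPUB version 3.x") is likewise just an example:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an EpubCheck XML report (hypothetical structure):
SAMPLE_REPORT = """\
<report>
  <version>3.0</version>
  <messages>
    <message severity="WARNING">Example warning, not a policy violation</message>
  </messages>
</report>
"""

def meets_policy(report_xml):
    """Example policy check: EPUB version 3.x and no ERROR-level messages."""
    root = ET.fromstring(report_xml)
    version = root.findtext("version", default="")
    errors = [m for m in root.iter("message") if m.get("severity") == "ERROR"]
    return version.startswith("3") and not errors

print(meets_policy(SAMPLE_REPORT))  # → True
```

In practice, expressing such rules in Schematron has the advantage that the policy itself stays a declarative, reviewable XML document, separate from any processing code.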
Shortly before Christmas, Dutch daily newspaper Trouw removed 126 articles from its website. These articles were all authored by Perdiep Ramesar, a former journalist of the newspaper. Ramesar had been fired by Trouw in November, after it turned out that many of the sources cited in his articles had been fabricated. The most notorious example was a series of pieces about the so-called “Sharia Triangle”, a neighbourhood in the city of The Hague, which Ramesar claimed was being ruled by Sharia law. As it turned out, this story was largely based on fabricated sources. Nevertheless, it was taken at face value by most major Dutch news outlets at the time, and even prompted a parliamentary debate.
Trouw’s decision to remove the 126 articles overnight was met with considerable criticism. For example, historian Jan Dirk Snel noted that the removal of these articles makes it impossible to check what was wrong with them in the first place. Various other critics accused Trouw of trying to rewrite history.
Earlier this week, daily newspaper Trouw removed 126 articles from its website that had been written by dismissed journalist Perdiep Ramesar. The reason was the investigation into the “untraceable” sources Ramesar had cited. Trouw’s decision to take the unreliable articles off the site met with considerable criticism. Some called it a falsification of history. Historian Jan Dirk Snel rightly noted that, now that the pieces have been removed, no one can verify what may or may not have been wrong with them.