Crawling offline web content: the NL-menu case

11 July 2018

In a previous blog post I showed how we resurrected NL-menu, the first Dutch web index. It explains how we recovered the site’s data from an old CD-ROM, and how we subsequently created a local copy of the site by serving the CD-ROM’s contents on the Apache web server. This follow-up post covers the final step: crawling the resurrected site to a WARC file that can be ingested into our web archive.


Resurrecting the first Dutch web index: NL-menu revisited

24 April 2018

NL-menu was the first Dutch web index. The site was originally founded by a consortium of SURFnet, Dutch universities and the KB. From the mid-nineties onwards it was maintained solely by the KB. NL-menu was discontinued in 2004, after which the site was taken offline. In 2006 the domain name was sold to a private company that used it for hosting a web index that was partially based on the original NL-menu site.


Update on Isolyzer: UDF, HFS+ and more!

12 July 2017

Earlier this year I blogged about Isolyzer, a tool designed to help the detection of broken ISO images. Today I released a shiny new beta version that adds a significant amount of new functionality. Below is an overview of the main changes, followed by some warnings and caveats.


Image and Rip Optical Media Like A Boss!

19 June 2017

Over the last months we’ve been working on the development of a provisional workflow for preserving the content of optical media in our collection. The main result thus far is Iromlab, a custom workflow application that streamlines the imaging and ripping process. This blogpost gives an overview of Iromlab, as well as the reasons why we created it in the first place.


Policy-based assessment with VeraPDF - a first impression

01 June 2017

Some four years ago I wrote a blog post that demonstrated how Apache Preflight (the PDF/A validator tool that is part of Apache PDFBox) can be used to detect features in a PDF that are potential preservation risks. A follow-up blog applied Schematron rules to the Preflight output in an attempt at doing policy-based assessments. The results of that work were quite promising, but dealing with Preflight’s multitude of (especially font-related) validation errors proved to be a challenge.

The idea of using a PDF/A validor for policy-based assessments of “regular” PDF files (i.e. PDFs that are not necessarily PDF/A) was explicitly addressed as a use case for veraPDF. With VeraPDF now having entered its “final testing phase”, I thought this was a good time for a small test-drive of veraPDF’s capabilities in this area. All test results are based on VeraPDF 1.4.7.



Archive

2020

September

June

April

March

February

2019

September

April

March

January

2018

July

April

2017

July

June

April

January

2016

December

April

March

2015

December

November

October

July

April

March

January

2014

December

November

October

September

August

January

2013

October

September

August

July

May

April

January

2012

December

September

August

July

June

April

January

2011

December

September

July

June

2010

December