The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. These include encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using VeraPDF and JHOVE. It builds on earlier work I did on this subject between 2012 and 2017.
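To give a first impression of how both tools are typically invoked, here’s a minimal sketch that runs them on a single PDF from Python and saves their XML reports. The file name is hypothetical, and command-line flags can differ between tool versions, so do check the --help output of your local installations:

```python
# Minimal sketch: run VeraPDF (with feature extraction) and JHOVE (with
# its PDF module) on one file, and store both XML reports.
# Assumes both tools are installed and available on the PATH.
import subprocess

pdf_file = "example.pdf"  # hypothetical input file

# VeraPDF: --extract enables the PDF feature report
vera = subprocess.run(["verapdf", "--extract", pdf_file],
                      capture_output=True, text=True)

# JHOVE: PDF-hul module, XML output handler
jhove = subprocess.run(["jhove", "-m", "PDF-hul", "-h", "xml", pdf_file],
                       capture_output=True, text=True)

with open("verapdf-report.xml", "w") as f:
    f.write(vera.stdout)
with open("jhove-report.xml", "w") as f:
    f.write(jhove.stdout)
```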
This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request from my Digital Humanities colleagues, who are involved in the SANE (Secure ANalysis Environment) project. This project includes a use case in which the SANE environment is used to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.
So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing a single file. Based on these snippets I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of each tool’s performance.
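As an illustration, here’s roughly what such a minimal snippet might look like. This is just a sketch for one possible candidate (ebooklib, combined with BeautifulSoup for stripping the HTML markup), and the input file name is hypothetical:

```python
# Minimal sketch of EPUB text extraction with ebooklib + BeautifulSoup,
# shown here as one possible candidate tool.
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def epub_to_text(path):
    """Return the unformatted text of all content documents in an EPUB."""
    book = epub.read_epub(path)
    chunks = []
    # Iterate over the (X)HTML content documents inside the EPUB
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        chunks.append(soup.get_text())
    return "\n".join(chunks)

print(epub_to_text("novel.epub"))  # hypothetical input file
```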
I recently moved the two Internet domains I own away from the UK-based domain registrar I’d been using since 2004 to an EU-based registrar. While the actual domain transfer was fairly simple, finding a registrar that suited my specific situation turned out to be more difficult than expected. Leaving my old registrar also resulted in a surprise. It’s unlikely that my situation is unique, so I thought it would be useful to share my experiences in this blog post, and point to some useful online resources that I found along the way. The move also allowed me to bring my domains up to date with modern (mostly security-related) internet standards. I’ll briefly address this in the final sections of this post, including some suggestions on how to make these optimizations work with a GitHub Pages-hosted site like this one.
In 2017 I wrote a blog post on Iromlab (an acronym for “Image and Rip Optical Media Like A Boss”), a custom-built software tool that streamlines imaging and ripping of optical media using an Acronova Nimbie disc robot. The KB has been using Iromlab since 2019 as part of an ongoing effort to preserve the information contained in its vast collection of legacy optical media. This project is expected to reach completion later this year, but as demonstrated by this earlier inventory, our deposit collection also contains various other types of legacy media that are under threat of becoming inaccessible. Of these, 3.5 inch floppy disks are the most common data carriers (after optical media), so it made sense to focus on them as a next step.
Using the existing Iromlab-based workflow as a starting point, I created a preliminary workflow tool that can be used for imaging our 3.5” floppies (and various other types of portable media). In this post I’ll explain how this tool came about, and highlight some of the challenges I encountered during its development.
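To give an idea of the imaging step at the heart of such a workflow, here’s a minimal sketch that wraps GNU ddrescue in a Python call. The device path, file names and ddrescue options are assumptions that will need adapting to your own hardware (a USB floppy drive typically shows up as /dev/sdX on Linux):

```python
# Minimal sketch of imaging a floppy with GNU ddrescue via subprocess.
# Device path and options are assumptions; adapt them to your own setup.
import subprocess

device = "/dev/sdb"     # hypothetical block device of the floppy drive
image = "floppy.img"    # output disk image
mapfile = "floppy.map"  # ddrescue map file, makes the rescue resumable

# -d: use direct disc access; -r 3: retry bad sectors three times
subprocess.run(["ddrescue", "-d", "-r", "3", device, image, mapfile],
               check=True)
```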
As a total collapse of Twitter is becoming more likely every day, many Twitter users have started to archive their personal data from the platform while it still exists. Twitter allows you to request and download your personal archive. Even though this works well, and the quality of the archive is surprisingly good, it does have some shortcomings. As a result, some information in the archive (e.g. on followed accounts and followers) will be lost once Twitter ceases to exist. Some other information (in particular full, unshortened URLs) is included in the archive, but not easily accessible from the main HTML interface. The good news is that some excellent tools exist to fix these shortcomings.
In this post I outline the workflow I used to preserve my own Twitter archive, and while doing so I also provide some background information on these shortcomings. Since some of these steps may, at first sight, be a little daunting for less tech-savvy readers, I’ve tried to provide step-by-step instructions where possible.
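As a small example of the kind of processing involved, here’s a minimal sketch that recovers the full, unshortened URLs from the tweets.js file in the archive’s data directory. It assumes the archive layout Twitter used at the time of writing (a JavaScript assignment that wraps a JSON array), which may of course change:

```python
# Minimal sketch: extract full (unshortened) URLs from data/tweets.js.
# The file starts with a JavaScript prefix ("window.YTD.tweets.part0 = "),
# which we strip before parsing the remainder as JSON. Key names follow
# the archive format at the time of writing.
import json

with open("data/tweets.js", encoding="utf-8") as f:
    raw = f.read()

# Drop everything before the opening bracket of the JSON array
tweets = json.loads(raw[raw.index("["):])

for entry in tweets:
    for url in entry["tweet"].get("entities", {}).get("urls", []):
        print(url["expanded_url"])  # the original, unshortened URL
```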