the collection

Do you know about me?

Stand forth, and with bold spirit relate what you,
Most like a careful subject, have collected…
Henry VIII

There are tons of materials in the world that can help us accomplish our mission. Many of them have been digitized in some way, and many of those media files are in the public domain. As part of project willshake, we maintain a collection of such resources, which we catalog for use throughout the system.

Of course, our focus is on Shakespeare. But the discussion here is essentially domain-agnostic, and it may be of interest to any project wishing to collect digital resources for redistribution.

Resources include various types of media.

1 motivation

Maybe this can move up to why new media.

In Bret Victor’s terms, we want to use the most human capabilities: seeing, hearing, feeling, moving, acting. Admittedly, the available resources will only take us so far in that effort.

The “new media” that would fully engage people in all of our faculties just hasn’t been invented. What we have instead are the “old” media.

They offer at best one-way communication. They are “dead,” as such, and by themselves, they cannot help us create the new media that we hope to realize.

We will use them as well as we can to engage people humanely. Clearly, using these resources is better than not using them.

1.1 usage

Use them after your own honour and dignity: the less
they deserve, the more merit is in your bounty.
Hamlet

Our primary concern is with using the available resources. Collecting and cataloging are done in service of that purpose. We are not collecting things just for the sake of it. All of this is driven by usage. It would be easy to get caught up in the act of maintaining a collection for its own sake.

2 prior art

There are many collections of Shakespeare-related materials, both online and off. Perhaps the most famous of the “flesh and blood” collections is the Folger Library.

[Image: interior of the Folger Shakespeare Library (File:Interior,_Folger_Shakespeare_Library.jpg)]

2.1 The Public Domain Review

In the online space, the closest thing I’ve seen to this approach is the Public Domain Review (at publicdomainreview.org). Although its focus is not Shakespeare but generally “the surprising, the strange, and the beautiful,” the Review obviously shares some of its values and methods with this project.

Founded in 2011, The Public Domain Review is an online journal and not-for-profit project dedicated to the exploration of curious and compelling works from the history of art, literature, and ideas.

In particular, as our name suggests, the focus is on works which have now fallen into the public domain, that vast commons of out-of-copyright material that everyone is free to enjoy, share, and build upon without restriction. Our aim is to promote and celebrate the public domain in all its abundance and variety, and help our readers explore its rich terrain – like a small exhibition gallery at the entrance to an immense network of archives and storage rooms that lie beyond.¹

I recognize in this description the vertigo brought on by such an undertaking.

One of the site’s admirers calls it “a beautiful marriage.”² The marriage is between the public domain (“mostly the fabulous Internet Archive”) and the site’s creators, who select and present the materials. As fabulous as the Internet Archive is, it is naturally more like a library than a gallery: a great place for research but not necessarily recreation. The same can be said for other exceptional digital archives, including Wikimedia Commons and Project Gutenberg. The Public Domain Review shows that there is a place for the curation of digital media resources, even when they are freely available elsewhere.

3 advantage of digital collection over physical collection

There are only, what, ≈230 known copies of the First Folio in the world (≈80 of them at the Folger alone)? This impacts their usability rather dramatically. Next year, to mark the 400th anniversary of Shakespeare’s death, a copy of F1 is coming to the Sam Noble Museum of Natural History in Oklahoma, where visitors will surely not be allowed to touch the rare book. Oklahomans wanting to see (and not touch) a First Folio in non-quadricentennial years must content themselves with traveling to another state.

4 media types

This document is only about the general aspects of the collection. It should be possible to use the collection in new ways without impacting the common features discussed here.

But we can safely say that all resources will have a media type, such as “image” or “audio” or “book.”

To put it another way, adding a resource is a matter of adding data, while adding a media type requires new programs.
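The data-versus-programs distinction can be sketched in a few lines. This is only an illustration: the HANDLERS registry, the record fields, and the example resource names are hypothetical and are not taken from willshake itself.

```python
from typing import Callable, Dict

# Hypothetical registry: one handler ("program") per media type.
HANDLERS: Dict[str, Callable[[dict], str]] = {}


def media_type(name: str):
    """Decorator that registers a handler for one media type."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@media_type("image")
def process_image(record: dict) -> str:
    return f"would scale and crop {record['name']}"


@media_type("audio")
def process_audio(record: dict) -> str:
    return f"would transcode {record['name']}"


def process(record: dict) -> str:
    """Dispatch a record to the program for its media type."""
    handler = HANDLERS.get(record["media_type"])
    if handler is None:
        # A new media type needs a new program, not just new data.
        raise KeyError(f"no program for media type {record['media_type']!r}")
    return handler(record)


# Adding a resource is just adding another row of data.
RESOURCES = [
    {"name": "example-portrait", "media_type": "image"},
    {"name": "example-sonnet-reading", "media_type": "audio"},
]
```

Appending to RESOURCES needs no code changes, whereas a record with the media type "book" fails until someone writes and registers a handler for it.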

5 naming

What’s in a name? that which we call a rose
By any other name would smell as sweet
Romeo and Juliet

Each resource in the collection will have a unique name. This is a standard practice in information systems.

Specific points about resource names will be given for each type (see, for example, naming images). We’ll say just a few things about the names generally.

5.1 names are meaningful

We will use meaningful names for resources, as opposed to arbitrarily-assigned numbers, guids, or other nonsense (such as JCBZevxI). Perhaps the

5.2 there is only one namespace

That is, all resources are in the same “pool,” regardless of type. In practice, resources will usually be partitioned by media type: we won’t usually deal with images and audio at the same time, so even if an image and a recording had the same name, it wouldn’t be much of a problem. But we might want to distinguish them
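A single namespace is easy to check mechanically. Here is a minimal sketch, assuming each record is a dictionary with hypothetical "name" and "media_type" fields; the function name is ours for illustration.

```python
from collections import defaultdict


def find_name_collisions(records):
    """Return names used by more than one record, regardless of media type."""
    seen = defaultdict(list)
    for record in records:
        seen[record["name"]].append(record["media_type"])
    return {name: types for name, types in seen.items() if len(types) > 1}


# Hypothetical records: the second and third share a name across media types.
records = [
    {"name": "example-portrait", "media_type": "image"},
    {"name": "example-soliloquy", "media_type": "image"},
    {"name": "example-soliloquy", "media_type": "audio"},
]
collisions = find_name_collisions(records)
```

A check like this could run as part of the build, so that a clash is reported when a record is added rather than when the resource is used.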

6

As we’ve said elsewhere, willshake is essentially a self-contained product.

This makes it low-tech.

Insofar as the collection is part of willshake, it must enjoy the same portability.

This means that we must be able to distribute the collection with willshake.

But the resources in the collection are copyrightable. So we can’t legally distribute them without securing the rights to do so.

In particular, it means that we can only consider public-domain resources for the collection.

The great thing about a digital collection is that anyone can make one, using only an Internet connection.

Of course, a digital facsimile is not as good as the real thing—if by “good” you mean

What we’d like to do is to treat the collection as something that is part of the system. (WHY?) Because we want the system to be portable. (WHY?) Because we want it to work offline. (WHY?) Because it works better offline, while requiring much less infrastructure. This is already discussed elsewhere.

That’s part of our usage. We want people to be able to use the collection in the same way that they use any other part of the system.

7 licensing

Our collection consists of copyrightable materials.

We only use media with permission. By far the easiest way to do this is to stick with public domain media. The alternative is to use copyrighted media with permission. As long as willshake is non-commercial, this still leaves a substantial body of content available for use.

8 our collection is a copy

Lady, you are the cruell’st she alive,
If you will lead these graces to the grave
And leave the world no copy.
Twelfth Night

We will always have copies of everything in the collection. The fact that we may source them from somewhere else as a practice is merely a technical convenience, which keeps the repository from getting too large. The actual source of the files should be effectively transparent. Even when an external source is very well established as an “archival” collection in its own right, there is simply no harm in keeping our own copies of resources that we use.

9 the catalog

The catalog is our compendium of information about the resources in the collection.

For each resource in the collection, we have a record telling what we need to know about it. This is called its metadata. Americans should be familiar with the concept of “metadata” thanks to the month in 2013 when most daily news reports included a little reminder of what “metadata” means. In case you’ve forgotten, weren’t listening, or aren’t American, metadata is “data about data.”

These records are very important. Without a record, we can’t really use a resource, because we don’t know anything about it—what it’s called, where it’s stored, what it’s relevant to, and so on. In some cases, we need the record just to get the actual resource in the first place. Without a record, the resource effectively doesn’t exist.

9.1 data structure

We keep the catalog in a file database, where each record is a file.

Why don’t we use a “real” database for this? No reason, in principle. In practice, read on.

Why don’t we use a single file, rather than one file per record? Isn’t that wasteful, especially when most of the files will be under 4 kilobytes, the block size of a typical file system? The short answer is that, because we use a file-based build system, keeping each record in a separate file makes it easy (or let’s say, idiomatic) to map each resource to a processing pipeline. We might add that, as compared to keeping everything in a single file, the one-file-per-record approach has the advantage of reflecting the fact that the records are in no particular order.
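To make the one-file-per-record idea concrete, here is a minimal sketch of reading such a catalog. It assumes, purely for illustration, that records are JSON files and that the file name (minus extension) is the resource name; neither the format nor the load_catalog function comes from willshake itself.

```python
import json
from pathlib import Path


def load_catalog(folder):
    """Read every *.json file in `folder` into a dict keyed by resource name.

    Assumption: one JSON file per record, named after the resource.
    """
    catalog = {}
    for path in sorted(Path(folder).glob("*.json")):
        catalog[path.stem] = json.loads(path.read_text(encoding="utf-8"))
    return catalog
```

Because each record is its own file, a file-based build tool can attach one rule to each record, and adding or removing a record never touches its neighbors.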

9.2 sets? batches?

We’d like to maintain the invariant that one folder in the repository (that is, a folder of non-generated files) represents all of the image records, from any source.

Yet, it would be convenient, in the case of, e.g. a set of images like the Howards, to express the metadata in the most “natural” format, which in their case is the one that I wrote while working on them. It prevents you from having to repeat the artist and dates, etc. But if you needed to, you could share that in other ways, e.g. by identifying it as part of a set. That would be more meaningful, anyway, than projecting out those values.

10 sourcing

ALL RESOURCES ARE LOCAL

The collection is local. Everywhere in this document—and indeed, throughout the system—we treat the collection as local, that is, something we have a copy of at hand.

Assumptions we make because we know that the collection is local:

that when you test the system, the resources will be there
that when you “ship” the “product,” the resources will be included

Creating the illusion that all resources are local.

That’s not the whole point of this document, but that’s the whole point of at least one section, and I think it’s this one.

We get things that aren’t local so that, yay, now they are local.

But that’s still not all that’s going on here:

zero, one, or two hops. what to call this?
mirroring. saves bandwidth. is this anything more than a defense against flawed build rules?

Most of the resources come from other systems. We prefer to collect resources from well-known, archival sources. So we don’t need to include them in the repository. Instead, we download the files as part of the build.

Resources may come from other systems, or they may come from within the system. We want to treat these uniformly, so far as we can.

Although we’re sticking with “public domain” media, there are yet a number of sources where we can get content. That said, the overwhelming majority of the media files used by willshake come from Wikimedia Commons. Indeed, we consider the Commons to be the “canonical” place for media files in the sense that we’ll only use another source when the file is not already there and we can’t put it there ourselves.

There are considerations common to the flow of files into willshake from other systems.

10.1 hops

We want to treat all resources as local. We classify resources by how many steps it takes to make them local.

In terms of the flow of files, we define three kinds of resources:

local: resources stored in the repository. These don’t need to be downloaded at all. The purpose of the record is to associate metadata with a resource.
internet: resources from the internet at large. We will use such resources only in exceptional cases, since the internet at large is a very unreliable source of files, with no mechanical way to find out licensing, provenance, or metadata of any kind, really. Still, for something that’s really great and which appears to be public domain, we’ll use this while trying to have the file put into an archival source.
catalog: from an external catalog, where a “catalog” has an API. In such cases, we first get a metadata record, which in turn tells where the resource may be downloaded.

The reason these sources are distinct for build purposes is the number of steps required to get to the actual resource.
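The three kinds of source and their step counts can be written down directly. This sketch reflects our reading of the list above (zero steps for local, one for internet, two for catalog); the Source enum and the step descriptions are illustrative names, not part of the actual build.

```python
from enum import Enum


class Source(Enum):
    LOCAL = "local"        # already in the repository
    INTERNET = "internet"  # a plain URL somewhere on the internet
    CATALOG = "catalog"    # an external catalog with an API


# Steps needed to make a resource local, by kind of source.
STEPS = {
    Source.LOCAL: [],
    Source.INTERNET: ["download file"],
    Source.CATALOG: ["fetch metadata record", "download file"],
}


def hops(source: Source) -> int:
    """Number of steps between a record and a local copy of its resource."""
    return len(STEPS[source])
```

Whatever the number of hops, the result is the same: a local file. That is what lets a record change its source without affecting anything that uses the resource.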

Changing the source of a resource should not impact its usage. For example, we will generally try to get resources onto the Commons, if they aren’t already. When this happens, we would update our records to reference the new location. But the result should be the same.

This is where we do the work to allow us to pretend that all resources are in fact local.

10.2 downloading and mirroring

That said, downloads are expensive, for both the giver and the taker. The download must happen once, and nothing should trigger a repeat download (except the absence of the target file, which would only happen for external reasons, i.e., that it was manually deleted).

This is the part where we download the images, which are sometimes huge files on non-profit servers. The last thing we want to do is hit those servers over and over again just because we’re tweaking our build rules. So you’ll notice that we use a local directory outside of the project for the actual media files. Inside of the project, we just create symbolic links to those files, which is extremely cheap once the files are downloaded.

The get-resource script we use for downloading the collection is defined in the internet.
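The download-once-then-link behavior described above can be sketched as follows. This is not the get-resource script; it is a stand-in written under assumptions: ensure_local is a hypothetical name, and the fetch argument is any function that returns the bytes for a given resource name, so the sketch does no networking of its own.

```python
import os
from pathlib import Path


def ensure_local(name, fetch, cache_dir, project_dir):
    """Make `name` available inside the project without re-downloading.

    The file itself lives in `cache_dir` (outside the project); the project
    only gets a symbolic link to it. `fetch(name)` is called only when the
    cached file is absent.
    """
    cache_dir, project_dir = Path(cache_dir), Path(project_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    project_dir.mkdir(parents=True, exist_ok=True)

    target = cache_dir / name
    if not target.exists():
        # Write to a temporary name first so that an interrupted download
        # never leaves a partial file that looks complete.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(name))
        os.replace(partial, target)

    link = project_dir / name
    if not os.path.lexists(link):
        link.symlink_to(target.resolve())
    return link
```

Calling ensure_local twice for the same name triggers one fetch; deleting the cached file by hand is the only thing that triggers another.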

10.2.1 what if the resource changes?

The get-resource script doesn’t check whether or not the original resource has changed since it was copied. So your local copy will not reflect such changes unless you delete the local copy (and the symlink). It’s not a “mirror”; it’s a caching proxy.

Keeping the media files outside of the project directory means that they will not trigger any build actions. Of course, that’s largely the point: we assume that the remote files will never (or almost never) change. Ideally, we could avoid the repeated downloading of the images while still guaranteeing that, say, if one were deleted, it would be re-fetched. Maybe that’s what Tup’s “full deps” option would do, or maybe it would land us right back where we started.

10.2.2 in terms of a build graph

Further to the above…

Most resources will go through some post-processing, sometimes expensive. We’ll shrink and crop images, we’ll index books.

We take pains at each step to ensure that this work is not done repeatedly for no reason.

To this end, we make liberal use of Tup’s ^o^ flag, which short-circuits subsequent processing if the output file hasn’t actually changed. Specifically, as the manual states, this flag

causes the command to compare the new outputs against the outputs from the previous run. Any outputs that are the same will not cause dependent commands in the DAG to be executed. For example, adding this flag to a compilation command will skip the linking step if the object file is the same from the last time it ran.

This is especially useful in the case of files that represent download locations, which are small and easy to check.
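The effect of this flag can be imitated in a few lines. The sketch below is not how Tup implements it; it only illustrates the behavior quoted from the manual, and run_step and the file name are hypothetical.

```python
from pathlib import Path


def run_step(output_path, produce):
    """Run `produce()` and report whether its output differs from last time.

    Returns True when dependents need to run again, False when the new
    output is byte-for-byte the same as the previous one.
    """
    output_path = Path(output_path)
    old = output_path.read_bytes() if output_path.exists() else None
    new = produce()
    output_path.write_bytes(new)
    return new != old
```

A caller would run the expensive downstream work (the actual download, say) only when run_step returns True. For a small file holding a download location, the comparison costs almost nothing next to the download it can save.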

Probably move the note on ^o^ even further down, to the system.

Footnotes:

¹ “About,” http://publicdomainreview.org/about/

² Comment from the Hacker News user “bane,” https://news.ycombinator.com/item?id=10790250