The Schoenberg Institute for Manuscript Studies at Penn brings manuscript culture, modern technology and people together.


An Ideal Collation of LJS 101

By Jesse McDowell

The book I was assigned in Will Noel and Dot Porter’s Rare Book School course, “The Medieval Manuscript in the Twenty-First Century,” certainly lived up to the course’s title. I spent my time working with LJS 101, a Carolingian manuscript from 9th-century France. Like many medieval manuscripts, this one has been bound more than once, and so I used text-matching and open data to reconstruct the book’s original physical structure in digital form.

LJS 101 is a parchment manuscript bound in 10 quires containing Boethius’s Latin translation of Aristotle’s De interpretatione (On Interpretation). It originates from north-central France, most likely the abbey at Saint-Benoît-sur-Loire (also known as the Abbaye de Fleury).

LJS 101 1v-2r

However, the book contains more than a translation. Though LJS 101 originates in the 9th century, it contains replacement leaves added to the beginning and end in the 11th century (current fols. 1-4 and 45-64), with seemingly the same hand correcting the first scribe’s slips in the text of the translation. Also in the 11th century, color was added to existing initials and to diagrams charting Aristotle’s formal connection between language and logic. Though my narrative does not perform anything close to a textual study, we’ll call these two scribes Scribe A (9th-century hand) and Scribe B (11th-century hand), in the manner of textual critics.

Scribe A composed the body of the translation itself from folios 5-44. Scribe B’s contribution is sparser, his hand (as mentioned above) showing complete replacement leaves at folios 1-4 and 45-64, as well as added corrections and glossed material throughout the manuscript. Scribe B introduces new genres in folios 45-64 (as the catalogue describes):

“the Perihermeniae attributed to Apulieus, a poem by Decimus Magnus Ausonius on the seven days of Creation, a sample letter of a monk to an abbot with [more] interlinear and marginal glosses, and other miscellaneous verses, definitions, and excerpts.”

What is most striking about Scribe B’s contribution to this manuscript is its variety. We know that Carolingian minuscule became a widely used script for writing codices for professional and educational purposes from the 9th to the 13th century. Scribes used this script widely on into the 12th century for a variety of reasons, right when books were being produced by a number of different workers in the secular world. The script itself became codified and sharpened as a professional way of writing. If we can date the burgeoning use of Carolingian minuscule to c. 800, then we can hypothesize that Scribe A was quite new to this script. Indeed, the spaces between words are much clearer in Scribe B’s contributions in fols. 1-4 and 45-64, and clear word separation is one chief characteristic of a well-practiced hand.

In attempting to establish the structure of the book, I found a discrepancy between my collation formula and the one determined by the cataloguer. The manuscript contains a 19th-century foliation and has prickings throughout, though these were of little help in establishing just how many gatherings the bound object contained. Initially the collation differences presented a problem in establishing structure. In short, the catalogue accounted for 9 quires, against the 10 I came up with over and over again in my own count.

We were both accounting for the same number of leaves in the manuscript. We also both noticed a discrepancy in the current 2nd and 3rd quires. Herein was the problem: with some help from Will Noel, I discovered that the current quires 2-3 were misplaced. The first sign was that Gregory’s Rule (hair side of parchment always faces hair side, and flesh side faces flesh side) was broken at 8v-9r. Under my collation, there was no way to tell where the ‘missing’ leaf was.

Codicologists meet problems with incorrect binding all the time, and while this binding and foliation didn’t account for the discrepancy in the leaves, re-examination of the book eventually did. The manuscript’s current binding, in English diced Russia leather, was made for Sir Thomas Phillipps in the 19th century. The watermark on the pastedown reads ‘J. Whatman 1832.’ This could very well be a manifestation of the use of old books by antiquarians in the 19th century who, not well understanding the nature of medieval codices, re-bound and re-labeled them into new codices for personal keepsakes and exhibition. Elizabeth Kolbert writes about this reality of antiquarians and aristocrats housing artifacts and even fossils in the 19th century as collectors, rather than researchers. Alas, this seems to epitomize the life of most old books in the hands of owners who do not expose them to research. In the same vein, we should not keep books from being digitized. According to Tim Stinson’s research, less than 2% of the entirety of medieval manuscripts in the world have been digitized. Though this statistic is now a few years old, we might be able to look to the future of accessible manuscripts with a sense of positivity, as recently the Vatican Library put over 4,000 of their manuscripts online for free. But what does this mean for the researcher? Certainly the step now is not just to digitize quickly, but to release the manuscripts as Open Data (not with restrictive licenses), as Penn has done with their collections in OPenn.

In the case of LJS 101, I couldn’t examine this central problem of a missing folio without digital images and access to 19th-century editions. I had to find where the text matched up after 8v, for it was not at the current 9r. It was time to search for a relatively modern edition of the Latin text itself to reconcile the discrepancy. I found an 1877 edition in HathiTrust (ed. Karl Meiser).

Text-Searching an 1877 Edition

The end of the current 8v contained text that correlated with that of the current 12r.

Bottom 4 lines 8v

So, the text at bottom of 8v reads (from above):

quod significat subiectum est quocirca unū[m] quoque  

whereas the text at the top of 9r does not follow on from it, as comparison with the 1877 edition shows:

reterea non quod nos intelligum eequum 

The text itself showed us where Gregory’s Rule had been broken: the text of leaf 8r in the currently bound manuscript actually follows on from 12v. What essentially occurred was a mis-binding and foliation done in the 19th century, where the currently bound 12v should ideally be 7v. The text at the bottom of 12v currently reads

apud Scythas amara nec acida, sed apud ipsos quoque

and the following text on 13r doesn’t correlate:

p[er] mixtio ista significat; Quod si unum significant to [to]ta p[er] mixtio pars inde separate nihil extra designat;

From the 1877 Latin edition the correct text after ipsos quoque is sunt dulcia et apud omnes, thus

                        [last line 12v]: apud Scythas amara nec acida, sed apud ipsos quoque [begin 8r]: sunt dulcia et apud omnes gentes eodem modo: ita quoque omnia nomina si naturaliter essent, isdem omnes homines uterentur. 


This is but one example of how the text within the manuscript hadn’t matched up correctly. Instead of continuing to chart out the discrepancies, I’ll explain how I rendered a collation based on these findings. The text was the governing factor in putting the folios in the right order, and on this more minute level, we can see how it makes sense to re-puzzle a book whose folios are out of order. On a more general level, all that happened here was that quires 2 and 3 had been separated where they should have been bound together. If rebound, the second and third quires should simply be ‘quire 2.’ Based on the 19th-century foliation, the current folio 5 should ‘ideally’ be 1, 9 should be 2, 10 should be 3, and so on through the 8 folios of the quire.
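For readers who would like to try this kind of text-matching themselves, here is a minimal sketch of the idea in Python. It is not the method I used (I searched the edition by eye); it simply shows how one might ask which leaf's opening words follow a given closing line in a reference edition. The function names, the three-word comparison window, and the simple normalization are all my own assumptions, and the only data are the snippets quoted above.

```python
import re


def normalize(text):
    """Lowercase, drop editorial brackets and punctuation, collapse spaces.

    Assumption: plain ASCII transcriptions; characters outside a-z
    (including abbreviation marks) are treated as word breaks.
    """
    text = text.lower()
    text = re.sub(r"[\[\]]", "", text)      # p[er] -> per
    text = re.sub(r"[^a-z\s]", " ", text)   # punctuation -> space
    return " ".join(text.split())


def next_leaf(edition, explicit, incipits, width=3):
    """Return the label of the leaf whose first `width` words follow
    `explicit` in `edition`, or None if nothing matches."""
    ed = normalize(edition)
    tail = normalize(explicit)
    pos = ed.find(tail)
    if pos == -1:
        return None
    following = ed[pos + len(tail):].split()[:width]
    for label, incipit in incipits.items():
        if normalize(incipit).split()[:width] == following:
            return label
    return None


# Snippets quoted in this post; the edition text is only the sentence cited.
EDITION = (
    "apud Scythas amara nec acida, sed apud ipsos quoque sunt dulcia et "
    "apud omnes gentes eodem modo: ita quoque omnia nomina si naturaliter "
    "essent, isdem omnes homines uterentur."
)
EXPLICIT_12V = "apud Scythas amara nec acida, sed apud ipsos quoque"
INCIPITS = {
    "13r": "p[er] mixtio ista significat",
    "8r": "sunt dulcia et apud omnes gentes eodem modo",
}

print(next_leaf(EDITION, EXPLICIT_12V, INCIPITS))
```

With real transcriptions one would need a more forgiving comparison, since scribal spelling and abbreviation rarely agree with a printed edition word for word.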

I presume Penn isn’t in the business of physically rebinding a 9th-century book, especially when the current binding is in great condition. If we wanted to see this book as it was originally bound, or at least how it was bound before 1832, how could we reposition the folios against the foliation in the upper right-hand corner? We could certainly sit down with a pen and notebook and draw up a new collation, but what if we want to read the text from leaf to leaf as if it were in the correct order?

The interface I used to visualize such a structure was the collation visualization system, initially developed by a collaborative team led by Dot Porter to generate collation diagrams from a model, rather than by counting and charting by hand. We were told during our course discussions on collation that this system had been created for visualizing collation models, but we soon learned it can do much more. At its core, the program gives medievalists a fundamentally new way to approach collation, and it can be repurposed for tasks beyond the one it was designed for.

The repurposing I refer to came about when I was able to recreate a binding that reflected the original quire structure. Instead of laying out the entire structure online, I used the program to visually capture what couldn’t be imagined without both the digital images and the visualization program. For instance, since the program automatically begins at “Quire 1” with every collation formula, the screenshots provided render “Quire 1” where I am actually visualizing quire 2, a quire 2 that currently doesn’t exist in the book itself.

First we will see quires 1-3 as the book is currently bound. Scribe B added the leaves making up fols. 1-4 for an introduction and the decoration of a beautiful initial, and what follows is what you would see if you walked into the Kislak Center and opened up this book upon request.

Currently bound quires 1-3, Q1:



Currently bound Q2:



And currently bound Q3:



These are screenshots of the visualized quires; they are live online here.

Now onto the digital reconstruction. If you were to walk into the Kislak Center and gaze upon this book, you couldn’t read the book straight through with matching Latin unless you were accompanied by this corrected version:

Here is the live online visualized quire for the reconstruction, and below is the screenshot:




What this small foray reveals is the strikingly fundamental role that digital scholarship, and digitizing itself, can play in medieval studies. What’s more, this demonstration shows but one aspect of what the digital world can offer. In the case of transcription practice, data-mining, and textual editing, programs like T-PEN and of course TEI-encoded texts have seriously revitalized what we can see when we evaluate texts and ask the same fundamental questions in order to conduct research. Their value lies not so much in making life easier, though surely they do, as in creating a distinctly different paradigm one can adopt when doing scholarship of any kind with any old book.


Collation Modeling and Visualization: Video Tutorials

Over the past year or so, a group of us at SIMS and elsewhere have been developing a system for visualizing the physical collation of medieval manuscripts. At the moment, this consists of two things:

  1. Figures that illustrate the make-up of quires: number of leaves, whether leaves are missing or added, etc.
  2. Using digital images of manuscript pages to give an idea of how a quire would look, were it disbound: showing how folios that are disjunct in a bound manuscript relate to one another when the manuscript is unbound.

Here is a screenshot of what this looks like:

BL Cotton Claudius b iv, aka the Old English Illustrated Hexateuch. Showing Quire 3 (4, +2).


You can create these yourself, for the manuscripts you are working with! You don’t even need a collation formula. You do need to be able to express the collation, or at least have an idea of which folios go in which quire. One of the nice things about this system, even in the current beta form, is that it can enable you to compare different collations for the same item. It could help you figure it out!

Instructions for building collation models and visualizing them are on Github. You won’t need to download any code, although the code is there if you are interested or curious. If you want the bifolia layout view, you will need to be able to provide an Excel spreadsheet associating folio or page numbers with image files.

Does that still sound like a lot of work? Never fear! I’ve made a set of video tutorials to walk you through the entire process. I hope these are helpful. And if you are still unsure about doing this yourself even after the videos, be aware that I’ll be leading a workshop at the International Congress on Medieval Studies in Kalamazoo, MI, next May. Maybe I’ll see you there! The videos are embedded below. Be sure to click on the “HD” button at the bottom of each video, or else the videos are very blurry.


Libraries Supporting Digital Scholarship: The Schoenberg Institute for Manuscript Studies as an Object Lesson

A version of this talk was presented as the keynote for the annual meeting of the Association of College and Research Libraries – Delaware Valley Chapter, in Philadelphia PA on November 6, 2014.

Thank you very much, and thank you especially to Terry Snyder for inviting me to speak with you all this morning. Today is a good day to talk about the Schoenberg Institute for Manuscript Studies (SIMS); after this talk I will be heading down the hall to attend the annual SIMS Advisory Board meeting, and tomorrow and Saturday I’ll be attending the 7th annual Schoenberg Symposium on Manuscripts in the Digital Age. So this is an auspicious week for all things SIMS.

The topic of this talk is the Schoenberg Institute for Manuscript Studies and how it may be considered an object lesson for libraries interested in supporting digital scholarship. Penn Libraries has invested a lot in SIMS, and while much of SIMS will be very specific to Penn, I hope our basic practices might provide food for thought for other institutions interested in supporting research and scholarship in the library.

SIMS is a research institute embedded in the Kislak Center for Special Collections, Rare Books and Manuscripts in the University of Pennsylvania Libraries. It exists through the generosity and vision of Larry Schoenberg and his wife, Barbara Brizdle, who donated their manuscript collection (numbering about 300 objects) to Penn Libraries, with the agreement that the Libraries would set up an institute to push the boundaries of manuscript studies, including but not limited to digital scholarship. (Although my job focuses on the digital, and indeed that term features in my official title, I also have responsibilities for our physical manuscript collections.) Penn did this, and SIMS was launched on March 1, 2013. As a research institute we develop our own projects and push our own agenda, and although many of our projects are highly collaborative we do not “serve” scholars; we are scholars.

Guided by the vision of its founder, Lawrence J. Schoenberg, the mission of SIMS at Penn is to bring manuscript culture, modern technology and people together to bring access to and understanding of our intellectual heritage locally and around the world.
We advance the mission of SIMS by:

  • developing our own projects,
  • supporting the scholarly work of others both at Penn and elsewhere, and
  • collaborating with and contributing to other manuscript-related initiatives around the world.

SIMS has 13 staff members, but it is helpful to know that of this list only two are dedicated to SIMS work full-time (Lynn Ransom, Curator, SIMS Programs and Jeff Chiu, Programmer Analyst for the Schoenberg Database of Manuscripts). Everyone else on staff is either part time (the SIMS Graduate Fellows) or has responsibilities in other areas of the libraries, and beyond. Mitch Fraas, for example, is co-director of the Penn Digital Humanities Forum, a hub for digital humanities at Penn hosted through the School of Arts and Sciences.

Over the last couple of weeks, as I have been considering what I might say to you all this morning, I have also been spending a lot of time working on the Medieval Electronic Scholarly Alliance, a federation of digital medieval collections and projects that I co-direct with Tim Stinson, a professor of English at North Carolina State University. MESA is essentially a cross-search for many and varied digital collections, enabling one (for example) to search for a term – we have a fuzzy search that will include variant spellings in a search – and then one can facet the results by format (for example illustrations, or physical objects), discipline, or genre. One can also federate by “resource”, searching only those items that belong to particular collections.

Searching MESA for Jerusalem with fuzzy search enabled, limited to format of “Illustration”.

The work that I’ve been doing for MESA over the past two weeks involves taking data provided to us and converting it from whatever format we get, into the Collex RDF XML format required by MESA. In some cases, this is relatively easy. The Walters Art Museum, for example, through its Digital Walters site, provides high-resolution images of their digitized manuscripts using well-described and consistent naming conventions, and also provides TEI-XML manuscript descriptions that are also consistent as well as being incredibly robust. These files are all released under a Creative Commons Attribution-ShareAlike 3.0 Unported license, and they are easy to grab or point to once you know the organization of the site and the naming conventions.

Walters Art Museum manuscripts on The Digital Walters site.

Not all project data is so simple to access.

The data in the British Library Catalogue of Illuminated Manuscripts is open access (the metadata is under a Creative Commons license, and the images are in the public domain), but it is “black boxed”: trapped behind an interface. The only way to access the data is to use the search and browsing capabilities provided by the online catalog. To get the data for MESA, our contact at the BL sent me the Access database that acts as the backend for the website, and I was able to convert that to the formats I needed to be able to generate our RDF.

Images from Harley 603 from the British Library Catalogue of Illuminated Manuscripts.

So what does all this have to do with SIMS? Well, as I was doing this conversion work, I had a bit of an epiphany. I realized that pretty much everything we do at SIMS can be described in terms of data reuse.


And as I thought about how I might describe our various projects in terms of data reuse, I also realized that reuse of data is not new. In fact, it is ancient, and thinking in these terms puts SIMS at the tail end of a long and storied history of scholarship.


I’m not starting at the beginning, but I do want to give you a sense of what I mean when I say that data has been reused for the past couple thousand years (at least). One of my favorite early examples would have to be ancient Greek epics, such as the Iliad.

Iliad. Book 10. 421-434, 445-460, P. Mich. Inv. 6972, Special Collections Library (2nd c. BCE)

Here is a papyrus fragment, housed in the University of Michigan Libraries and dating from the second century BCE, containing lines from Book 10 of the Iliad. Thousands of similar fragments survive, containing variant lines from the poem.

Marciana Library 822, Venetus A, fol. 24r (10th c.)

And this is a page from the manuscript commonly known as Venetus A, Marciana Library 822, the earliest surviving complete copy of the Iliad, dating from the 10th century (a full 12 centuries younger than the papyrus fragment). In addition to the complete text, you can see that there are many different layers of glosses here: marginal, interlinear, intermarginal. These glosses contain variant readings of the textual lines, variants which are in many cases reflected in surviving fragments.

Penn Ms. Codex 1058, Glossed Psalter, fol. 12r (ca. 1100)

My next example is from a Glossed Psalter from our collection, Ms. Codex 1058, dating from around 1100. This manuscript is also glossed, but rather than variant readings, these glosses are comments from Church Fathers, pulled out of the context of sermons or letters or other texts, and placed in the margin as commentary on the psalm text.

Penn Ms. Codex 1640, Thomas of Ireland Manipulus Florum, fol. 114r

This example is a bit later, an early 14th century Manipulus Florum, Ms. Codex 1640. Like the glossed psalter, quotes from the church fathers and other philosophers are again pulled out of context, but in this case they are grouped together under a heading – in this example, the heading is “magister”, or teacher, and presumably the quotes following describe or define “magister” in ways that are particularly relevant to the needs of the author.

Penn LJS 267, De ludo scacchorum seu de moribus hominum et officiis nobilium … fol. 136v

Text is not the only type of data that can be reused, historically or now. We can also reuse material. Can you all see the sign of material reuse here? Check the top and bottom of the page. This is a palimpsest. What’s happened here is that a text was written on some parchment, and then someone decided that the text was no longer important. But parchment was expensive, so instead of throwing it away (or just putting it on a shelf and forgetting about it) the text was washed or scraped off the page, and new text was written over top. We can still see the remnants of the older text.

Penn LJS 395, Manuscript pastedowns from De proprietatibus rerum, back pastedown side 2

This is a page from LJS 395, a 13th century manuscript fragment that’s been repurposed to form part of the binding for a 16th century printed book. This is really typical reuse, and many fragments that survive do so because they were used in bindings.

How about this one?

Penn Ms. Codex 1056, Book of Hours Use of Rouen, ff. 24v-25r

This is a trick question. This is an opening from a 15th century book of hours from our collection, to compare with this.

Penn Ms. Coll 713, Breviary Collages, No. 1

This 19th century Breviary Collage was created by literally cutting apart a 15th century Flemish Breviary and pasting the scraps onto a square of cardboard. It is a bit horrifying, but it’s my favorite example of both reuse of material and, if not reuse of text, then reuse of illustration. Certainly the content is being reused as much as the material. Although I would never do this to a manuscript (and I hope none of you would do this either), I feel like I have a kindred spirit in the person who did this back in the 1800s, someone who saw this Breviary as a source of data to be repurposed to create something new.

I do this, only I do it with computers. Here is my collage.

Collation Visualization for LJS 266

Okay, it’s not a collage, it’s a visualization of the physical collation of Penn LJS 266 (La generacion de Adam) from the Schoenberg Collection of Manuscripts, just one created as part of our project to build a system for visualizing the physical aspects of books in ways that are particularly useful for manuscript scholars. Collation visualization creates a page for each quire, and a row on that page for each bifolium in the quire. On the left side of each row is a diagram of the quire, with the “active” bifolium highlighted. To the right of the diagram is an image of the bifolium laid out as it would be if you disbound the book, first with the “inside” of the bifolium facing up, then the “outside” (as though the bifolium is flipped over).
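To make the "row for each bifolium" idea concrete, here is a small Python sketch of how the conjugate leaves of a quire pair up, and which pages face up on each side of the unfolded sheet. This is only an illustration under simple assumptions: a regular quire with an even number of leaves, consecutive foliation, and no missing or added leaves. It is not how the XSLT in Collation 0.1 works, and the function name and output structure are invented for the example.

```python
def bifolia(first_folio, leaves):
    """List the bifolia of a regular quire, outermost first.

    Assumptions (hypothetical, for illustration): `leaves` is even,
    folios are numbered consecutively from `first_folio`, and no
    leaves are missing or added.
    """
    if leaves % 2:
        raise ValueError("a regular quire has an even number of leaves")
    rows = []
    for i in range(leaves // 2):
        left = first_folio + i                 # leaf in the first half
        right = first_folio + leaves - 1 - i   # its conjugate
        rows.append({
            "leaves": (left, right),
            # side facing the centre of the quire when unfolded
            "inside": (f"{left}v", f"{right}r"),
            # the same sheet flipped over
            "outside": (f"{right}v", f"{left}r"),
        })
    return rows


# A quire of 8 leaves beginning at folio 5.
for row in bifolia(5, 8):
    print(row["leaves"], row["inside"], row["outside"])
```

The innermost row is the only one where the two inside pages are also a normal opening in the bound book; every other row brings together pages that are disjunct in the codex.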

To generate a visualization in the current version of collation visualization, 0.1 (the source XSLT files for which are available via my account on GitHub), I need two things: manuscript images, and a collation formula (the collation formula describes the number of quires in a codex, how many folios in each quire, if any folios are missing, that kind of thing). To create this particular visualization, first I needed to get the images.

LJS 266 in Penn in Hand

Our digitized manuscripts are all available through Penn in Hand, which is very handy for looking at manuscript images and reading descriptive information, but much like the British Library database we looked at earlier, it’s a black box.

Downloading an image file from Penn in Hand

It is possible to use “ctrl-click” to save images from the browser, but the file names aren’t accessible (my system reverts to “resolver.jpg” for all images saved from PiH, and it’s up to me to rename them appropriately).

Collation formula for LJS 266 in Penn in Hand (the third entry under Notes:)

The collation formula is in the description, and it’s easy enough for me to cut and paste that into the XSLT that forms the backbone of Collation 0.1.
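As a rough illustration of what a collation formula encodes, the Python sketch below expands a deliberately simplified notation into quires and folio ranges. The notation is an assumption made for this example ("2-3(8)" means quires 2 and 3 each have 8 leaves); real catalogue formulas, including those in Penn in Hand, vary and also record missing and added leaves, which this sketch ignores. The sample formula is invented and does not describe LJS 266.

```python
import re


def parse_formula(formula):
    """Expand terms like '1(4)' and '2-3(8)' into (quire, leaves) pairs.

    Assumption: a simplified, hypothetical notation with no missing
    or added leaves.
    """
    quires = []
    for start, end, leaves in re.findall(r"(\d+)(?:-(\d+))?\((\d+)\)", formula):
        last = int(end) if end else int(start)
        for quire in range(int(start), last + 1):
            quires.append((quire, int(leaves)))
    return quires


def folio_ranges(quires, first_folio=1):
    """Map each quire number to its (first, last) folio, assuming
    consecutive foliation starting at `first_folio`."""
    ranges = {}
    next_folio = first_folio
    for quire, leaves in quires:
        ranges[quire] = (next_folio, next_folio + leaves - 1)
        next_folio += leaves
    return ranges


# Invented example formula, not taken from any catalogue record.
print(folio_ranges(parse_formula("1(4), 2-3(8), 4(6)")))
```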

It is actually possible to get XML from Penn in Hand, by replacing “html” in the URL with “XML”.

XML in Penn in Hand

The resulting XML is messy, but reusable – a combination of Dublin Core, MARC XML, and other various non-standard tagsets.

Screenshot of OPenn (under construction)

Because we know how important it is to have clean, accessible data (indeed my own work and other SIMS projects depend on it), we have been working for the past year on OPenn, which will publish high-resolution digital images (including master TIFF files) and TEI-encoded manuscript descriptions (generated from the Penn in Hand XML) in a Digital Walters-style website, with Creative Commons licenses for the TEI and the images in the public domain. OPenn is still in development, but will be launched at the end of 2014.

Having consistent data for our manuscripts in OPenn will enable me to do with our data what I already did with the Digital Walters data: programmatically generate collation visualizations for every manuscript in our collection. Because the Digital Walters data was accessible in a way that made it easy for me to reuse it, and was described and named in such a way that it was easy to figure out what images match up with which folio number, I was able to generate collation visualizations for every manuscript represented in the Digital Walters that includes a collation formula, and I was able to do it in a single afternoon. The complete set of visualizations is available here.

Mock-up of collation form

Version 0.2 of Collation will be based on a form (this is the current mock-up of how the form will look). Instead of supplying a collation formula, one would essentially build the manuscript, quire by quire, identifying missing, added, and replaced folios, and the output would be both a visualization and a formula.

Why do this? It is a new way of looking at manuscripts in a computer, completely different from the usual page-turning view, and one that focuses on the physicality of the book as opposed to its state as a text-bearing object. A new view will hopefully lead to new research questions, and new scholarship.

Moving on from Collation, the standard-bearing project for SIMS (and one that predates SIMS itself by many years) is the Schoenberg Database of Manuscripts (SDBM). This is a project that reuses data on a massive scale, and does it to great effect.

Entry #1 in A Catalogue of the Medieval Manuscripts in the University Library, Aberdeen, By M. R. James (1932)

This photo is the first entry in the catalog of manuscripts at the University of Aberdeen Library, written by M. R. James. This entry, and other entries from this catalog, and from many other library and sales catalogues, have been entered into the SDBM.

Entry from Schoenberg Database of Manuscripts (current version)

Here is that same entry in the current version of the catalog. However! This year Lynn Ransom received a major grant from the NEH to convert the database to new technologies, and I’d rather show you that version.

Entry in the Schoenberg Database of Manuscripts (new version)

So, here is that same entry again in the new version of the Schoenberg Database, which is currently under development. “What is the big deal?” I hear you ask. As well you may. Let me show you a different entry from that same catalogue.

Entry for a record with eight matching records

You can see in this example, on the “Manuscript” line: “This is 1 of 8 records referring to SDBM_MS_5688.” The SDBM is in effect a database of provenance – it records, not where manuscripts are now but where they have been noted over time, through appearances in sales and collections catalogues. This manuscript has eight records representing catalogs dated from 1829 to 1932. This enables us to trace the movement of the manuscript during the time represented in the database.
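The underlying idea, many catalogue records pointing at one manuscript and read in date order, can be sketched in a few lines of Python. This is only an illustration and says nothing about how the SDBM is actually implemented. The field names are hypothetical, and apart from the identifier SDBM_MS_5688 and the years 1829 and 1932 mentioned above, the sample records are invented.

```python
from collections import defaultdict


def provenance_chains(records):
    """Group catalogue records by manuscript ID, oldest record first."""
    chains = defaultdict(list)
    for record in records:
        chains[record["manuscript"]].append(record)
    return {
        ms: sorted(recs, key=lambda r: r["year"])
        for ms, recs in chains.items()
    }


# Hypothetical sample data: field names and "source" values are invented.
RECORDS = [
    {"manuscript": "SDBM_MS_5688", "year": 1932, "source": "collection catalogue"},
    {"manuscript": "SDBM_MS_5688", "year": 1829, "source": "sale catalogue"},
    {"manuscript": "SDBM_MS_0001", "year": 1900, "source": "sale catalogue"},
]

for ms, chain in provenance_chains(RECORDS).items():
    print(ms, [r["year"] for r in chain])
```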

Eight records for a single manuscript from SDBM.

Why create the Schoenberg Database? Although it was begun by Lawrence Schoenberg as a private database, which enabled him to track the price of manuscripts, we develop it now to support research around manuscript studies, and around trends in manuscript collecting. Study of private sales in particular could be useful in other areas of studies, such as economic history (since manuscripts are scarce, and expensive, and people will be more likely to purchase them and pay more money for them when they have money to spare).

A new project, one that we have been working on just this year, is Kalendarium. Instead of a database consisting of manuscript descriptions from catalogs, Kalendarium will be a database consisting of data from medieval calendars themselves.

Calendar from Ms. Codex 1056, Book of Hours Use of Rouen, ff. 1v-2r

This is a couple of pages of a calendar from Penn Ms. Codex 1056, a 15th century Book of Hours. Calendars, common in Books of Hours, Breviaries and Psalters, essentially list saints and other celebrations for specific days of the month. Importance may be indicated by color: as you can see here, some saints’ names are written in gold ink while most alternate red and blue (red and blue being equally weighted, and gold used for more important celebrations).

A major expectation of Kalendarium is that the data will be generated through crowdsourcing, that is, we’ll build a system where librarians can come and input the data for a manuscript in their collection, or scholars and students can input data for a calendar they find online, or while they are looking at a manuscript in a library. The thing is, transcribing these saints’ names can be difficult, even for someone trained in medieval handwriting. So, instead of transcriptions, we’ll be enabling people to match saints’ names and celebrations to an existing list. And where do we get that list?

Ask and ye shall receive. In the 1890s, Hermann Grotefend published a book, Zeitrechnung des deutschen Mittelalters und der Neuzeit… (Hannover, Hahn, 1891-98), that included a list of saints, and the dates on which those saints are venerated. And it’s on HathiTrust, so it’s digitized, so we can use it!

Well, it’s in Portable Document Format, more commonly known as PDF. Like Penn in Hand and the British Library Catalogue of Illuminated Manuscripts, PDF is another kind of black box. Although it’s fine for reading, it’s not good for reuse (there are ways to extract text from PDF, although it’s usually not very pretty). Luckily, we were able to find another digital version.


This one’s in HTML. Not ideal, not by a long shot, but at least HTML provides some structure, and there is structure internal to the lines (you can see pipes separating dates, for example). Doug Emery, Special Collections Digital Content Programmer and the SIMS staff member responsible for Kalendarium, has been working with a collaborator in Brussels to generate a usable list from this HTML that we can incorporate into Kalendarium as the basis for our identification list.

Kalendarium prototype site

We have a prototype site up; it’s not public, and for now it’s only accessible on campus. We’ve been experimenting, and you can see a handful of manuscripts listed here.

Kalendarium form

Similar to Collation 0.2, in Kalendarium you’re using the system to essentially build a version of your calendar. You can identify colors, and select saints from a drop-down list. Unfortunately we have already found that many saints that are showing up in our calendars aren’t in Grotefend, or they are celebrated on dates not included in Grotefend; but this is an opportunity for us to contribute to the list in a major way.

Why do this at all? Calendars are typically used to localize individual manuscripts – if we see that particular saints are included in a calendar, we can posit that the book containing that calendar was intended to be used in the areas where those saints were venerated. However, if we scale up, we’ll be able to see larger patterns: veneration of saints over time, saints being venerated on different days in different places, and we should be able to see new groupings of books as well.
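As a rough illustration of what "scaling up" could mean in practice, the Python sketch below inverts a set of calendars into an index from saint to date to manuscripts, and then lists the saints who turn up on more than one date. The manuscript labels, saints, and dates are placeholders, and the data shape is an assumption rather than Kalendarium's actual data model.

```python
from collections import defaultdict

# Hypothetical calendars: manuscript label -> list of (saint, (month, day)).
calendars = {
    "Manuscript 1": [("Saint A", (1, 5)), ("Saint B", (2, 10))],
    "Manuscript 2": [("Saint A", (1, 6)), ("Saint B", (2, 10))],
}


def feast_days_by_saint(calendars):
    """Build saint -> date -> [manuscripts that list the saint on that date]."""
    index = defaultdict(lambda: defaultdict(list))
    for manuscript, entries in calendars.items():
        for saint, date in entries:
            index[saint][date].append(manuscript)
    return index


def saints_with_varying_dates(index):
    """Saints recorded on more than one date across the calendars."""
    return sorted(saint for saint, dates in index.items() if len(dates) > 1)


index = feast_days_by_saint(calendars)
varying = saints_with_varying_dates(index)
print(varying)
```

With only two calendars this is trivial, but the same index over hundreds of calendars is where groupings of books by shared saints and dates would start to show.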

Another set of projects SIMS is involved in, the Penn Parchment Project in 2013 and the Biology of the Book Project starting in 2014, involves testing the parchment in our manuscripts – literally reusing the manuscript, extracting data from the material itself. This involves taking small, non-destructive samples to gather cells from the surface of the parchment and testing them to see what type of animal the parchment is made from. Results are interesting; as part of the Penn Parchment Project, an individual who wishes to remain anonymous made expert identification of ten manuscripts from the Penn collection, and got only five of them correct. Clearly, parchment identification could benefit from a more scientific approach. More recently we have joined Biology of the Book, a far-reaching collaboration (including folks at University of York in the UK, Manchester University, The Folger Shakespeare Library, the Walters Art Museum, Library of Congress, University of Virginia, The Getty, and others) to begin the slow process of moving forward a much larger project with the aim to perform DNA analysis on larger numbers of manuscripts. Very little is actually known about the practices surrounding medieval parchment making, including the agricultural practices that supported the vast numbers of animals that were used to create the manuscripts that survive today (and, of course, all those that don’t survive). We think of parchment as an untapped biological archive, and a database containing millions of DNA samples would enable us to discover the number of animals used to build manuscripts, where those animals were bred (and how far they were imported and exported), what breeds were used – many questions that are simply impossible to answer now.

Mitch Fraas, Curator, Digital Research Services and Early Modern Manuscripts, creates maps and other visualizations relating to early books, and blogs about them. He’s used data from the Schoenberg Database of Manuscripts (which is available for download in comma-separated format on the SDBM website, and is updated every Sunday) and data extracted from Franklin, the Penn Libraries’ catalogue, to generate some different visualizations, one of which is shown here: Charting Former Owners of Penn’s Codex Manuscripts.

Diagram: Charting Former Owners of Penn’s Codex Manuscripts (click for interactive version)

The yellow dots are owners, and the larger the dot, the more manuscripts the owner is connected to (Lawrence Schoenberg and Sotheby’s are quite large, as is Bernard M. Rosenthal, a bookseller in New York). Clicking an owner shows the number of manuscripts connected to that person or institution, and clicking a manuscript shows the number of owners connected to that manuscript. This visualization was developed using data from Franklin, and Mitch’s blog post about it provides details on how it was done.
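The data structure behind a diagram like this is simple enough to sketch. The Python below is not Mitch's code; it assumes a hypothetical comma-separated export with one row per manuscript-owner pair and columns named `manuscript` and `owner`, and the rows themselves are placeholders.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: column names and rows are assumptions for illustration.
SAMPLE_CSV = """manuscript,owner
Ms. A,Owner One
Ms. A,Owner Two
Ms. B,Owner One
"""


def build_owner_graph(csv_text):
    """Return (owner -> set of manuscripts, manuscript -> set of owners)."""
    owners = defaultdict(set)
    manuscripts = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        owners[row["owner"]].add(row["manuscript"])
        manuscripts[row["manuscript"]].add(row["owner"])
    return owners, manuscripts


owners, manuscripts = build_owner_graph(SAMPLE_CSV)

# Dot size for each owner: the number of manuscripts connected to them.
dot_sizes = {name: len(mss) for name, mss in owners.items()}
print(dot_sizes)
```

The two dictionaries correspond to the two kinds of click in the diagram: one answers "which manuscripts does this owner connect to?" and the other "which owners does this manuscript connect to?"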

Mapping pre-1600 European manuscripts in the U.S. and Canada

Just this week, for the 7th Annual Lawrence J. Schoenberg Symposium on Manuscript Studies in the Digital Age, Mitch has created a new map, Mapping pre-1600 European manuscripts in the U.S. and Canada, using data from the Directory of Institutions in the United States and Canada with Pre-1600 Holdings. This map shows the location of all holdings included in the directory. Larger collections have larger dots on the map. Clicking a dot will give one more information about the owner and the collection, and there are options for showing current collections or former collections, or for showing only collections with codices (full books, as opposed to fragments or single sheets).

Ms. Roll 1066: Genealogical Chronicle of the Kings of England to Edward IV, circa 1461

We have almost reached the end, but I would like to finish by featuring the project of last year’s SIMS Graduate Fellow, the brand new Dr. Marie Turner. It is still underway, and it is a great example of data reuse. Several years ago, Marie transcribed our Ms. Roll 1066, a 15th-century genealogical roll chronicling the Kings of England from Adam to Edward IV. Her transcription was combined with images of the roll and built into a website, shown in the screenshot here, with links between her transcription and areas on the page. But Marie’s vision is larger than this single roll. There are several other rolls of this type in existence, and her vision is to expand this single project, this silo, not only to incorporate other rolls, but to become a space for collaborative editing (transcription, description, translation, and linking) for the other rolls as well. We have successfully pulled the data from the existing site and converted it into XML, following the Text Encoding Initiative Guidelines, which we’ll use to generate the data we need to import into our new software system.
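For readers who have not seen TEI used this way, here is a minimal Python sketch of how a transcription linked to regions of an image can be expressed. It is not our conversion code: the function name `build_tei`, the image file name, the coordinates, and the sample text are invented for illustration, and the output omits the `teiHeader` that a valid TEI document requires. It relies on the TEI `facsimile`, `surface`, and `zone` elements and the `facs` attribute to tie text to an image region.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(image_url, regions):
    """regions: dicts with 'id', 'box' (ulx, uly, lrx, lry) and 'text'."""
    root = ET.Element(tei("TEI"))
    facsimile = ET.SubElement(root, tei("facsimile"))
    surface = ET.SubElement(facsimile, tei("surface"))
    ET.SubElement(surface, tei("graphic"), {"url": image_url})
    body = ET.SubElement(ET.SubElement(root, tei("text")), tei("body"))
    for region in regions:
        ulx, uly, lrx, lry = region["box"]
        # One zone per annotated area of the image.
        ET.SubElement(surface, tei("zone"), {
            f"{{{XML_NS}}}id": region["id"],
            "ulx": str(ulx), "uly": str(uly),
            "lrx": str(lrx), "lry": str(lry),
        })
        # The transcription block points back at its zone via @facs.
        block = ET.SubElement(body, tei("ab"), {"facs": "#" + region["id"]})
        block.text = region["text"]
    return root


# Hypothetical input: placeholder image name, coordinates, and text.
root = build_tei("roll-section-1.jpg", [
    {"id": "z1", "box": (100, 200, 400, 260), "text": "Sample transcription"},
])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point of going through XML like this is that the links between text and image survive outside any one website, so they can be loaded into a different system later.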

The new Rolls Project will be built in DM, formerly Digital Mappaemundi, an established tool for annotating and linking images, which has been developed by Martin Foys, a medievalist, and which has recently been brought to SIMS for hosting and continued development.

A screenshot of La Chronique Anonyme Universelle, edited by Lisa Fagin Davis, published in DM

This screenshot illustrates how DM looks when linking annotations to areas of an image; you can also link areas of images together. Just last week we got a production version of DM set up on our servers at Penn, and next week we’ll be importing our data – the data we exported from the earlier edition of the Ms. Roll 1066 project – into that production version. We’ll also be importing images of a half dozen other genealogical rolls. We are immensely excited to move the Rolls project to the next phase – and it was all made possible by


I’d like to close with just a few thoughts about WHAT SIMS IS – and whether or not we are an effective object lesson for libraries supporting digital scholarship is probably up for debate. We certainly do scholarship, effectively, within the context of the library, and we do it ourselves: We are scholars, not service providers. However, I think it’s important to note that our scholarship, our research, our tools and our projects are not ends unto themselves. They will all serve to support more work, to allow other scholars to ask new questions, and hopefully to help them answer those questions.
Since we are not service providers, faculty and graduate students aren’t our clients; they are our collaborators, our equals, our partners. We are in this together!
Finally, and I could have said more about this throughout my talk, we take pride in our data. We want data from all of our projects – all the data that we have reused and brought in from other places – to be consistent, with regard to formatting and documentation; accessible, in the technical sense of being easy to find; and reusable, with regard to both format (it is unlikely you will find PDFs as the sole source for any information on our site) and license. Likewise our code: we make use of GitHub (a site for publishing open source code) individually and through the Library’s account, and all our code is and will always be open source.

Thanks so much again, and I’m happy to take questions now.


Visualizing the Construction of Manuscripts, through Collation and Video (DigiPal IV Symposium)

It’s been a month now since the fabulous DigiPal IV Symposium, and I’ve been meaning to share the video of my own contribution to that event since I returned to Penn in early September. My talk is “Visualizing the Construction of Manuscripts, through Collation and Video,” and it introduces two projects that we are actively undertaking here at SIMS. The first is a visualization system for the physical collation of medieval manuscripts (see some example results, and our slightly out-of-date source code on GitHub); the second is our ongoing project to create videos about manuscripts in our collection.