Data Cleaning and Preparation

Earlier today Ben Jackson gave the first in this semester’s series of digital methods open workshops. Here are a few rough notes on what we covered. If you missed the workshop and want to try out some of it on your own, you can find the tasks here. For details of forthcoming workshops, go here. All workshops are free and open to everyone (but it helps if you register).

Ben started us off with a rapid hurtle through some of his recent and ongoing projects (slides), including his collaboration with Caroline Bassett exploring ways of analysing and visualising Philip K. Dick’s writing (counting electric sheep, baa charts), and work bringing to life the text data of the Old Bailey Online. (It’s a tremendously rich archive of nearly 200,000 trials heard at the Old Bailey between 1674 and 1913.) Bringing to life, and also bringing to unlife: Ben uses a kind of estrangement effect to remind the observer of what the data isn’t telling us, populating his legal drama puppet show with a cast of spoopy skellingtons.

Ben Jackson puppet show

Most of the workshop was a free exploration of prompts and tools Ben pulled together. People basically tried out whatever they liked, while he glided from table to table rendering assistance.

Calibre is a free ebook manager that is also a bit of a Swiss army knife, and it just so happens one of its fold-out doohickeys is a very good ebook-to-plain-text converter. Ebook files (.AZW, .EPUB, .MOBI etc.) are stuffed with all kinds of metadata that usually needs to be cleared away before you can do any analysis on the raw text itself. We also did something similar with another free tool, AntFileConverter, turning PDF into plain text. The lesson was that documents can be ornery and eccentric, and different converter tools will work differently and give rise to different glitches: there is “no single converter that will just magically work on every document.”

AntFileConverter is part of a family of tools. We also checked out TagAnt and AntConc. I feel like I only scratched the surface of these. TagAnt creates a copy of a text file with all the grammatical parts-of-speech tagged. So if you input something like “We waited for ages at Clapham Junction, with the guard complaining about people blocking the doors” you get something like “We_PP waited_VVD for_IN ages_NNS at_IN Clapham_NP Junction_NP ,_, with_IN the_DT guard_NN complaining_VVG about_IN people_NNS blocking_VVG the_DT doors_NNS ._SENT” as output. PP is a personal pronoun, VVD is a past tense verb, IN is a preposition or subordinating conjunction, and so on. By itself this just seems to be an extremely pedantic form of vandalism. It does let you fairly easily find out if, for example, an author just loves adverbs. And tagging parts of speech could be the first step toward more interesting manipulations, for creative purposes (shuffle all the adverbs) and/or analytic purposes (analysis of genre or authorship attribution).
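To get a feel for what you could do with that tagged output, here is a minimal Python sketch that splits the word_TAG tokens apart and counts the tags. It is my own illustration rather than anything TagAnt does internally; the function name `parse_tagged` is made up, and the assumption that adverbs would carry tags starting with "RB" comes from Penn-style tagsets, so check it against the tagset your tagger actually uses.

```python
from collections import Counter

# The tagged sentence from the example above.
tagged = ("We_PP waited_VVD for_IN ages_NNS at_IN Clapham_NP Junction_NP ,_, "
          "with_IN the_DT guard_NN complaining_VVG about_IN people_NNS "
          "blocking_VVG the_DT doors_NNS ._SENT")

def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST underscore, so words that
        # themselves contain an underscore are still handled.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in pairs)

# Pull out every word whose tag starts with "VV" (the verb tags seen above).
verbs = [word for word, tag in pairs if tag.startswith("VV")]

# Assumption: adverbs are tagged "RB..." as in Penn-style tagsets.
adverbs = [word for word, tag in pairs if tag.startswith("RB")]

print(tag_counts.most_common(3))
print(verbs)
print(adverbs)
```

Swap the prefix in the list comprehension and you can pull out any part of speech you like, which is roughly the "does this author love adverbs" question in code form.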

AntConc allows you to create concordances. A concordance is (more or less) an alphabetical list of key terms in a text, each one nestled in a fragment of its original context. So it’s a useful way to browse an unreadably large corpus based on some particular word (and so to some extent some particular theme) that interests you. Sure Augustine had stuff to say about sin and grace, but what did he think about, I don’t know, fingers?
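For the curious, the basic keyword-in-context idea can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not how AntConc is implemented: the function name `concordance`, the `width` parameter and the sample text are all invented, and the hits come out in order of appearance rather than sorted.

```python
import re

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    # \b restricts matches to whole words; re.escape guards against
    # keywords containing regex metacharacters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, just for demonstration.
sample = ("The light was fading. She lit a lamp, and the light from it "
          "filled the room. Lightly, he stepped outside.")

hits = concordance(sample, "light", width=20)
for line in hits:
    print(line)
```

Note that "Lightly" is skipped because of the whole-word matching; dropping the `\b` markers would catch it too.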


So a concordance helps you find sections you might want to read more thoroughly. But I guess it doesn’t have to be used only like that, as a kind of map or a very comprehensive index. It could also be read in its own right, and that reading could be a legitimate way of encountering and gaining knowledge of the underlying text.

How might, for instance, reading every appearance of the word “light” constitute its own way of knowing how the term “light” is working within a text? Are such readings reliably productive of knowledge? Or is it more like you might get lucky and stumble on something intelligible, like how a particular word is being tugged in distinct, divergent directions by two different discourses it’s implicated in?

How do these tools actually work? Well, going by the name and the logo, a really fast clever ant just does it for you. Thanks ant!

AntConc screenshot


Voyant Tools is a web-based reading and analysis environment for digital texts. What does that mean in practice? When you feed it your text file, a bright little dashboard pops up with five resizable areas. Each one of these contains a tool, and you can swap different tools in and out. I’d guess there are about fifty or so tools, although I’m not sure how distinct they all are really.

Voyant Tools screenshot 1

At least one tool was very familiar: “Cirrus” in the top left corner makes a word cloud of the text you’ve inputted, with the most frequent words appearing the largest. Very common words like “a” and “the” are filtered out (in the lingo, they are “stopwords”). The bottom right tool, “Contexts,” was also pretty familiar, since it seems to be a concordance, like we’d just been doing in AntConc. “Summary” and “Trends” were pretty self-explanatory. “TermsBerry” required a bit more poking and prodding. It clusters the more frequent words near the middle, the rarer words round the edges. When you hover your mouse pointer over a word, some of the other drupelets light up to show you what other words tend to appear nearby. You can mess with the thresholds and decide exactly how close counts as “nearby.”
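The counting behind a word cloud, and the "what appears nearby" idea behind TermsBerry, can be roughly sketched in Python. This is a guess at the general technique, not Voyant's actual code: the tiny `STOPWORDS` set, the function names and the sample text are all my own inventions, and real stopword lists are much longer.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "on", "at", "it", "was", "is"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(text, n=5):
    """Most frequent non-stopword terms: the raw material for a word cloud."""
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return Counter(words).most_common(n)

def neighbours(text, target, window=2):
    """Count non-stopwords appearing within `window` tokens of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] not in STOPWORDS:
                counts[tokens[j]] += 1
    return counts

# Invented sample text.
sample = ("The guard at the station complained. "
          "The guard waved and the train left the station.")

print(top_terms(sample, 2))
print(neighbours(sample, "guard", window=2))
```

The `window` parameter plays the part of the threshold for how close counts as "nearby": widen it and more words light up.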

The “Topics” tool looks interesting. It starts with random seeds, builds up a distinct word cluster around each seed based on co-occurrence, and then tries to work out how these word clusters are distributed throughout the text. Each word cluster (or “topic”) technically contains all the words in the text, but each one is named after the top ten terms in the cluster. A few of these seem knitted together by some strong affect (“bed i’ve past lay depression writing chore couple suffering usually”) or a kind of prosody or soundscape (“it’s daily hope rope dropped round drain okay bucket bowls”). Others feel tantalisingly not-quite-arbitrary, resonant with linkages in the same way a surrealist painting is (“bike asda hard ago tried open bag surprisingly guy beard”). But I’m not sure how far I trust my instincts about these artefacts, and I definitely don’t yet know how they might be used to deepen my knowledge of a text, or how they relate to various notions you might invoke in a close reading (theme, conceit, discourse, semantic field, layer, thread, note, tone, mood, preoccupation, etc.).

The various tools on your Voyant dashboard also seemed to be linked, although I didn’t get round to fully figuring that out. Definitely whenever I clicked on a word in the “Reader” tool the other displays would change. Oh: and Voyant Tools seems to be pretty fussy, and didn’t want to run on some people’s laptops. I didn’t have any trouble though.

I got a bit sucked into trying to work out what the “Knot” tool does (it’s this strange rainbow claw waving at me) and didn’t spend much time on the last exercise, which was about regular expressions (or regex). Basically, these are conventions which let you do very fancy and complicated find-replace routines. You can search for something like ‘a[a-z]’, which will match aa, ab, ac, ad, etc. Or (one of Ben’s examples) by replacing <[^>]+> with nothing, you can clear out all the XML tags in a text document. You can do something similar in plain old Word (just make sure you check the “Use wildcards” box in the find-replace dialogue, and bear in mind that Word’s wildcard syntax differs from standard regex), but regular expressions proper probably work a little better in a text editor like Atom or Sublime Text.
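Both of those patterns can also be tried out in Python with the standard `re` module. This is just a quick sketch for experimenting; the sample strings are invented for the demonstration.

```python
import re

# Invented snippet of marked-up text.
xml = "<p>It was <emph>very</emph> dark.</p>"

# "<[^>]+>" means: a "<", then one or more characters that are not ">",
# then a ">". Replacing every match with nothing strips the tags.
plain = re.sub(r"<[^>]+>", "", xml)
print(plain)

# "a[a-z]" means: an "a" followed by any single lowercase letter.
matches = re.findall(r"a[a-z]", "a bad banana at last")
print(matches)
```

The first "a" in the second example is not matched because it is followed by a space rather than a lowercase letter. One caveat: the tag-stripping pattern is fine for simple markup, but it can trip over things like a ">" inside an attribute value or a comment, so it is worth eyeballing the result.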

“The purpose of this part of the task is to teach you how to use them, not to teach you how to write them.” Phew! Regular expressions never seem to stick around very long in my memory, but it’s very useful to know in broad terms what they’re capable of. Every now and then a task pops up in the form of, “Oh my God, I have to go through the whole thing and change every …” and that’s my cue to start puzzling and Googling and figuring out whether it can be done with regular expressions. If it can, it will probably be quicker and more accurate, and it will definitely be more satisfying.

So: plenty explored, plenty more to explore. And I’m looking forward to the next workshop, Archival Historical Research with Tropy, on 19 February.