Tyler Shoemaker - Text Analysis Exercise

Page history last edited by Tyler Shoemaker 9 years, 7 months ago

The computers in the English Department’s CRC don’t require log-ons and are connected to a printer. This means one thing: thousands of Word and PDF files have been saved to each one and printed off. Saved and printed, but not always deleted. (Zizek’s comments on toilets seem particularly apropos here -- our computer usage is very “French.”)

I spent part of an afternoon going around to each machine, opening every Word doc I could find, and collating the data into one big file. Total word count: 335,821. Document count: I lost track halfway through the first computer. When I finished I scrubbed the data and used it as my source for this practicum.

I would’ve loved to put everything through Poem Viewer, but it wasn’t feasible. Lexos too would’ve been interesting, but there was just too much (showing that these devices are, like Moretti says of 'Theory,' anything but plug-and-play*). I settled instead on the Voyant suite, using a mix of Cirrus, Corpus Reader, and Word Trends to create a few word clouds and charts. A first round of analysis revealed the need for a stop word filter to get anything meaningful (meaning meaningful in the way I wanted it to mean), and after applying one, I came up with this:



Some of the more popular words like “university” and “press” caught me by surprise until I remembered copying 45 pages of a medievalist’s article somewhere around computer three. I edited the filter to exclude what I thought would be ‘bibliographic’ markers, thinking I could slough off academic apparatuses while preserving ideas. These included:

  • university
  • press
  • pp
  • cambridge
  • york
  • london
  • oxford 

I also added “medieval” to the list, knowing the filter would be a little more violent than I intended it to be, but I wanted to make some room for other trends. That one article makes up almost five percent of my entire data set, and I felt I could level the playing field with the sacrifice. I’m not wholly to blame for this; plugging the file into Overview made our medievalist’s monopoly very apparent and helped sway my decision.
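For anyone who wants to try the same kind of filtering outside Voyant, here is a minimal Python sketch of the idea: count word frequencies, then drop anything on a stop list. It is not what Voyant does internally. The tiny `BASE_STOP_WORDS` set, the `top_terms` helper, and the sample sentence are all made up for illustration; only the bibliographic terms come from the list above.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in for a general stop list.
# Voyant ships its own, much longer one.
BASE_STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

# The extra 'bibliographic' terms named in the text, plus "medieval".
BIBLIOGRAPHIC_STOP_WORDS = {
    "university", "press", "pp", "cambridge",
    "york", "london", "oxford", "medieval",
}


def top_terms(text, stop_words, n=10):
    """Return the n most frequent words in text that are not stop words."""
    # Lowercase and keep runs of letters and apostrophes; digits and
    # punctuation are dropped, so "pp." becomes "pp".
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Hypothetical sample text, not drawn from the actual corpus.
sample = (
    "The experience of time. Cambridge University Press, pp. 12. "
    "Time and the self in medieval London."
)
stop_words = BASE_STOP_WORDS | BIBLIOGRAPHIC_STOP_WORDS
print(top_terms(sample, stop_words, n=3))
```

Keeping the bibliographic terms in a set of their own makes it easy to run the count with and without them and see how much one long article skews the results.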




After consulting with Overview and applying the filter, I had this:



We like our ideas new and, to go out on a limb, very much centered on the individual (time, experience, affect, self, etc.). This isn't altogether surprising, given the intellectual ecology of a graduate program in the humanities. All the same, I couldn't resist tracking these terms on the Ngram viewer, just to see how they'd compare:



Ironically, over the last two hundred years the usage of these terms has remained about as steady as any stop word's.


And finally, though I don't think any correlative evidence can directly substantiate this, it might behoove us to meditate for a moment on a few words that are all but missing from these private/public files: confidence (9), privacy (7), reputation (4), security (4), hack (0), piracy (0), identity theft (0) . . .
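Tallies like these can be reproduced with a few lines of Python. The sketch below is one way to count whole-word occurrences of chosen terms, including a two-word phrase like "identity theft"; it is not how the numbers above were produced. The `count_terms` helper and the sample text are hypothetical.

```python
import re


def count_terms(text, terms):
    """Count whole-word (or whole-phrase) occurrences of each term."""
    # Normalize to lowercase words separated by single spaces so that
    # phrases match across punctuation and line breaks.
    normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
    return {
        # \b keeps "hack" from matching inside "hackers".
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", normalized))
        for term in terms
    }


# Hypothetical sample text, not drawn from the actual corpus.
text = "Privacy matters. Identity theft is a privacy problem; hackers know."
print(count_terms(text, ["privacy", "identity theft", "hack"]))
```

Matching on whole words is a deliberate choice here: a looser substring search would count "hackers" as a hit for "hack" and inflate the totals.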



* "It happens, there are un-mappable forms . . . , and these setbacks, disappointing at first, are actually the sign of a method still in touch with reality: geography is a useful tool, yes, but does not explain everything. For that, we have astrology and 'Theory'" (Graphs, Maps, Trees).


