Patrick - Topic Modeling

Page history last edited by Patrick Mooney 9 years, 5 months ago

 

"The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents." — H.P. Lovecraft, "The Call of Cthulhu" (1926)

 

I installed MALLET on my computer (this is surprisingly non-intuitive under Linux, and was for me an excellent example of prior knowledge being obtrusive in learning a new program). After getting everything installed and working and running through the Programming Historian tutorial, I selected a group of texts to play with, something I had lying around on my hard drive: 10 stories by H.P. Lovecraft, extracted from this collection on Project Gutenberg. Of the nearly seventy stories included, I picked two that I know well ("Facts Concerning the Late Arthur Jermyn and His Family" and The Shadow Over Innsmouth; I've taught them to undergrads and they appear on the reading list for the first chapter of my dissertation) and eight more chosen because (a) I'm more or less familiar with them, and (b) they more or less cover the major branches of Lovecraft's writings that his fans tend to be familiar with. (Incidentally, these texts are selected from the same corpus that my automatically generated experimental blog The Worst of Bad Lovecraft draws from; the blog is currently experiencing technical problems that I haven't had time to solve, alas.) I saved each story to a separate text file in a dedicated directory, then imported them into MALLET with

 

bin/mallet import-dir --input lovecraft/ --output lovecraft.mallet --keep-sequence --remove-stopwords

 

and then started playing with the .mallet file in various ways, including looking for an optimal number of topics. I initially tried 20, as in the tutorial, but the topics didn't seem to be a coherent set: I couldn't easily characterize any of them except by noticing that many were clearly centered around a particular story (the "Arthur Jermyn" cluster, the "Innsmouth" cluster, the "Herbert West" cluster ...). So I tried larger and smaller numbers, and the sweet spot seems to be about 25 in this case: there are still identifiable clusters around most of the stories, but there are also more general clusters. Running it with a small number of topics seems (to me) to produce uninterpretable results that say little more than "H.P. Lovecraft wrote some creepy things." (Well, we knew that already. He was a creepy guy.) But running it with a larger number tends to produce equally uninterpretable results. Or maybe it's just that I'm not yet sufficiently conversant with interpreting the results of MALLET runs and/or tweaking the parameters (I haven't yet played with the "hlda" command that the tutorial mentions briefly). But at about 25 topics over the corpus of ten stories, themes seem to emerge in at least some clusters. So I exported a dataset using

 

bin/mallet train-topics --input lovecraft.mallet --num-topics 25 --optimize-interval 1000 --output-state lovecraft-state.gz --output-topic-keys lovecraft_keys.csv --output-doc-topics lovecraft_composition.csv --word-topic-counts-file lovecraft_word_topic_counts.txt

 

which gave some interesting results. I've zipped up the relevant files into, well, a .zip file: lovecraft.zip

 

I also visualized the results as word clouds, as Jockers did, by using Lexos (hence the --word-topic-counts-file switch), as described in this blog entry by Scott Kleinman. Here are the clouds:

 

These clusters actually say a lot, and a number of things jump out at me here.

 

One is that some topics clearly still cluster around individual stories. For instance:

 

  • Topic 0 is clearly a topic describing The Shadow Over Innsmouth, and this is notable even aside from the fact that the most prominent word in the cloud appears in the title. Several other prominent words, for instance, are lower-cased proper names that appear repeatedly in that story: "eliot," "dagon," "zadok," "walakea." Also notable is the heavy presence of New England dialect spellings, which are prominent in that story but rare in most of the other texts being analyzed: "aout," "ye," "ud" (for "would"), "agin," "sech," "feller," "jest" (not a joke in this case, but rural Massachusetts for "just"), "taown," "seed" (an irregular past participle here). Topic 14 is also an Innsmouth topic, for the same reasons ("marsh" here is a last name, not a geographical description). In fact, looking at the lovecraft_composition.csv file shows that these two topics together account for almost 30% of the story. Topic 19 is almost as prominent as 14, comprising just over another 7% of the story and having similar thematic concerns, though more diffusely so. But topic 19 is also a prominent component of two other stories and will be discussed in more detail later.
  • Topic 2 is clearly the "Arthur Jermyn" cluster, describing a story particularly concerned with ancestry and having a very troubling relationship to race. Notable here are the ancestral terms, since this story (note the titular end "... and his family") is concerned with descent and biological race. Also notable are the names of Jermyns — here's a family tree I threw together for a slide show when I was lecturing on it:

    In this case, examining the lovecraft_composition.csv file shows that this cluster alone accounts for 39.8% of the story; the next largest topic is topic 24, which accounts for only about 8.4% of the story. (Note the prominent presence of the word "black" in that cluster, as well as prominent genealogical words ("child," "children") and words the story uses to describe physiognomy ("face," "head," "parts," "arms," "appeared," "ghoulish," "slope").)

  • To avoid belaboring the point, I'll just point out quickly that, even without going through the data, and just from looking at the word clouds, it's apparent that topic 6 is a cluster centering around "The Dunwich Horror" (names of the Whateleys, and the family name as the most prominent word in the cluster; again, the emergence of dialect renderings); topic 8 is the "Erich Zann" cluster (the name, and musical terminology); topic 9 is the "Herbert West" cluster (again, the name; also, words associated with the Frankenstein theme of the story: technical terminology from anatomy and the concern with the educational setting); topic 13 is the "Call of Cthulhu" cluster (multiple names are the indicators this time, since "Cthulhu" occurs a fair amount across H.P. Lovecraft's stories; the settings; the thematic and symbolic elements: "museum," "hieroglyphics," "idol"); topic 18 is the "Cats of Ulthar" cluster ("cat" and "cats" are prominent, a dead giveaway here, since cats are not a large thematic concern of Lovecraft's; the setting: "remote" "cottage"; the main characters: "wife" and "wanderers").

 

I admit that until I looked at the data, it was not immediately apparent to me that topic 17 was a cluster centering around "The Lurking Fear," but on second glance it clearly is: "thunder," "lightning," and "tempest" are major thematic concerns and plot-event motivators there, and the words "lurking" and "fear" occur prominently; there's also a group of underground-related words ("digging," "mounds," "underground," "tunnel"), and underground is where the terrifying creatures live in that story. Similarly, I missed the fact that topic 21 is clearly the "Polaris" cluster, despite the clearly astronomical cast of the topic: "horizon," "overhead," "quarter," "north," "plateau," "peaks," "pole." (But in my defense, I haven't read "Polaris" in a long time, possibly since high school, and I only picked it out to have a tenth text.)

 

Looking at the data clarifies a number of things. Here's a slightly cleaned-up table with each story and its two most prominent topics, with their percentages rounded off to three significant digits:

 

| Story title | Most common topic | Most common topic frequency | 2nd topic | 2nd topic frequency |
|---|---|---|---|---|
| "Facts Concerning The Late Arthur Jermyn And His Family" | 2 | 39.8% | 24 | 8.39% |
| "The Dunwich Horror" | 6 | 21.9% | 20 | 9.59% |
| "The Doom That Came To Sarnath" | 7 | 56.9% | 19 | 10.3% |
| "The Music Of Erich Zann" | 8 | 35.1% | 24 | 14.0% |
| "Herbert West, Reanimator" | 9 | 28.5% | 15 | 10.5% |
| "The Call Of Cthulhu" | 13 | 23.3% | 19 | 11.4% |
| "The Shadow Over Innsmouth" | 14 | 15.8% | 0 | 13.6% |
| "The Lurking Fear" | 17 | 25.2% | 5 | 12.8% |
| "The Cats Of Ulthar" | 18 | 50.9% | 24 | 12.8% |
| "Polaris" | 21 | 43.8% | 5 | 9.28% |

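A table like the one above can be pulled out of the doc-topics file with a few lines of Python. This is only a minimal sketch, and it assumes a particular layout: tab-separated lines with a document index, a document name, and then one proportion per topic in topic order. The layout differs between MALLET versions (some write sorted topic/proportion pairs instead), so check it against the actual lovecraft_composition.csv before trusting it. The function name `top_topics` and the sample rows are made up for illustration.

```python
def top_topics(lines, n=2):
    """Return {doc_name: [(topic, proportion), ...]} with the n largest topics.

    ASSUMPTION: each data line is tab-separated as
        doc_index <TAB> doc_name <TAB> p_topic0 <TAB> p_topic1 ...
    Lines starting with '#' are treated as a header and skipped.
    """
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name = fields[1]
        proportions = [float(p) for p in fields[2:]]
        # Pair each proportion with its topic number, largest proportion first.
        ranked = sorted(enumerate(proportions), key=lambda pair: pair[1], reverse=True)
        result[name] = ranked[:n]
    return result


# Hypothetical three-topic sample, NOT taken from the real composition file.
sample = [
    "#doc\tname\ttopic proportions",
    "0\tfile:/lovecraft/a.txt\t0.10\t0.70\t0.20",
    "1\tfile:/lovecraft/b.txt\t0.55\t0.05\t0.40",
]

for name, pairs in top_topics(sample).items():
    print(name, pairs)
```

With a real file, passing `open("lovecraft_composition.csv")` in place of `sample` should work as long as the layout assumption holds.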
 

I take this data to have several implications:

  • There aren't many thematic elements contributing prominently to the individual stories. For half of the stories on this list, the top two topics alone account for 48% or more of the stories in question; for three stories, the top two topics account for more than half of the stories' content.
  • In all but one story (Innsmouth), the top two topics account for at least 30% of the story's content, and Innsmouth comes close at about 29.4%; Innsmouth is also the longest story, at nearly 150K of plain text, and one might reasonably expect a longer story to contain more prominent topics than a shorter one (i.e., the longer it is, the more it has to contain to keep reader interest and to produce a horrifying effect).
  • In two stories ("The Cats of Ulthar" and "The Doom That Came to Sarnath"), the top two topics account for over 60% of the story's content (63.71% and 67.25%, respectively). These are both short stories (7.2K and 14.6K of plain text, respectively; the first and third shortest stories in the group under analysis).
  • In fact, there's a pretty strong negative statistical correlation between (a) the length of the story in question, and (b) how prominently its top two topics are represented. Here's a table:

 

| Title | Prominence of first two topics | File size (KB) |
|---|---|---|
| "The Shadow Over Innsmouth" | 29.463% | 147.8 |
| "The Dunwich Horror" | 31.453% | 98.2 |
| "The Call Of Cthulhu" | 34.732% | 67.8 |
| "The Lurking Fear" | 37.986% | 42.3 |
| "Herbert West, Reanimator" | 39.040% | 69.6 |
| "Facts Concerning The Late Arthur Jermyn And His Family" | 48.200% | 20.9 |
| "The Music Of Erich Zann" | 49.101% | 18.9 |
| "Polaris" | 53.069% | 8.1 |
| "The Cats Of Ulthar" | 63.710% | 7.2 |
| "The Doom That Came To Sarnath" | 67.248% | 14.6 |

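To put a number on "pretty strong," the figures in the table can be run through a Pearson correlation. The sketch below is plain Python with the ten rows copied from the table; Pearson's r is my choice of measure here (the post doesn't say which statistic, if any, was originally used), and with only ten data points it should be read as a rough indication rather than a firm result.

```python
from math import sqrt

# (file size in KB, combined share of the top two topics in %), from the table above
rows = [
    (147.8, 29.463),  # The Shadow Over Innsmouth
    (98.2, 31.453),   # The Dunwich Horror
    (67.8, 34.732),   # The Call Of Cthulhu
    (42.3, 37.986),   # The Lurking Fear
    (69.6, 39.040),   # Herbert West, Reanimator
    (20.9, 48.200),   # Arthur Jermyn
    (18.9, 49.101),   # The Music Of Erich Zann
    (8.1, 53.069),    # Polaris
    (7.2, 63.710),    # The Cats Of Ulthar
    (14.6, 67.248),   # The Doom That Came To Sarnath
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


sizes = [size for size, _ in rows]
shares = [share for _, share in rows]
r = pearson(sizes, shares)
print(round(r, 3))
```

For these ten rows the coefficient comes out at roughly -0.83, which fits the description of a fairly strong negative relationship between story length and top-two-topic prominence.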
 

And here's a scatter plot:

 

  • Lovecraft's thematic concerns actually vary quite a bit from story to story, at least insofar as MALLET is able to determine them. Only topics 5, 19, and 24 are repeated in the table indicating the top two themes, above. (But these three topics appear in seven stories, leaving only 3 that have no topical overlap with the others in the top two topics for each story.) Horror writers are often represented by detractors as writing the same damn thing over and over, but there seems to be a lot of variation here. (This is admittedly a rather simplistic interpretation, and should really be supported by further analysis of what's actually in each individual topic.)
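The overlap count is easy to double-check from the top-two table. This small sketch hard-codes the topic numbers from that table (story titles shortened for readability) and counts which topics show up for more than one story; it is a sanity check on the table, not a re-analysis of the MALLET output.

```python
from collections import Counter

# Top two topics per story, copied from the table earlier in the post.
top_two = {
    "Arthur Jermyn": (2, 24),
    "The Dunwich Horror": (6, 20),
    "The Doom That Came To Sarnath": (7, 19),
    "The Music Of Erich Zann": (8, 24),
    "Herbert West, Reanimator": (9, 15),
    "The Call Of Cthulhu": (13, 19),
    "The Shadow Over Innsmouth": (14, 0),
    "The Lurking Fear": (17, 5),
    "The Cats Of Ulthar": (18, 24),
    "Polaris": (21, 5),
}

# How many stories list each topic among their top two?
counts = Counter(topic for pair in top_two.values() for topic in pair)
repeated = sorted(topic for topic, count in counts.items() if count > 1)

# Stories that share at least one top-two topic with another story.
sharing = [story for story, pair in top_two.items()
           if any(topic in repeated for topic in pair)]

print(repeated)                                    # [5, 19, 24]
print(len(sharing), len(top_two) - len(sharing))   # 7 3
```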

 

Looking at the repeated topics says something about the general thematic trend of the stories as a whole, I think:

  • Topic 5 contains the prominent words "time" and "night," as well as "day," plus a variety of nature-related words: "stream," "wind," "moon," "nature," "air," "light," "ground." (I exempt "plain" from this group because it would be too much work tonight to determine whether it's being used as a geographical feature or an adjective -- my hunch is that it's most often the latter. Similarly, it would be too much work to determine whether "rose" is the flower or the past tense of the verb "rise," though again, I suspect the latter.) Some of the remaining words fit into a broader general pattern of indicating "setting": "house," "mill," "region," "place." This topic has interesting placement in the ranked list of topics for each story: It occurs as the second most prominent topic twice, and usually in the top eight, but in one story ("The Call of Cthulhu"), it is the twelfth most common topic (but never lower, out of the 25 identified topics), and on average, it is ranked 5.5 out of the 25 topics identified. I take this to mean that setting is a moderately important background concern for the stories in question, and that the particular cluster of words that represents it shows Lovecraft's creative debt to Poe's American "Dark Romanticism." More, there are two other general types of words occurring in this thematic cluster that are highly suggestive: "eyes," "find," "notice," "visible" suggest that the concern is specifically with an active agent perceiving a setting, and these are modified by some adjectives and adverbs that suggest a mode in which the perception occurs: "distant," "wholly," "lines," "narrow," "absence," "fact" all suggest the detached, analytical view of nature implied by abstract, industrial Western scientific thinking. Of course, Lovecraft's overarching project becomes clear with the numerous emotional modifiers attached to this cluster: "horrible," "lone," "fear," "shot," "left" (behind, is my hunch), and "iron."
  • Topic 19 might be designated the "patriarchy" cluster: "great," "men," "city," "spoke," "heaven," "pillars," "art" and "artists," "worship," "manuscript," "day" and "time" and "year," "notes," "letters," and "words" all fit this pattern. There is also a concern with patrilineal descent: "elder," "young," "aged," "born," "mortal," "died," "native" -- and with the way that knowledge was transmitted: "spoke," "showed," "found," "ears," "clear." This is not an unqualified endorsement, though, but rather an uneasy ("vaguely," "half," the qualifier "whilst") anxiety about its displacement ("unknown," "silence," "strange," "aspect," "bizarre").
  • Topic 24 might be said to be the most obviously "Lovecraftian" cluster, indicating the particular plot-mechanical devices that Lovecraft helped to solidify into the canon of hoary horror tropes: "nearer," "door," "room," "sound" and "sounds," "fire," "missing," "dark" and "black," "lights," "call," "heard," and that which provokes terror because it is very "large" (a particularly prominent thematic concern for Lovecraft -- the back side of the Romantic sublime, in fact). And there are the keywords indicating the traditional horror setting and its problems: "amidst," "room," "slope," "nearer." The parts of the body mentioned are those that are both vulnerable and closely connected to identity and interpersonal connection, especially in the context of romantic and sexual love: "lips," "face," "head," "hand," "arms"; these provide a reading of a related word, the fragile human connection indicated by "touch" (precisely what is punished in much later high school slasher horror movies). Also prominent are words associated with the emotions supposed to be provoked by the horror genre: "terror," "wild," "dare"; the typical "manner" in which horror achieves its effects and by which it advances its plots: "suddenly," "finally," "strange," "stay" and "sat" (almost always mistakes), "met" (such a necessary plot device for the genre), "utter" (I suspect this is most often a verb in these texts), "sleep," "guided," "ascent" (Lovecraft is here the antecedent of the writer who sends the ditzy heroine up the stairs while she is being chased, though his protagonists are more likely to be male and seeking rather than fleeing). 
Too, there are the traditional larger-scale thematic concerns for the genre, the "what's really at stake" keywords: "exist," "truth," "memory" and "reason" (precisely what is threatened by Lovecraft's supernatural forces), "man" (in the double sense of the masculine taken as the default human), "perfect," "alive" (often itself a source of horror in Lovecraft's fiction).
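Statements like "ranked 5.5 on average" for topic 5 come from ranking the topics within each story by their proportion. A minimal sketch of that bookkeeping follows, assuming the per-story proportions have already been loaded into a list indexed by topic number; the helper names and the tiny three-topic data set are hypothetical and only show the mechanics.

```python
def topic_rank(proportions, topic):
    """1-based rank of `topic` when topics are sorted by descending proportion."""
    order = sorted(range(len(proportions)), key=lambda t: proportions[t], reverse=True)
    return order.index(topic) + 1


def mean_rank(docs, topic):
    """Average rank of `topic` across all documents in `docs`."""
    ranks = [topic_rank(props, topic) for props in docs.values()]
    return sum(ranks) / len(ranks)


# Hypothetical proportions for two made-up stories and three topics.
docs = {
    "story_a": [0.10, 0.60, 0.30],
    "story_b": [0.50, 0.20, 0.30],
}

print(topic_rank(docs["story_a"], 0))
print(mean_rank(docs, 2))
```

Ties are broken by topic number here (Python's sort is stable), which is an arbitrary choice; with real MALLET proportions exact ties are unlikely to matter.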

 

Only topics 1, 3, 4, 10, 11, 12, 16, 22, and 23 do not appear in the top two topics for any of the stories in question. These I take to indicate recurring background concerns and devices in the texts that are currently under the lens. To provide a quick sketch of potential readings here:

  • Topic 1 expresses a tension between traditionalist agricultural customs and modernization. ("iii," alas, is just a Roman numeral commonly used as a heading, though I wonder whether this topic tends to occur in the third section of various stories. Again, more attention needed here.)
  • Topic 3 is another knot of traditional horror settings, plot devices, and concerns.
  • Topic 4 is primarily concerned with genealogy, race, epistemology, and the ways that these concepts work themselves out in Lovecraft's stories. (Again, there's a structural debt to Poe indicated here: "the house of Usher" is both the decaying ancestral manse and the family as an institution.)
  • Topic 10 is concerned with epistemology as expressed in scientific-rational discourse, with the university as its exemplary setting, and with the consequences of the epiphanic realization of the limits of scientific epistemologies. (Again, "ii" is a Roman numeral heading.)
  • Topic 11 is much like topic 10, but critiques these concerns through juxtaposition with rural, uneducated people who have a rough and terrible wisdom passed down from generation to generation.
  • Topic 12 takes topic 10 and transposes it more directly onto what we might think of as a "science fiction" domain: the limits of human knowledge are critiqued not through homespun-though-terrible folk wisdom, as in topic 11, but rather by pushing the limits of scientific knowledge past what humans can understand. (Interestingly, never does more than one of topics 10-12 appear in the first nine topics for any of the stories under consideration.)
  • Topic 16 might plausibly be read biographically as the development ("increasingly") of an antifeminist train of thought: "woman," "home," "half," "voice," "ways" (might be said to) fit this first cluster, with a variety of negative judgments providing a justification for my "antifeminist" claim. Flogging the biographical horse a bit more, we might argue that this misogyny was motivated by an early life largely dominated by his mother and aunts, who raised him after his father and grandfather passed away; he spent much of his life a recluse in their home, influenced "evening," "morning," "general[ly]," "complete[ly]." "Boston" would then be justified in its inclusion in the cluster as the location at which he attended a journalists' convention several days after his mother's death, which greatly expanded his social circle and coincided with the beginning of a period of increased creative output that led up to what might be thought of as his "mature" writing phase. (But this feels to me the least cohesive reading here.)
  • Topic 22 might be taken to once again express a tension between the epistemological claims of traditional religion and modernizing scientific thought.
  • Topic 23 might plausibly be read to express anxiety produced in response to the crossing of (generally never questioned) boundaries, which I tend to see as a major characteristic of horror. More specifically, in this case, the boundaries in question are biological and generational boundaries. It appears prominently (i.e., top 5 topics) in only 3 of the stories under consideration: "Herbert West, Reanimator" (the eponymous character is a college Frankenstein); and "Arthur Jermyn" and "The Lurking Fear," both of which are concerned with human-ape hybridism.

 

And I think that's that for tonight. See you all tomorrow afternoon!
