Fun with Topic Modeling

Found poetry?

“kiss dark torn sea crenellations gods walk disappointment eluded colours”

The ten words above represent a “topic” generated by the MALLET Topic Modeling Tool from a 546-word test file that I put in called “Poem ideas.txt.” The tool generates topics by searching for significant clusters of words (you can find out how that works in more detail here and here) in the texts you put in. Then it makes spreadsheets that show you the topics and how relevant they are to each “doc,” a chunk of text that can vary in size from a few sentences in one file to the entire contents of multiple files.

Whizzbang! you might think, but what good is it? And if you’re a) a careful poet, b) an anti-poet, or c) not a fan of abstract art, you might add, It’s just ten random words with loose associations.

“A hit, a very palpable hit!” The TMT is not meant for small inputs; it is more suited to dealing with inputs like all of gothic fiction, or ten thousand emails, or issues of National Geographic from 1960-2014. Nor does the quoted string of words say anything particularly new or particularly well. The answer to the question, “What good is this ‘found poetry’?” is that it is no good at all, either from the statistician’s perspective, or from a conservative poet’s perspective, like my own.

It is, however, a lovely surprise. In the last place I thought to look for the ingredients of an epic poem, I found conflict in relationship (kiss, torn, dark, disappointment), characters larger-than-life (gods, walk, eluded), and environment (crenellations, dark, sea, colours). The funny thing is that this cluster doesn’t really represent what it appears to represent: a coherent story. The text it models is a disconnected collection of lines I thought up. I’m not even sure how crenellations made the cut – I don’t believe I’ve used that word more than once in the whole of my poetic endeavors, let alone the single input file.

If you find yourself nodding at the connections I drew between the words of the cluster, you can probably imagine the program’s usefulness as a heuristic: not just for literary critical argument – digital humanities is all over that – but for creative writing. TMT has adjustable features, like the number of topics you want to display, the topic proportion threshold, and words you want the program to ignore (articles, prepositions and proper nouns, maybe). Here are some of the other topics I’ve generated while playing with settings, file types and larger sets of files:

  • sky mark crash deafening drunk cat curled hoofbeats hounds trumpets
  • encyclopedia trevisa medieval principles greek roman memory bowker type detail
  • settings vk mc li styleswitheffects zx nk gg jvm fj
  • body somme work hondes qualitees touchinge goode liknes ordre litil
  • fear kingdom eliot print stars norton fairy harder context made
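
For the curious, here is a toy sketch in Python of what two of those settings do, as I understand them: the ignore-list drops words before any clustering happens, and the topic proportion threshold hides topics that make up too small a share of a doc. This is not the tool's own code. The word list, the function names and the proportions are all made up for illustration.

```python
# A toy illustration of two settings, NOT the Topic Modeling Tool's code.
# Every name and number below is an assumption made up for this example.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to"}  # assumed ignore-list


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split on whitespace, and drop ignored words."""
    return [word for word in text.lower().split() if word not in stopwords]


def topics_above_threshold(proportions, threshold):
    """Keep only the topics whose share of a doc meets the threshold."""
    return {topic: p for topic, p in proportions.items() if p >= threshold}


tokens = remove_stopwords("The gods walk on the dark sea")

# Hypothetical topic proportions for a single doc.
doc_topics = {0: 0.62, 1: 0.30, 2: 0.08}
kept = topics_above_threshold(doc_topics, threshold=0.10)

print(tokens)
print(kept)
```

With the threshold at 0.10, the third topic drops out of the doc's row. Lower the threshold and it comes back, which is roughly the kind of fiddling that produced the lists above.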

For a while now my creativity has needed a kick in the arse, and I’m tickled pink that it was a thing so heavily invested in numbers that did it. Honestly, I shouldn’t be surprised; numbers have been kicking my butt since I met them, but hey. This is cool.

3 thoughts on “Fun with Topic Modeling”

  1. Heidi says:

    Crenellations is a good word.

    I’m curious what your thoughts are on the whole digital humanities effort. A friend of mine did a lot of work in digital humanities at U of C, and she always had mixed feelings about it. I don’t know much about it beyond what I learned vicariously through her, but we had a lot of conversations about that kind of an approach to literature.

  2. jemzlinde says:

    I think the strength of the DH approach is that it forces critics to think hard about their methods. For example, Franco Moretti points out that many of the sweeping statements critics make (like, “the gothic novel developed in such and such way from x year to y year”) are based on a tiny percentage of the pertinent books. It’s not humanly possible for one person to read all novels that can be defined as gothic. So if you want to make an argument about developments in the genre without using computers, you’re choosing an awfully small sample size to base an argument on. Risky business. Computer reading comes with its own risks – not all novels are digitized, character recognition (for transforming images into text files) is never perfect, etc – but to me seems more credible than arguments based on tiny sample sizes. Not sure I agree with Moretti’s further conclusion that we should start close reading different units of literature than ‘texts’…but it’s an interesting thought.

    DH bores me, on the other hand, when it claims to usher in a new age of criticism or transcend a hermeneutic understanding of literature – I mean the traditional approach in which you analyze things to find out what they’re saying and how they say it so well. I’ve never read an article that really abandons the Aha! moment of “This has been going on all along! I Get It.” In my opinion DH is only unique in that the Aha! discovery is often about method first, and text second.

    I’m glad you asked – I realize my understanding of the theory and practice is a lot fuzzier than I knew. What are your thoughts?

    • Heidi says:

      I’ve never thought about it that way, in terms of talking about larger-scale development and so on. I think that’s a good point.

      My fear is that DH will fundamentally change reading practices. There’s something magical about the reading experience, and I don’t want it to become something mechanized and statistical, where readers rely on a computer to point out themes and patterns. Books have a human message, and readers should be able to recognize their humanity without the aid of a machine. Students need to be taught to read well. I don’t want DH to dehumanize the “humanities” part of itself.

      I tend to be too sentimental about these sorts of things, so perhaps I’m overly critical of DH. I don’t know a lot about the DH approach as a whole, or how DH intersects with more traditional hermeneutic methods.
