Text Mining

Unlike some of the weeks of this course, text mining was something that we had actually done before. (For those interested, the results and process are here through several blog posts.) However, when we did that work in February, we (or at least I) were really new to text mining. Through Programming Historian, we were able to work our way through a version of text mining that gave us some sort of results. The problem was that, since we were new to the process, we 1.) missed a few technical steps and 2.) did not really get to read into some of the practicalities of text mining.

This week’s readings help to flesh out some of the missing pieces from the project and to explain the complications, such as the double meanings of words and how words can be entirely different to work with than numbers. Historians, by nature, are very comfortable working with words and discovering their underlying meanings through context. However, there can be complications, such as dealing with figurative language. Traditionally, we read within the context to understand what these things mean. With a large corpus of texts, it can be useful, and sometimes even insightful, to see what types of patterns exist.

One issue that I ran into with topic modeling is how to coax out the silences that historians are used to dealing with in historical texts. I am still unsure of how to represent those silences or predict where they would occur.

Further, a common theme in the readings for digital history is the need for legitimacy: to explain how these things work so that non-digital historians might understand what we did. Ted Underwood states that we need to understand the black box behind text mining, such as the algorithms. However, we discussed in class how monographs and other historical work tend to cut out the methodology now. Why is it important for digital historians to explain their work when traditional historians do not? I do feel that it is important to understand the concepts and ideas behind topic modeling, but I feel that the constant need for explanation and legitimacy could potentially limit the projects that could emerge from digital history.