Beavan, D., ‘Visualisation of textual data through collocate clouds’. Poster presented at Finding the Hidden Knowledge, University of Glasgow, 21-22 February 2008

The Corpus of Modern Scottish Writing (CMSW) is a project funded by the Arts and Humanities Research Council. Its goal is to collect written documents from Scotland, covering the years 1700-1945. Over this time the linguistic landscape in Scotland changed significantly, with English slowly becoming dominant over Scots.

One major issue with Scots is that of variant spelling, with authors often using very personal or dialectal spellings, e.g. home = hame = haim, or potato = tattie. While this makes the language very rich, it becomes problematic when attempting to browse, search or retrieve information from the resource. A number of techniques, such as letter substitution or pronunciation similarity, have led to partial solutions, but none has proved reliable. This problem occurs in many other domains, where different authors refer to the same entity by different means; perhaps the techniques used in text mining could offer a way forward.
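To illustrate why such techniques give only partial solutions, the following is a minimal sketch of a crude pronunciation-style key. It is an assumption made for illustration, not a method used by CMSW or SCOTS: the names skeleton_key and group_variants are hypothetical, and the rule (keep the first letter, drop later vowels, collapse repeated consonants) is deliberately simplistic.

```python
from collections import defaultdict

VOWELS = set("aeiouy")


def skeleton_key(word: str) -> str:
    """Hypothetical key: first letter plus later consonants, repeats collapsed."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        # Skip vowels and anything that is not a letter (e.g. apostrophes).
        if ch in VOWELS or not ch.isalpha():
            continue
        # Collapse a consonant that repeats the last character of the key.
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def group_variants(words):
    """Group words that share the same key."""
    groups = defaultdict(set)
    for w in words:
        groups[skeleton_key(w)].add(w.lower())
    return dict(groups)


groups = group_variants(["home", "hame", "haim", "potato", "tattie"])
```

Under this rule home, hame and haim fall into one group, but potato and tattie do not, which is the kind of gap that makes such approaches unreliable on their own.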

Browsing data and finding connections between words is an important step in language research. While working for CMSW’s prior project, the Scottish Corpus of Texts and Speech (SCOTS), I developed the following visualisation.

Given a particular term (the node word), the entire corpus is searched for all of its collocates (surrounding words) by retrieving the five words preceding and the five words following each occurrence of the node word. These are then aggregated and presented to the user in cloud form: the one hundred most frequent collocates are listed in alphabetical order, with font size indicating the frequency of each. In addition, collocational strength (the likelihood of two words co-occurring) is shown as brightness. Collocates which are large and bright are therefore found both frequently and principally around the node word. These clouds promote browsing of the resource, as each collocate can be clicked on to form the node word of a new cloud. This allows the user to explore the language used in the corpus, and how terms interact with each other. While primarily aimed at language research, this visualisation may be useful to the text mining community.
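The steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SCOTS implementation: the tokeniser, the function names, the point-size range and, in particular, the strength measure are all hypothetical. The poster does not name a measure of collocational strength, so the sketch uses the share of a collocate's corpus occurrences that fall near the node word as a simple stand-in.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed tokeniser: lower-case runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def collocate_cloud(tokens, node, span=5, top_n=100, min_pt=10, max_pt=40):
    """Return cloud entries for `node`, sorted alphabetically by collocate."""
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Up to `span` words before and `span` words after this occurrence.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        co_freq.update(window)

    top = co_freq.most_common(top_n)
    if not top:
        return []
    max_count = top[0][1]

    cloud = []
    for word, count in top:
        # Assumed strength: proportion of the collocate's occurrences found
        # near the node word, capped at 1.0 because windows can overlap.
        strength = min(1.0, count / word_freq[word])
        cloud.append({
            "word": word,
            "count": count,
            # Font size scales linearly with frequency.
            "font_pt": round(min_pt + (max_pt - min_pt) * count / max_count),
            # Brightness in [0, 1] reflects the assumed strength.
            "brightness": round(strength, 2),
        })
    return sorted(cloud, key=lambda entry: entry["word"])


tokens = tokenize("big hame wee hame big dog big cat")
cloud = collocate_cloud(tokens, "hame", span=1)
```

Each entry carries what a renderer would need: the collocate, a font size and a brightness. Clicking a collocate would amount to calling collocate_cloud again with that word as the new node.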