Kay, C., ‘Issues for Historical Corpora: first catch your word’. Paper delivered to AHRC Methods Network Expert Seminar: Linguistics. University of Lancaster, 8 September 2005.
The Historical Thesaurus of English (HTE) is a semantic index to the Oxford English Dictionary (OED), supplemented by Old English materials published separately in A Thesaurus of Old English (TOE). Word senses are organised in a hierarchy of categories and subcategories, with up to fourteen levels of delicacy. The material is held in a database, and first steps towards internet publication are being taken by an AHRC-ICT Strategy Project which is creating searches for use in a range of humanities disciplines. The main problem besetting the searching of historical texts is variable spelling – the further one goes back in time, the worse it gets. Similar difficulties affect texts in non-standard varieties, as experience of the Scottish Corpus of Texts and Speech (SCOTS) and the Dictionary of the Scots Language (DSL) demonstrates. Dictionary headwords lemmatize common variants but are by no means comprehensive; an alternative may be a rule-based system which predicts possible spellings. Corpora raise a further difficulty, in that lemmatization may not resolve homonymy and polysemy. The paper will suggest ways of addressing these problems using the resources described above.
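
As an illustration of what a rule-based system which predicts possibilities might look like, the following is a minimal sketch only, not the method used by the HTE, SCOTS or DSL projects. It assumes a small, hypothetical set of letter interchanges (i/y and u/v) and expands a modern search form into every spelling those interchanges allow. The names `INTERCHANGES` and `predict_variants` are invented for this example, and a realistic rule set would need to be far larger and sensitive to period and position in the word.

```python
from itertools import product
from typing import Set

# Hypothetical interchange groups: letters assumed to alternate freely in
# historical spellings. Illustrative only; not drawn from any of the
# resources named in the abstract.
INTERCHANGES = [
    {"i", "y"},
    {"u", "v"},
]


def predict_variants(word: str) -> Set[str]:
    """Return every spelling reachable by swapping letters within a group."""
    options = []
    for ch in word.lower():
        alternatives = {ch}
        for group in INTERCHANGES:
            if ch in group:
                # This letter may be replaced by any member of its group.
                alternatives |= group
        options.append(sorted(alternatives))
    # Combine the per-letter alternatives into whole-word candidates.
    return {"".join(combo) for combo in product(*options)}


if __name__ == "__main__":
    for candidate in sorted(predict_variants("love")):
        print(candidate)
```

Each predicted form could then be used as a search term against a corpus or a list of dictionary headwords. Because every interchangeable letter multiplies the number of candidates, a practical system would probably need to filter predictions against attested forms.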

