Anderson, W., ‘Littles mak mickles: issues in building a general corpus for Scotland’. Paper delivered to the St Andrews Institute for Language and Linguistic Studies, University of St Andrews, 8 March 2005.

The Scottish Corpus of Texts and Speech (SCOTS) Project has set itself the goal of compiling a 4-million-word corpus of texts in the languages of Scotland, focusing in the first instance on varieties of Scots and Scottish English. It samples both written and spoken language, the latter being made available as orthographic transcriptions synchronised with the source audio or video file. Versions of SCOTS have been accessible on the Web since November 2004, and regular additions are made to the Corpus as texts are processed and functionality is improved.

This paper will consider the theoretical and practical issues involved in building a publicly available general corpus such as SCOTS. These include non-standard written language and spelling variation, the representation of language varieties and text types, web accessibility, and the provision of tools for linguistic analysis. Once these hurdles have been overcome, the SCOTS Corpus will be a significant resource for linguistic and cultural study, and a valuable record of the present-day linguistic situation in Scotland.