Beavan, D., Kay, C. and Anderson, J., ‘The Scottish Corpus of Texts and Speech’. Paper delivered at Sociolinguistics Symposium 15, University of Newcastle, 1-4 April 2004

The Scottish Corpus of Texts and Speech (SCOTS) is a research project at the University of Glasgow, with EPSRC start-up funding. It is the first large-scale project of its kind in Scotland. Our aim is to give a full picture of the complex linguistic situation which exists in Scotland today, with a primary focus on Scottish English and Scots. A second phase is planned to include the other languages of Scotland: Gaelic and non-indigenous community languages such as Punjabi, Urdu and Chinese. The corpus will be freely available on the Web for research and teaching.

SCOTS is a synchronic corpus, a 'snapshot' of the languages used in Scotland in recent years, and a monitor corpus which will be continually updated. It includes written texts, sound recordings and video recordings, together with transcriptions of the speech in both kinds of recording. It also contains extensive sociolinguistic metadata. We are currently processing readily available materials; we will then identify the gaps and find or commission new materials to fill them. We intend that the corpus will contain upwards of 4,000,000 words, at least 20% of them spoken. This will form a valuable research tool in its own right, and will offer a structure flexible enough to expand and accommodate future sub-corpora.

There is an acknowledged need for a Scottish corpus. Major British corpora, such as the British National Corpus (BNC) and the Bank of English (BE), contain Scottish material but did not collect it comprehensively. The International Corpus of English (ICE) could not include Scotland in its collection of major varieties of World English. The SCOTS project is thus filling a gap in research materials on linguistic variation. Information is lacking on how extensively and in what contexts Scots is still used, and on the range of features characterising Scottish English, questions which the SCOTS corpus will help to answer.

The practical issues which have to be dealt with include copyright (permission is needed from every author and participant) and data protection. For every item in the corpus we collect comprehensive metadata about the text and the author/speaker. The metadata records demographic, geographical and social information about the text/recording and author/contributor/performer, and gives the copyright holder the option to restrict the use of information. Technical issues include accessibility, decisions on the presentation of texts, design of the search facilities, whether to include concordances of the texts, and methods of archiving and preserving digital copies, digital originals and paper originals.
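
As a purely illustrative sketch of the kind of metadata record described above, the following Python fragment models one corpus item with its contributors and lets a copyright holder withhold selected fields. It is not the project's actual schema: every class, field and method name here (Contributor, CorpusItem, restricted_fields, public_view) is a hypothetical one chosen for the example.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Set


@dataclass
class Contributor:
    """Hypothetical author/speaker/performer record (illustrative fields only)."""
    role: str                            # e.g. "author", "speaker", "performer"
    birth_decade: Optional[int] = None   # demographic information
    birthplace: Optional[str] = None     # geographical information
    occupation: Optional[str] = None     # social information


@dataclass
class CorpusItem:
    """Hypothetical metadata record for one text or recording."""
    title: str
    medium: str                          # "written", "audio" or "video"
    permission_granted: bool             # has copyright clearance been obtained?
    contributors: List[Contributor] = field(default_factory=list)
    restricted_fields: Set[str] = field(default_factory=set)

    def public_view(self) -> dict:
        """Return the metadata that may be shown, omitting restricted fields."""
        if not self.permission_granted:
            raise PermissionError("no permission recorded for this item")
        people = []
        for person in self.contributors:
            info = asdict(person)
            # Drop any contributor field the copyright holder has restricted.
            people.append({key: value for key, value in info.items()
                           if key not in self.restricted_fields})
        return {"title": self.title, "medium": self.medium, "contributors": people}
```

Keeping the restriction as a set of field names on the item, rather than deleting the information, means the full record can still be archived while only the permitted part is displayed.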

Planning the corpus has raised many questions. We are aware of the need for a corpus to be well-balanced and representative; most well-known corpora are created from predetermined samples to try to ensure this. The selection criteria for the BNC, for instance, are domain, time, and medium, and target proportions were defined for each of these criteria. We cannot follow this model, because the data needed to answer several of the questions implicit in setting such selection criteria do not yet exist. The SCOTS project will provide data which will help to establish what is a representative Scots or Scottish English text and how to classify intermediate varieties along the Scottish linguistic continuum. We will learn the contexts in which Scots and Scottish English are used when we have collected sufficient material to analyse the genres represented. At that point we will confront issues of balance in genre, proportion of written/spoken, extracts vs. whole texts, etc. Only thus can we ensure that the corpus accurately reflects the complex linguistic situation in present-day Scotland.

Web references