SCOTS - Phase 2 report

From final report submitted to Arts and Humanities Research Council, June 2007

Summary

SCOTS uses computer technology and the web to bring a unique electronic collection of Scots and Scottish English texts to scholars and the public. The resource contains 4 million words of written and spoken material, the latter with online audio/video clips and synchronised transcription. Genres range from Scottish parliamentary records to spontaneous mother-child speech from the North-East of Scotland. The search facility has been integrated with Google Maps so that users can easily search for documents by the place of birth or residence of the author or speaker. Users can also investigate where particular words and phrases are used, check their frequency, and see, via a concordancer, how they are used in context. Full text and multimedia files are freely available for download, and, where it is available, information is given about the authors and speakers. Thus it is possible to search, for example, for spoken documents featuring Ayrshire women born in the 1970s. The SCOTS corpus, then, makes available a wealth of contemporary writing and speech from Scotland, and offers researchers, educators and interested members of the public immediate access to a resource that provides many insights into the way people in Scotland use Scots and Scottish English today.

Achievements

The SCOTS project aimed to ‘provide an electronic corpus of two of the major contemporary languages of Scotland, Scots and Scottish English’. It has succeeded in doing so on the terms of the original proposal. By the end of the project, we planned to have 800 documents/4 million words processed, with 800,000 words of speech, recorded and transcribed. In the event, we have 1,177 documents/4,030,931 words processed, with 808,318 words of speech. The corpus thus contains more, and shorter, documents than anticipated, and both word-count targets have been slightly exceeded.

The SCOTS website (www.scottishcorpus.ac.uk) includes instructions on different levels of search tools available, including online word searches, concordancing, and frequency information. Integration with Google Maps allows easy searches by geographical location, and the Advanced Search allows searches to be restricted by one or more of 196 factors, such as whether the document is spoken/written or the participants’ place/decade of birth, residence and gender.

The spoken documents can be accessed as audio/video files with synchronised orthographic transcription. Both spoken and written documents can be accessed in full online, and, if desired, downloaded. Where such information is available, metadata on sociolinguistic variables pertaining to authors/speakers is also given.

Unlike some digital resources, the document contents of the corpus are plain text, with very rich metadata describing the document and attached authors and participants. Established formats, such as TEI, did not provide the scope for these data to be recorded without significant extension. The approach we adopted was to provide the corpus documents and metadata as XML using our own categories, which can be freely transformed by the user to whatever format they require.
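As a minimal sketch of the kind of transformation a user might apply, the Python below flattens one XML document into a simple record. The element and attribute names (document, metadata, mode, participant, text) are invented for illustration and are not the actual SCOTS categories; a real script would need to follow the categories in the downloaded files.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: these element and attribute names are assumptions
# made for this example and are not the actual SCOTS categories.
SAMPLE = """<document id="example-1">
  <metadata>
    <mode>spoken</mode>
    <participant gender="female" birth_decade="1970" birthplace="Ayrshire"/>
  </metadata>
  <text>Aye, it wis a braw day.</text>
</document>"""


def to_record(xml_string):
    """Flatten one XML document into a plain dictionary."""
    root = ET.fromstring(xml_string)
    # Each participant's attributes become one small dictionary.
    participants = [dict(p.attrib) for p in root.iter("participant")]
    return {
        "id": root.get("id"),
        "mode": root.findtext("metadata/mode"),
        "participants": participants,
        "text": (root.findtext("text") or "").strip(),
    }


record = to_record(SAMPLE)
print(record["mode"], record["participants"][0]["birthplace"])
print(record["text"])
```

Once the text and metadata are held in a neutral structure of this kind, they can be written out in whatever format a given tool expects.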

Users can build their own corpus from the SCOTS data, by performing an advanced search. They can then manually select files to download, or download in bulk. Texts can be downloaded as XML files or as plain text files, allowing easy use by additional tools such as WordSmith.
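For users who prefer to script their own analysis of downloaded plain text, the sketch below shows one simple way to count word frequencies and produce keyword-in-context lines. It is an illustration only: the sample sentence is invented, the tokeniser is deliberately crude (lower-cased ASCII letters and apostrophes), and it does not reproduce how the SCOTS site or WordSmith work internally.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text standing in for a downloaded plain-text file.
SAMPLE_TEXT = "The wee dog ran doon the brae. A wee lass followed the dog."
tokens = tokenize(SAMPLE_TEXT)
print(frequencies(tokens).most_common(3))
for line in concordance(tokens, "wee"):
    print(line)
```

In practice the sample string would be replaced by the contents of the downloaded files, and the tokeniser adapted to the spelling conventions of the texts in question.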

A particular highlight of the SCOTS corpus is the 20% of spoken data, ranging from lectures, formal and informal interviews to spontaneous parent-child interactions. This is a rich resource of high-quality data which promises to be of considerable interest to researchers into sociophonetics, conversation analysis, spoken grammar and child language acquisition, for decades to come.

This potential was evident to participants in the successful two-day international Symposium on Linguistic Variation and Electronic Projects, hosted by the SCOTS project at Glasgow University, in April 2006. This was a useful opportunity to network with cognate projects and to share insights on non-standard corpora from Finland, Italy and elsewhere in the UK. Through symposia, conference participation, press and radio coverage, and, of course, academic publication, the SCOTS corpus is becoming firmly established in scholarly circles and beyond. It will feature prominently in Anderson and Corbett’s planned volume on ‘Exploring English with Online Corpora’ (to be published by Palgrave Macmillan in 2008).

The SCOTS resource has excited even greater public interest than we had foreseen: the website currently receives on average 165,000 hits per month. For a research resource of Scottish English and Scots, the SCOTS corpus has been exceptionally well received abroad, and its applications have been appreciated particularly in the field of English Language Teaching. Corbett has given presentations on the corpus as far afield as South America (Chile and Brazil), and it has featured prominently in a magazine distributed widely in Brazil. It also merits a mention in O’Keeffe, McCarthy and Carter’s ‘From Corpus to Classroom’ (CUP), a guide aimed at ELT teachers and materials designers. Local interest has remained high; the SCOTS team has made presentations about the corpus at Glasgow’s West End Festival and in Perth, Aberdeen, Edinburgh and Belfast.

A further value of the SCOTS resource, particularly in the spoken section, is the cultural significance of some of the recordings. The spoken data is a mixture of oral history, members of the public discussing language attitudes, students discussing youth culture, and interviews, formal and informal, with literary figures such as the novelist Ian Rankin, the poet James McGonigal, and the children’s writer and translator into Scots Matthew Fitt.

SCOTS in its current form concentrates on Scots and Scottish English, but the long-term vision is eventually to widen the focus to all the languages of Scotland, including community languages (e.g. Punjabi, Cantonese, Polish). Some pilot texts in Gaelic have been included in the corpus. In the absence of up-to-date information about the situation for community languages in Scotland, a small-scale survey was carried out by Stuart-Smith in 2006. The main findings were: (1) an unexpected trend towards the maintenance of established community languages, in particular Punjabi; and (2) an increasing number of bilingual pupils in Scottish schools, with a far wider spread of languages since Scotland’s participation in the dispersal of refugees. Consultation with professionals working with community languages in translation, education and asylum seekers’ services revealed specific needs for the provision of open-access texts and spoken language, and thus strong support for the future inclusion of community language texts in SCOTS.

Within the Department of English Language at Glasgow University, the presence of the SCOTS corpus has provided a focus for our increasing strength in the study of language use in Scotland, and in the provision of large-scale electronic resources. The appointment of Dr Jennifer Smith in 2005 brought expertise in the dialects of NE Scotland, and in the acquisition of dialect by young children; she made available to the project some of her caregiver-child spoken data from Buckie. This has balanced, for example, the urban Scots focus of Dr Jane Stuart-Smith’s ongoing research into Glasgow, and now Central Belt, accents. The imminent completion of the long-term Historical Thesaurus project at Glasgow University, and its availability online, also promises much in the integration of electronic resources in future research and teaching.

Future Plans

In late 2006 we were awarded a Research Grant of £429,997 from the Arts and Humanities Research Council, which enabled us to begin work on the Corpus of Modern Scottish Writing in June 2007.

The project team comprises:
Professor John Corbett, Principal Investigator
Professor Jeremy Smith, Co-Investigator
Dr Wendy Anderson, Research Assistant
David Beavan, Computing Manager
Dorian Grieve, PhD student
Jean Anderson
Professor Christian Kay

Under this grant we will create a new resource, the Corpus of Modern Scottish Writing (CMSW), to complement the SCOTS Corpus. This will be a publicly available, digitised archive of texts in language varieties ranging from Broad Scots to Scottish Standard English which will provide the ‘missing link’ between the Helsinki Corpus of Older Scots and its related projects (1450-1700) and the SCOTS corpus (1945-present day). The content of CMSW will mainly be written texts, selected on the basis of genre and region. The project also aims to advance the implementation of automatic searching for spelling and dialectal variants.

Scottish Corpus Of Texts & Speech