The very first electronic corpus, the BROWN corpus of written American English, contained 500 text samples of 2,000+ words each, and its size and composition set a standard for the compilation of new corpora for many years. For a long time after the publication of the corpus in 1964, many people thought that 1 million words of text represented a general enough sample to provide sufficient information about the make-up of the English language. Over the years, however, researchers have realised that even a corpus of this size and stratification may not be sufficient for performing certain tasks efficiently, such as research in lexicography or on collocations.
For corpora of older forms of language, such as Old or Middle English, natural limitations exist: only a limited number of texts is available, and these can obviously exist only in written form. For modern corpora, however, and especially since the advent of recording technology, the choice of materials has become much more difficult. Here, we need to consider how we want to obtain our data in the first place. Do we still want only written language, as in the first-generation corpora, or do we want to include spoken language in transcribed form, too? Or do we want a fully fledged spoken corpus that will never be published in any general written form?
The first 1-million-word corpora consisted of samples from the following 15 written genres.
| Label | Text Category/Genre |
|---|---|
| A | Press: Reportage |
| B | Press: Editorial |
| C | Press: Reviews |
| D | Religion |
| E | Skills, Trades & Hobbies |
| F | Popular Lore |
| G | Belles Lettres, Biography, Essays |
| H | Miscellaneous: Government Documents, Foundation Reports, Industry Reports, College Catalogue, Industry House Organ |
| J | Learned & Scientific Writing |
| K | General Fiction |
| L | Mystery & Detective Fiction |
| M | Science Fiction |
| N | Adventure & Western Fiction |
| P | Romance & Love Story |
| R | Humour |
The BNC, as an example of a modern mega-corpus, attempts to strike at least some kind of balance between spoken and written materials, although the percentage of spoken materials (10%) is still rather low, which possibly exemplifies an unfortunate continuing dominance of written language in linguistics. On the other hand, keeping the amount of spoken data in the BNC relatively low was perhaps not such a bad idea, since transcribing spoken language is an expensive and time-consuming business, and one where corpus compilers often take too many ‘shortcuts’. In the case of the BNC, this can unfortunately be seen rather clearly in the quality of some of the transcriptions, where, for example, many apostrophes have ended up in the wrong places or have gone missing, so that some plural markers are turned into a genitive ‘s’, or the contraction ‘we’re’ is rendered as ‘were’ at least six times within a single dialogue (<bncDoc id=D96>).
Within the written section of the BNC, there is a 75% : 25% balance between ‘informative’ and ‘imaginative’ prose; the written section also includes a certain amount of ‘written-to-be-spoken’ materials, i.e. speeches, plays, etc. The spoken part consists of a ‘context-governed’ section, sampled from public recordings, etc., and a ‘demographic’ one, consisting of recordings made by private individuals who each carried a tape recorder with them for a period of two days.
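To make these proportions more concrete, the following minimal sketch converts them into approximate word counts. It assumes a nominal total of exactly 100 million words and applies the 10% spoken share and the 75% : 25% split as round figures; the function name `nominal_breakdown` and its parameters are purely illustrative, and the real corpus will deviate from these idealised numbers.

```python
def nominal_breakdown(total_words, spoken_pct=10, informative_pct=75):
    """Split a nominal corpus size into its major sections.

    All percentages are the round figures quoted in the text and are
    treated as exact here, which is an assumption for illustration only.
    """
    spoken = total_words * spoken_pct // 100
    written = total_words - spoken
    # The informative/imaginative split applies to the written section only.
    informative = written * informative_pct // 100
    imaginative = written - informative
    return {
        "spoken": spoken,
        "written": written,
        "written_informative": informative,
        "written_imaginative": imaginative,
    }


breakdown = nominal_breakdown(100_000_000)
for section, words in breakdown.items():
    print(f"{section:20} {words:>12,}")
```

Under these assumptions, the imaginative prose alone would still amount to more than twenty times the size of an entire first-generation corpus, while the spoken section as a whole would be ten times that size.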
The size a corpus ought to or may have depends on a few different factors. First of all, as noted above, there may be limitations in terms of the amount of material that is available, in which case it may be necessary to be content with whatever data one can obtain. Apart from such ‘natural limitations’, as with corpora of older variants of a language, some limitations may be imposed by funding. This often raises the issue of quantity vs. quality, where there seems to be an unfortunate tendency to favour the former at the cost of the latter, especially for the larger corpora of 100 million words, such as the BNC or the ANC. On the other hand, if a corpus is too small, it may not be very useful for general-purpose research, because the amount of data needed to conduct research into e.g. collocations apparently increases exponentially (cf. Ooi, 1998: pp. 55-56) with the length of the n-gram to be collocated. The more domain-specific the research interest is, the smaller the corpus can be, because in such cases it is often only necessary to extract specialised vocabulary or constructions from it.
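The effect of n-gram length on data requirements can be illustrated with a small sketch. It is not a reproduction of Ooi's calculation; it merely assumes that the space of possible n-grams grows as the vocabulary size raised to the power n, and uses an invented toy sentence to show how the share of n-grams that are actually attested more than once shrinks as n increases. All function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def possible_ngrams(vocabulary_size, n):
    """Number of theoretically possible n-grams over a vocabulary."""
    return vocabulary_size ** n


def repeated_share(tokens, n):
    """Share of distinct n-grams that occur more than once."""
    counts = Counter(ngrams(tokens, n))
    if not counts:
        return 0.0
    repeated = sum(1 for count in counts.values() if count > 1)
    return repeated / len(counts)


# Invented toy data; a real study would use a tokenised corpus instead.
toy_text = "the cat sat on the mat and the cat sat on the rug".split()
vocabulary_size = len(set(toy_text))

for n in (1, 2, 3, 4):
    print(
        f"n={n}: possible={possible_ngrams(vocabulary_size, n)}, "
        f"repeated share={repeated_share(toy_text, n):.2f}"
    )
```

Even in this tiny example, the number of possible combinations multiplies with each additional word, whereas the proportion of sequences that recur falls steadily, which is why reliable statistics on longer sequences call for much larger corpora.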
A further issue in the compilation of corpora is that of balance or representativeness. In principle, this only applies to corpora for general use, as it is these that ought to provide an equal amount of material from many different genres or areas of interest. Obviously, this is an aim that is very hard, if not impossible, to achieve...
Biber, D., Conrad, S. & Reppen, R. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: CUP.
Kennedy, G. 1998. An Introduction to Corpus Linguistics. London: Longman.
Ooi, V. 1998. Computer Corpus Lexicography. Edinburgh: EUP.