BNC: the first decade

BNC-XML: an introduction
Lou Burnard

What is the BNC?
  a snapshot of British English, taken at the end of the 20th century
  100 million words in approx. 4000 different text samples, both spoken (10%) and written (90%)
  a synchronic (1990-4), sampled, general-purpose corpus
  available under licence; latest edition is BNC-XML (13 Mar 2007)

Production of the BNC

  managed by an academic-industrial consortium with significant government funding
  took three years (at least)
  cost GBP 1.6 million (at least)
  came about through an unusual coincidence of interests amongst:
    lexicographical publishers
    Government (DTI)
    Science and Engineering Research Council
  target audience: lexicographers, NLP researchers, but not language teachers!

Remember the Nineties?
  WinWord or WP5? The choice is yours

  On your desk ... a 386 with 50 MB of disk space (just about enough to run Windows 3)
  In your lab ... a VAX or a Sparc for serious work
  On the WWW (maybe) ... Mosaic for X
  Little text in digital format
  Text encoding (under development)
    TEI
    SGML

Corpus linguistics 90s-style

  a world without the web!
  corpus linguistics
    Traditionalists (ICAME)
    Expansionists (LDC, monitor corpora)
  text encoding theory
  language engineering and NLP
    the JFIT mentality

Project Goals
  Stated
    a synchronic (1990-4) corpus of samples
    both spoken and written
    from the full range of British English language production
    of non-opportunistic design, for generic applicability
    with word class annotation and contextual information
  Unstated
    better, more authoritative, learner dictionaries
    a new template for European language resources
    a REALLY BIG corpus

The BNC sausage machine
  Written (OUP/Chambers), Spoken (Longman)
  Selection, clearance, and capture
  Enrichment and encoding

  Initial CDIF conversion and validation (OUCS)
  Word class annotation (UCREL)
  Header generation and final validation (OUCS)
  Documentation, distribution, maintenance

Distinctive features of the BNC
  non-opportunistic design
  standardized markup system
  structural annotation
  word class annotation
  contextual information
  general availability
In these respects, the BNC remains distinctive, twenty years on!

Why BNC XML?
  The BNC is still widely used
    ... but the technology has moved on
  XML tools are everywhere
    ... so using the corpus is much easier
  Conversion to XML was easy and (fairly) automatic
    ... but with more tractable markup some dusty corners needed sweeping out

What's in the BNC?

[Chart: size of each corpus component. Categories: Spoken Demographic, Spoken Context Governed, Books and Periodicals, Other written. Values as extracted: 79238146, 6175896, 4233955, 8715786.]

Needles and haystacks
  The BNC has an extraordinary range
    travel agent brochures, weather reports, formal invitations, advertising, publicity leaflets, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, business letters, personal diaries and correspondence ...
  The problem is finding the specific texts you want
    Selection criteria
    Descriptive criteria
    Post-hoc categorization (or use the WLD principle)

BNC Design
  Criteria for written texts (90%)
    Medium (books, newspapers, unpublished)
    Domain (informative, entertaining)
  Criteria for transcribed speech events (10%)

    Context-governed half: predefined list of speech situations
    Demographically sampled half: 200 volunteers, sampled for age, sex, region
  These selection criteria make up a taxonomy, which is defined in the corpus header

What topics?
[Chart: size of each subject domain. Categories: Imaginative, Applied Science, Arts, Scientific, World Affairs, Belief, Social Science, Commerce, Leisure. Values as extracted: 16496420, 12237834, 3821902, 3037533, 14025537, 6574857, 7341163, 7174152, 17244534.]

Descriptive criteria
  spoken texts
    speaker occupation, perceived accent, education level, personal relationship
    speech domain, region, locale
  written texts
    author age, sex, type
    audience, circulation, status
    text-type classification
  These criteria were used to maximize variation once selectional constraints had been applied

Post-hoc text-type classification
[Chart: sentences and words per text type. Categories: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, Other Spoken.]

Annotation, encoding, markup
  A means of making explicit, and thus processable:
    structure: texts, sections, paragraphs, turns, sentences, words...
    metadata: text-type, situational parameters, context
    analysis: morphology, syntactic function, translation
  Adopting a single framework facilitates integration and sharing of fragmentary resources

    thus enhancing research outcomes
    also makes tool development much easier

BNC structure
[Diagram: top-level corpus structure. Element names shown: bnc, teiHeader, bncDoc, wtext, stext. Counts as extracted: 4049, 908.]

BNC-XML structure
[Diagram: element hierarchy within a text. Element names shown: wtext, stext, div, p, s, u, w. Counts as extracted: 1,599,69 (digits garbled), 6,026,284, 98,363,784, 784,484.]

Word class annotation
  CLAWS (Leech, Garside et al.) approach
  What counts as a word? This isn't prima facie obvious, in spite of spelling conventions.
  In BNC-XML, each word is explicitly marked and annotated with

    a root form or lemma
    an automatically assigned C5 word class code
    a simplified POS code

Words and multiwords
  English orthography can be misleading
    ... in spite of common sense ...
    ... it wasn't me
  In BNC XML, some multiwords are explicitly marked: in spite of

  word   c5    pos    hw
  it     PNP   PRON   it
  was    VBD   VERB   be
  n't    XX0   ADV    not
  me     PNP   PRON   i

Structure of written texts
  Most written texts are organized hierarchically into various kinds of division, shown by headings or other features:


  Some divisions are typed: e.g. chapter, section, story, subsection, column, front, part, recipe, leaflet...
  All spoken texts are divided into conversations

Features of written texts
  Paragraph-like
    <p> marks paragraphs
    <head> marks headings or captions
    <list> marks lists
    <quote> marks quotes
    <l> marks verse lines

  Paragraph-parts
    <hi> for typographic highlighting
    <corr> for corrected passages
    <gap> for deliberate omissions
    <pb> for page breaks

Speech in writing...
  Mr. Skinner: ... That millionaire mammy's boy
  [Interruption]
  Mr. Speaker: Order. That is not wholly unparliamentary.
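
The written-text markup described above (divisions, paragraphs, s-units, and words carrying c5, pos and hw annotations) can be read with any general-purpose XML parser. The following is a minimal sketch in Python using only the standard library. The XML fragment is hypothetical: it is modelled on the slide examples rather than copied from the corpus, and the <mw> wrapper, the c5 codes for "in spite of common sense" and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of BNC-XML, modelled on the slide examples.
# The <mw> wrapper and the attribute values are assumptions, not corpus data.
SAMPLE = """\
<wtext>
  <div type="section">
    <p>
      <s n="1">
        <mw c5="PRP">
          <w c5="PRP" hw="in" pos="PREP">in </w>
          <w c5="NN1" hw="spite" pos="SUBST">spite </w>
          <w c5="PRF" hw="of" pos="PREP">of </w>
        </mw>
        <w c5="AJ0" hw="common" pos="ADJ">common </w>
        <w c5="NN1" hw="sense" pos="SUBST">sense</w>
      </s>
      <s n="2">
        <w c5="PNP" hw="it" pos="PRON">it </w>
        <w c5="VBD" hw="be" pos="VERB">was</w>
        <w c5="XX0" hw="not" pos="ADV">n't </w>
        <w c5="PNP" hw="i" pos="PRON">me</w>
      </s>
    </p>
  </div>
</wtext>
"""


def annotated_words(s_elem):
    """Return (text, c5, pos, hw) for every <w> inside an <s>, in document order."""
    return [
        ((w.text or "").strip(), w.get("c5"), w.get("pos"), w.get("hw"))
        for w in s_elem.iter("w")
    ]


def multiwords(s_elem):
    """Return the surface text of every <mw> unit inside an <s>."""
    return [" ".join("".join(mw.itertext()).split()) for mw in s_elem.iter("mw")]


root = ET.fromstring(SAMPLE)
for div in root.iter("div"):
    print("division type:", div.get("type"))
    for s in div.iter("s"):
        print("  s-unit", s.get("n"))
        print("    multiwords:", multiwords(s))
        for text, c5, pos, hw in annotated_words(s):
            print(f"    {text!r:10} c5={c5} pos={pos} hw={hw}")
```

Because every word is an element in its own right, "wasn't" comes out as two annotated units (was, n't) while "in spite of" can still be retrieved as a single multiword, without any tokenization logic in the client code.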

Structure of spoken texts
  <u who="XXX"> marks a stretch of speech initiated by the speaker identified as XXX
  <align> marks a synchronization point
  detailed information on speakers is given in the text header
  other features of transcribed speech are also marked...

Features of spoken texts
  <shift> marks changes in voice quality: e.g. whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance
  <vocal> marks non-verbal but vocalised sounds: e.g. coughs, humming noises etc.
  <event> marks non-verbal and non-vocal events: e.g. passing lorries, animal noises, and other matters considered worthy of note
  <pause> marks significant pauses: silence, within or between utterances, longer than was judged normal for the speaker or speakers
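
The spoken markup described so far lends itself to simple counting. The sketch below, in Python with the standard library only, tallies words per speaker and collects the free-text descriptions of vocal and non-vocal events. The transcript fragment is hypothetical: PS09E is a speaker identifier that appears elsewhere in these slides, PS000 is made up, and the use of a desc attribute on <event> is an assumption by analogy with <vocal>.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical spoken fragment in the style of BNC-XML; speaker PS000 and
# every attribute value here are invented for illustration.
SAMPLE = """\
<stext>
  <u who="PS09E">
    <s n="1"><w c5="ITJ" hw="hello" pos="INTERJ">Hello </w><pause/>
      <w c5="AV0" hw="there" pos="ADV">there</w></s>
  </u>
  <u who="PS000">
    <s n="2"><vocal desc="cough"/><w c5="ITJ" hw="hi" pos="INTERJ">Hi</w></s>
  </u>
  <event desc="baby crying"/>
  <u who="PS09E">
    <s n="3"><vocal desc="laugh"/><w c5="ITJ" hw="yes" pos="INTERJ">Yes</w></s>
  </u>
</stext>
"""


def words_per_speaker(stext):
    """Count <w> elements inside each <u>, keyed by the who attribute."""
    counts = Counter()
    for u in stext.iter("u"):
        counts[u.get("who")] += sum(1 for _ in u.iter("w"))
    return counts


def descriptions(stext, tag):
    """Tally the desc attribute of every element with the given tag."""
    return Counter(e.get("desc") for e in stext.iter(tag))


stext = ET.fromstring(SAMPLE)
print("words per speaker:", dict(words_per_speaker(stext)))
print("vocal events:", dict(descriptions(stext, "vocal")))
print("non-vocal events:", dict(descriptions(stext, "event")))
print("pauses:", sum(1 for _ in stext.iter("pause")))
```

Keying the word counts on the who attribute is what allows them to be joined later to the speaker details held in the text header.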

  <unclear> marks unclear passages: whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons

Event descriptions
  baby
  baby burped
  baby cries
  baby cry
  baby crying
  baby crying in background
  baby gurgling
  baby laughing
  baby noise
  baby noises
  baby screaming
  baby shouting
  baby shouting over the top
  baby shouts
  baby speaking
  baby squealing
  baby talk
  baby talking
  background chatter
  background chatter in pub
  background chatter in pub
  background chatting shuffling etcetera
  background conversation

Vocal descriptions
  <vocal desc="big breath"/>
  <vocal desc="breathing out suddenly"/>
  <vocal desc="drawing in breath"/>
  <vocal desc="exhales"/>
  <vocal desc="indrawn breath"/>
  <vocal desc="inhales"/>
  <vocal desc="intake of breath"/>
  <vocal desc="sharp intake of breath"/>
  <vocal desc="takes a deep breath"/>
  <vocal desc="takes breath"/>

Contextual information
  each text has a TEI header
    identification and classification
    specific details (e.g. speakers)

  all common data in the corpus header
    classification(s) in header are pointed to by individual texts

Structure of the TEI Header
  File Description
    Title Statement
      Responsibility Statement/s
    Edition Statement
    Extent
    Publication Statement
      Identification numbers
    Source Description
  Encoding Description
    Tagging Declaration
  Profile Description
    Creation
    [Participant Description]
    Text Classification
  Revision Description

The title statement
  How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure)
  Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context
  32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings.
  [Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)

  <titleStmt>
    <title>The age of capital 1848-1875. Sample containing about 41650 words from a book (domain: world affairs)</title>
    <respStmt>
      <resp>Data capture and transcription</resp>
      <name>Oxford University Press</name>
    </respStmt>
  </titleStmt>

The edition statement
  BNC XML Edition, December 2006
  41650 tokens; 41573 w-units; 1436 s-units

  Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
  This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site for full licencing and distribution conditions.
  Identification numbers: J0P, AgeCap

The source description 1
  The age of capital 1848-1875. Hobsbawm, E J. Abacus, London, 1977.
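
The header fields quoted on these slides can be pulled out mechanically. The sketch below, in Python with the standard library only, reads a title, an edition and an extent and turns the extent string into numbers. The header fragment is a hypothetical reconstruction assembled from the values quoted in this talk; the nesting follows the standard TEI header layout, the absence of an XML namespace is an assumption, and header_summary is an illustrative name rather than part of any BNC tool.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical header fragment assembled from values quoted in the slides.
# Element nesting follows the standard TEI header; no namespace is assumed.
HEADER = """\
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>The age of capital 1848-1875. Sample containing about 41650 words from a book (domain: world affairs)</title>
      <respStmt>
        <resp>Data capture and transcription</resp>
        <name>Oxford University Press</name>
      </respStmt>
    </titleStmt>
    <editionStmt><edition>BNC XML Edition, December 2006</edition></editionStmt>
    <extent>41650 tokens; 41573 w-units; 1436 s-units</extent>
  </fileDesc>
</teiHeader>
"""


def header_summary(header):
    """Extract title, edition and numeric extent counts from a teiHeader element."""
    title = header.findtext("fileDesc/titleStmt/title", default="").strip()
    edition = header.findtext("fileDesc/editionStmt/edition", default="").strip()
    extent = header.findtext("fileDesc/extent", default="")
    # Each "<number> <unit>" pair in the extent string becomes one dictionary entry.
    counts = {unit: int(n) for n, unit in re.findall(r"(\d+)\s+([\w-]+)", extent)}
    return {"title": title, "edition": edition, "counts": counts}


summary = header_summary(ET.fromstring(HEADER))
print(summary["title"])
print(summary["edition"])
print(summary["counts"])
```

Run over every text header in turn, the same function would give a quick inventory of titles and sizes, which is one way of approaching the problem of finding specific texts in the corpus.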

The source description 2

The encoding description

The profile description (written)
  W nonAc: humanities arts
  History, Modern - 19th century
  Capitalism - History - 19th century
  World, 1848-1875

Classification codes
  Codes used are predefined in the Corpus header
  Written Domain
    Imaginative
    Natural and pure sciences
    Applied sciences
    ...

The profile description (spoken)
  1992-02-23
  20, Wayne, unemployed, Central South-west England
  ....
  Hampshire: Andover, local shop, visiting friends
  ...

Has English moved on?
  types of text

    e-mail
    web pages / blogs
    SMS
    personal letters
  topics
    globalization
    internet
    Elvis
    Word Perfect

Out of date?
  The composition (and date) of any corpus affects inferences drawn from it
  There aren't many alternatives
    Web-as-corpus
    sources of spoken texts?
    monitor corpora are non-replicable
    copyright permissions unrepeatable
  Quantitative and qualitative comparative evaluations of BNC coverage are needed
    but it's surprising how much is there

Why is it still useful?
  The BNC is a problematizing resource...
    complements (and corrects) intuition
    increases learner autonomy

    critiques the myth of the native speaker
  ... for teacher and learner alike
  XML makes it more usable by non-specialist software
  Its range and availability make it unique

Where can I get one?
  BNC XML: now available on DVD
    standalone single-user licence or institutional licence
    existing licensees should renew

XAIRA

  Delivered free with the BNC (and also available free from ...)
  Usable with any XML corpus
  Usable/ish on any platform

