PRESENTATION NAME - Lancaster University


COMPILING A FRENCH-SLOVENIAN PARALLEL CORPUS
Adriana Mezeg
University of Ljubljana, Department of Translation Studies
UCCTS 2010, Edge Hill University, Ormskirk

Presentation plan
1. Introduction
2. Part I: Corpus design and development
3. Part II: Case study
4. Conclusion

Introduction
- Definition of a parallel corpus: a collection of electronic texts originally written in a language A alongside their translations into a language B (Baker 1995), built according to explicit design criteria for a specific purpose (Atkins 1992)
- Situation in Slovenia and motives behind the compilation of a French-Slovenian corpus
- Objectives

Part I: Corpus design and development

- Medium, genre, size
- Collecting texts, getting permission from copyright holders
- Pre-processing
- Alignment
- Annotation
- Some statistics

Medium, genre, size
- Medium: written/spoken
- Availability of French-Slovene written parallel texts: EU documents, legal and administrative texts, promotional texts, journalistic texts, literature
- Genre choice

- Envisaged size: 1 million words per language

Collecting texts, getting permission from copyright holders
- Journalistic subcorpus: 300 articles from Le Monde diplomatique and Le Monde diplomatique v slovenščini (1 164 074 words); copyright permission granted
- Literary subcorpus: 12 contemporary French novels and their Slovene translations (1 302 911 words); problems getting copyright permission from French publishing houses

Pre-processing
- Getting/producing texts in electronic form
- Cleaning the texts (removal of tables, graphics, pictures, footnotes etc.)
- Converting the texts into text-only ANSI files (problems with special characters); Unicode UTF-8

Alignment
- Commercial automatic alignment tools: WinAlign (Trados), Atril Déjà Vu, ParaConc
- Automatic sentence alignment using Michael Barlow's ParaConc

- Due to differing numbers of segments (2-1, 1-2, 1-0), manual correction needed
- Example of (semi-)automatic alignment: ParaConc (see handout)

Annotation
- FraSloK is a morphosyntactically tagged corpus: each word in the corpus is assigned a grammatical tag corresponding to the word class to which it belongs (a part-of-speech tag)
- POS taggers for French: TreeTagger, MeLT tagger (see handout)
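The manual correction step on the Alignment slide can be illustrated with a minimal sketch. It assumes an alignment is stored as a list of links, each pairing zero or more source sentence indices with zero or more target sentence indices. The names Bead and needs_review are hypothetical and do not reflect ParaConc's internal format; the sketch only shows how links other than 1-1 (such as 2-1, 1-2 and 1-0) might be flagged for a human to check.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Bead:
    """One alignment link: source sentence indices paired with target indices."""
    src: Tuple[int, ...]
    tgt: Tuple[int, ...]

    @property
    def kind(self) -> str:
        # "2-1" means two source sentences aligned with one target sentence.
        return f"{len(self.src)}-{len(self.tgt)}"


def needs_review(beads: List[Bead]) -> List[Bead]:
    """Return every link that is not a plain 1-1 correspondence."""
    return [b for b in beads if b.kind != "1-1"]


# Hypothetical alignment of five French sentences with four Slovene ones.
beads = [
    Bead((0,), (0,)),    # 1-1: nothing to check
    Bead((1, 2), (1,)),  # 2-1: two source sentences merged in translation
    Bead((3,), (2, 3)),  # 1-2: one source sentence split in translation
    Bead((4,), ()),      # 1-0: source sentence left untranslated
]

for bead in needs_review(beads):
    print(bead.kind, bead.src, bead.tgt)
```

Keeping the indices rather than the sentence text makes it easy to re-link segments after a correction without touching the texts themselves.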

Some statistics

| | French part | Slovene part | Total/subcorpus (tokens) | Total/corpus (tokens) |
|---|---|---|---|---|
| LMD (journalistic subcorpus) | 637 297 | 526 777 | 1 164 074 | |
| LIT (literary subcorpus) | 701 715 | 601 196 | 1 302 911 | 2 466 985 |
| Total/language | 1 339 012 | 1 127 973 | | 2 466 985 |

Graph 1: Size of the French-Slovenian corpus (FraSloK) and its subcorpora.

| Category | French subcorpus (LMD) | Slovene subcorpus (LMD) |
|---|---|---|
| tokens | 637 297 | 526 777 |
| types | 38 994 | 63 514 |
| type/token ratio | 6.21 | 12.27 |
| standardised TTR | 46.89 | 60.30 |
| mean word length | 4.98 | 5.55 |
| sentences | 25 420 | 24 002 |
| average sentence length in words | 24.71 | 21.56 |

Graph 2: General statistics for the French and Slovene journalistic subcorpora.
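The type/token ratio (TTR) and standardised TTR rows can be read with the following minimal sketch in mind. It assumes a common definition: TTR is the number of distinct word forms divided by the number of tokens, as a percentage, and the standardised TTR is the mean of the TTRs of consecutive equal-sized chunks. The chunk size of 1000 tokens is an assumption, since the slides do not say which tool or setting produced the figures, and the function names are hypothetical. Because plain TTR falls as a text grows, the chunked version is the one that is comparable across subcorpora of different sizes.

```python
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Distinct forms divided by tokens, as a percentage (0.0 if empty)."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def standardised_ttr(tokens: Sequence[str], chunk_size: int = 1000) -> float:
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    A trailing partial chunk is ignored; if the text is shorter than
    one chunk, the plain TTR is returned instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)


# Toy input, not corpus data: 8 tokens, 5 distinct forms.
toy = "le chat et le chien et le rat".split()
print(type_token_ratio(toy))       # plain TTR over the whole toy text
print(standardised_ttr(toy, 4))    # mean TTR of two 4-token chunks
```

On this reading, the higher Slovene values in the table mean that the Slovene texts use more distinct word forms per stretch of text than the French ones.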

| Category | French subcorpus (LIT) | Slovene subcorpus (LIT) |
|---|---|---|
| tokens | 701 715 | 601 196 |
| types | 41 976 | 68 919 |
| type/token ratio | 5.99 | 11.47 |
| standardised TTR | 47.77 | 58.61 |
| mean word length | 4.54 | 4.82 |
| sentences | 42 350 | 42 151 |
| average sentence length in words | 16.55 | 14.25 |

Graph 3: General statistics for the French and Slovene literary subcorpora.

Part II: Case study
Translation of French detached constructions into Slovenian
- What are detached constructions?
- Problem: due to the specific syntactic and semantic characteristics of French initial detached constructions, their translation into Slovene is problematic
- Hypothesis: explicitation (Vinay and Darbelnet 1958, Blum-Kulka 1986)

Example: gerundive (en participle) detached constructions
- Semi-automatic extraction from ParaConc
- Syntactic patterns based on part-of-speech tags and regular expressions
- Example: En(\W\w+){0,1}(\W)?\w+
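The example pattern can be tried out with a minimal sketch. It assumes the pattern is applied to plain, untagged sentences with Python's re module, whereas the slides describe a search run inside ParaConc on part-of-speech-tagged text, so this only shows how the expression itself behaves. The first two sentences are shortened from the examples in this presentation; the last two are hypothetical, and initial_candidates is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# The pattern exactly as given on the slide.
PATTERN = re.compile(r"En(\W\w+){0,1}(\W)?\w+")

sentences = [
    "En m'emmenant trois jours en week-end, le directeur a cru bien faire.",
    "En souriant, il arrachait sa tunique.",
    "Ensuite, il arrachait sa tunique.",     # hypothetical: no gerund at all
    "Il arrachait sa tunique en souriant.",  # hypothetical: gerund not initial
]


def initial_candidates(sents: List[str]) -> List[Tuple[str, str]]:
    """Return (sentence, matched text) for sentences the pattern matches at position 0."""
    hits = []
    for sentence in sents:
        match = PATTERN.match(sentence)  # match() anchors at the sentence start
        if match:
            hits.append((sentence, match.group(0)))
    return hits


for sentence, matched in initial_candidates(sentences):
    print(f"{matched!r:20} <- {sentence}")
```

On plain text the pattern also accepts "Ensuite", because nothing forces a word boundary after "En". Hits of this kind are one plausible reason why automatic extraction needs to be followed by manual elimination of unsuitable examples.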

- Journalistic subcorpus: 90 occurrences out of 96 correct (94 %)
- Literary subcorpus: 157 occurrences out of 160 correct (98 %)
- After automatic extraction for all the syntactic patterns and manual elimination of unsuitable examples: 391 French initial detached constructions (DC) with a gerund as a base (journalistic corpus, JC: 134 occurrences, 34 %; literary corpus, LC: 257 occurrences, 66 %)

Distribution of translation strategies for detached constructions with a gerund as a base

[Bar chart: share of each translation strategy (axis from 0.0% to 80.0%) in the journalistic corpus and the literary corpus. Strategies shown: Subordination, Coordination, Other sentence relations, Detached construction, Adverbial phrase, Other.]

Examples

(1) En confiant le sale travail à l'Ethiopie, l'exécutif américain a pris le risque de ranimer des braises mal éteintes dans la région. (LMD, November 2007)
[In entrusting the dirty work to Ethiopia, the American executive risked rekindling badly extinguished embers in the region.]
Ko je ameriški izvajalec zaupal umazano delo Etiopiji, je tvegal, da se bo v regiji razpihala žerjavica, ki še ni dobro ugasnila. (Subordination)
[When the American executive entrusted the dirty work to Ethiopia, he risked ...]

(2) Puis, en secouant sa torpeur, jeta d'une voix rauque : Qu'est-ce que tu veux qu'on en fasse ? (Andreï Makine, The French Testament)
[Then, getting out of the numbness, said pointedly: - What do you want us to do about it?]
Potem se je zdrznil iz odrevenelosti in rezko odvrnil: - Kaj pa bi rada, da naredim? (Coordination)
[Then he got out of the numbness and said pointedly: - What do you want me to do?]

(3) En m'emmenant trois jours en week-end avec son trésorier et ses dobermans, le directeur de la chaîne a cru me faire passer à jamais le goût de la gaudriole. (Marie Darrieussecq, Pig Tales. A Novel of Lust and Transformation, 1996)
[By taking me for three days to his country house with his treasurer and his Dobermanns, the director of the chain thought that I would repress for ever my desire for hanky-panky.]
Direktor me je peljal na vikend z blagajnikom in s svojimi tremi dobermani, bil je prepričan, da me bo razuzdanost za vedno minila. (Other sentence relations, namely juxtaposition (absence of linking elements))
[The director took me to his country house with the treasurer and his three Dobermanns, he was convinced that I would get over for ever my hanky-panky.]

(4) En murmurant des supplications confuses, il m'écrasa sous son corps. (Shan Sa, Empress, 2003)
[By whispering confused pleas, he crushed me with his body.]
Šepetajoč zmedene prošnje me je sploščil s svojo težo. (Detached construction)
[(By) whispering confused pleas he crushed me with his body.]

(5) En souriant, il arrachait sa tunique, dénouait son pantalon de soie et dénudait son corps vigoureux. (Shan Sa, Empress, 2003)
[By smiling, he tore off his tunic, undid his silk trousers and stripped naked his vigorous body.]
Z nasmehom je s sebe strgal tuniko, si odvezal svilene hlače in razgalil svoje krepko telo. (Adverbial phrase, namely adjunct of manner)
[With a smile he tore off his tunic, undid his silk trousers and revealed his vigorous body.]

Conclusion
- Summing-up
- Future perspectives
  - Corpus enlargement, improvement, enrichment
  - Public access (SPOOK project, http://lojze.lugos.si/spook/korpus.html)
  - Further use

THANK YOU
