Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016
Habibollah Asghari
ICT Research Institute, ACECR, Tehran, Iran

Outline of Presentation
- History of work on Persian plagiarism detection (PD)
- Persian PlagDet 2016
- Evaluation framework
- Results of the shared task

Persian Language Characteristics
Persian belongs to the Arabic script-based languages, which share some common linguistic characteristics:
- Absence of capitalization
- Right-to-left direction
- Lack of clear word boundaries
- Complex word structure
- Encoding issues in computer environments
- A high degree of ambiguity due to the non-representation of short vowels in the writing system

1 History of Work on Persian Plagiarism Detection

PlagDet Activities in Iran
PlagDet task at AAIC:
- The first competition on Persian plagiarism detection was held with the 3rd AmirKabir Artificial Intelligence Competition (AAIC 2015).
- It resulted in the release of the first PD corpus in Persian.
- The PAN structure for evaluation and corpus annotation was used in this competition.

Lessons Learned from AmirKabir AAIC
- The low number of participants
    - Importance of the CFP
- High rate of withdrawals
    - Free registration
- Only local teams participated in the competition
    - Using an online platform to support remote participation
- Unfamiliarity with rules and criteria
    - More effective communication with participants using discussion groups and email
- Problems with running the code
    - Using a unified platform (TIRA) instead of running the code on users' computers
- Incentives for the participants
    - High-ranked teams will receive awards
    - Holding a local ceremony to introduce the award winners
- Intellectual property of the corpora
    - Writing terms and conditions

2 Persian PlagDet 2016

Starting the Shared Task: Important Dates
Starting Persian PlagDet 2016:
- Nov 29, 2015: Letter of request to the PAN organizers
- April 20, 2016: Proposal submitted
- May 1, 2016: Proposal approved
- May 18, 2016: Skype meeting with the PAN organizers
- July 15, 2016: Training data release
- Aug 22, 2016: Test data release
- Sept 1, 2016: Run submission deadline
- Sept 15, 2016: Results declared
- Oct 15, 2016: Working notes due

Shared Task Description
Subtask #1: Plagiarism Detection
- To identify similar text fragments between pairs of suspicious documents and source documents.
Subtask #2: Text Alignment Corpus Construction
- The construction of a text alignment plagiarism detection corpus.
- The corpus may be Persian monolingual, or bilingual, combining Persian with any other language.
- The main objectives are:
    o Validating Persian PD corpora.
    o Evaluating the submitted corpora in order to rank them by quality.
    o Running the PD systems proposed in subtask #1 on the submitted corpora.

Main Challenges (Second Subtask)
- How to evaluate the corpora? Finding some quality measures:
    - Versatility in types of obfuscation
    - Tag correctness
    - Size of the corpus
    - Topic versatility
    - Existence of real plagiarism (was near-duplicate detection code run?)
- Corpora ownership
    - Writing terms and conditions
    - Granting open-source corpora

Launching the Shared Task Website

Call for Participation
- Emailing the CFP to all key players in Iran
- Sending the CFP to Iranian students at universities overseas
- FIRE mailing list
- CLEF mailing list
- DBWorld mailing list
- Corpora-list mailing list
- PAN-AraPlagDet mailing list
- Related NLP groups on LinkedIn

Participants
A total of 31 teams registered:
- Iran (23)
- USA (1)
- United Kingdom (2)
- Germany (1)
- Ukraine (1)
- India (2)
- Egypt (1)
Eleven teams reached the final stage.

Participant groups: OS requested on TIRA
- Ubuntu 14.04 Desktop
- Windows 7
- Ubuntu 12.04 Desktop

3 Evaluation Framework

Evaluation Framework
Diagram labels: Corpus, Submission Platform, Evaluation Measure, PD Corpus, Crowd-Sourcing, TIRA Platform

Training Corpus Statistics

Corpus statistics of documents

Entire corpus             Number of documents           5830
                          Number of plagiarism cases    4118
Document purpose          Source documents              48%
                          Suspicious documents          52%
Document length           Short (1-500 words)           35%
                          Medium (500-2500 words)       59%
                          Long (2500-21000 words)       6%
Plagiarism per document   Small (5%-20%)                57%
                          Medium (21%-50%)              15%
                          Much (50%-80%)                18%
                          Entirely (>80%)               10%

Plagiarism case statistics

Number of cases                                         1628
Obfuscation               None (exact copy)             11%
                          Artificial                    81%
                            Low                         40%
                            High                        41%
                          Simulated                     8%
Case length               Short (30-50 words)           35%
                          Medium (100-200 words)        38%
                          Long (200-300 words)          27%

The statistics of the training corpus are almost the same as those of PAN, but the documents are about 10% longer than those of PAN.

[Figure: training corpus statistics]

Crowdsourcing Platform
Two criteria for removing poor-quality plagiarized passages (~10%):
- The resulting passage is much shorter than the original one
- The resulting passage is too similar to the original one

Crowdsourcing Platform: Corpus Builder Tuning [figure]
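The two removal criteria for crowdsourced passages can be illustrated with a small filter. This is only a sketch under stated assumptions: the slides give no numeric cut-offs, so `MIN_LENGTH_RATIO` and `MAX_SIMILARITY` are hypothetical, the function name `is_poor_quality` is invented for illustration, and whitespace tokenization is a simplification given that Persian lacks clear word boundaries.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the slides do not state the actual cut-offs.
MIN_LENGTH_RATIO = 0.5   # rewritten passage must keep at least half the words
MAX_SIMILARITY = 0.9     # rewritten passage must not be a near copy


def is_poor_quality(original: str, rewritten: str) -> bool:
    """Return True if a crowdsourced passage fails either criterion."""
    orig_words = original.split()
    new_words = rewritten.split()
    if not orig_words:
        return True  # nothing to compare against

    # Criterion 1: the result is much shorter than the original.
    length_ratio = len(new_words) / len(orig_words)

    # Criterion 2: the result is too similar to the original
    # (word-level similarity in the range 0.0 to 1.0).
    similarity = SequenceMatcher(None, orig_words, new_words).ratio()

    return length_ratio < MIN_LENGTH_RATIO or similarity > MAX_SIMILARITY


if __name__ == "__main__":
    print(is_poor_quality("a b c d e f", "a b"))          # too short
    print(is_poor_quality("a b c d", "a b c d"))          # unchanged copy
    print(is_poor_quality("the quick brown fox jumps",
                          "a fast dark fox leaps"))       # acceptable rewrite
```

A real pipeline would likely normalize Persian text (for example, unifying character encodings) before comparing, which this sketch does not attempt.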

Crowd Workers' Demographics

Age                 25-30             41%
                    30-40             38%
                    40-58             21%
Education           College           5%
                    BSc.              25%
                    MSc.              58%
                    PhD               12%
Tasks per worker    Average           19.0
                    Std. deviation    14.5
                    Minimum           1
                    Maximum           54
Gender              Male              74%
                    Female            26%

TIRA EaaS Platform
- Provides virtual machines, allowing convenient deployment and execution
- A variety of operating systems is available to participants
- TIRA offers a convenient web GUI that allows self-evaluation
- TIRA allows submitted software to be evaluated again against test datasets in the future, improving reproducibility

4 Results of Subtask #1

Results of Subtask #1
Overall detection performance for the nine approaches submitted.

Rank  Team           Runtime (h:m:s)  Recall  Precision  Granularity  F-Measure  PlagDet
1     Mashhadi       02:22:48         0.9191  0.9268     1.0014       0.9230     0.9220
2     Gharavi        00:01:03         0.8582  0.9592     1            0.9059     0.9059
3     Momtaz         00:16:08         0.8504  0.8925     1            0.8710     0.8710
4     Minaei         00:01:33         0.7960  0.9203     1.0396       0.8536     0.8301
5     Esteki         00:44:03         0.7012  0.9333     1            0.8008     0.8008
6     Talebpour      02:24:19         0.8361  0.9638     1.2275       0.8954     0.7749
7     Ehsan          00:24:08         0.7049  0.7496     1            0.7266     0.7266
8     Gillam         21:08:54         0.4140  0.7548     1.5280       0.5347     0.3996
9     Mansourizadeh  00:02:38         0.8065  0.9000     3.5369       0.8507     0.3899

Results of Subtask #1
Detection performance of submitted runs by obfuscation type.

No obfuscation (verbatim copy)
Team           Recall  Precision  Granularity  PlagDet
Mashhadirajab  0.9939  0.9403     1            0.9663
Gharavi        0.9825  0.9762     1            0.9793
Momtaz         0.9532  0.8965     1            0.9240
Minaei         0.9659  0.8663     1.0113       0.9060
Esteki         0.9781  0.9689     1            0.9735
Talebpour      0.9755  0.9775     1            0.9765
Ehsan          0.8065  0.7333     1            0.7682
Gillam         0.7588  0.6257     1.4857       0.5221
Mansourizadeh  0.9615  0.8821     3.7740       0.4080

Artificial obfuscation
Team           Recall  Precision  Granularity  PlagDet
Mashhadirajab  0.9473  0.9416     1.0006       0.9440
Gharavi        0.8979  0.9647     1            0.9301
Momtaz         0.9019  0.8979     1            0.8999
Minaei         0.8514  0.9324     1.0240       0.8750
Esteki         0.7758  0.9473     1            0.8530
Talebpour      0.8971  0.9674     1.2074       0.8149
Ehsan          0.7542  0.7573     1            0.7557
Gillam         0.4236  0.7744     1.5351       0.4080
Mansourizadeh  0.8891  0.9129     3.6011       0.4091

Simulated obfuscation
Team           Recall  Precision  Granularity  PlagDet
Mashhadirajab  0.8045  0.9336     1.0047       0.8613
Gharavi        0.6895  0.9682     1            0.8054
Momtaz         0.6534  0.9119     1            0.7613
Minaei         0.5618  0.9110     1.1173       0.6422
Esteki         0.3683  0.8982     1            0.5224
Talebpour      0.5961  0.9582     1.4111       0.5788
Ehsan          0.5154  0.7858     1            0.6225
Gillam         0.2564  0.7748     1.5308       0.2876
Mansourizadeh  0.4944  0.8791     3.1494       0.3082
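The slides report PlagDet without defining it. As a sketch, assuming the standard PAN definition (the F-measure of precision and recall divided by log2(1 + granularity)), the score can be recomputed from the other columns. The function names below are illustrative, not part of any PAN tool.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """PlagDet under the assumed PAN definition: F / log2(1 + granularity).

    A granularity of 1 means each plagiarism case is reported as a single
    detection, in which case PlagDet equals the F-measure.
    """
    if granularity < 1:
        raise ValueError("granularity is at least 1 by definition")
    return f_measure(precision, recall) / math.log2(1 + granularity)


if __name__ == "__main__":
    # Figures taken from the overall results for two of the teams.
    print(round(plagdet(0.9592, 0.8582, 1.0), 4))     # Gharavi
    print(round(plagdet(0.9000, 0.8065, 3.5369), 4))  # Mansourizadeh
```

Under this assumption the reported values are reproduced to within rounding. It also explains the ranking: Mansourizadeh's F-measure (0.8507) is higher than that of several teams ranked above it, but a granularity of 3.5369 divides the score by roughly 2.18.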

5 Results of Subtask #2

Five Corpora Submitted
- The sources of the documents:
    - Journal and conference papers
    - Wikipedia articles in Persian
    - Books and novels translated into Persian
- Quantitative and qualitative analysis of the submitted corpora
    - Manual checking of the submitted corpora
    - Offsets and lengths of plagiarized passages according to PAN

Results of Subtask #2: Corpus Validation
Corpus statistics for the submitted corpora (a dash marks a cell with no value).

                                 Niknam  Samim  Mashhadirajab  ICTRC  Abnar
Entire corpus
  Number of documents            3218    4707   11089          5755   2470
  Number of plagiarism cases     2308    5862   11603          3745   12061
Document purpose
  Source documents               52%     50%    48%            49%    20%
  Suspicious documents           48%     50%    52%            51%    80%
Document length
  Short (1-10000 words)          35%     2%     53%            91%    51%
  Medium (10000-30000 words)     56%     48%    32%            8%     48%
  Long (>30000 words)            9%      50%    15%            1%     1%
Plagiarism per document
  Hardly (<20%)                  71%     29%    39%            57%    29%
  Medium (20%-50%)               28%     25%    14%            37%    60%
  Much (50%-80%)                 1%      31%    20%            6%     10%
  Entirely (>80%)                -       15%    27%            -      1%
Case length
  Short (1-500 words)            21%     15%    6%             51%    45%
  Medium (500-1500 words)        76%     22%    52%            46%    54%
  Long (>1500 words)             3%      63%    42%            3%     1%

Obfuscation types (row labels as listed on the slide):
No obfuscation (exact copy); Artificial (word replacement); Artificial (synonym replacement); Artificial (POS-preserving shuffling); Random; Semantic; Near Copy; Summarizing; Paraphrasing; Modified Copy; Circle Translation; Semantic-based meaning; Auto Translation; Translation; Simulated; Shuffle Sentences; Combination

Obfuscation-type shares per corpus, in slide order (empty cells are not preserved in this text, so the values cannot be matched to the row labels above):
  Niknam         25%, 27%, 25%, 23%
  Samim          40%, 40%, 20%
  Mashhadirajab  17%, 28%, 33%, 6%, 4%, 3%, 1%, 2%, 6%
  ICTRC          10%, 81%, 9%
  Abnar          22%, 15%, 21%, 21%, 21%

Corpus validation: length distribution of documents [figure]

Corpus validation: length distribution of fragments [figure]
Corpus validation: ratio of plagiarism per document [figure]
Corpus validation: start position of plagiarized fragments in suspicious documents [figure]
Corpus validation: start position of plagiarized fragments in source documents [figure]
Corpus validation: comparison of the simulated part of the Mashhadi and ICTRC corpora with the PAN corpus [figure]
Corpus validation: comparison of the artificial parts of the Niknam, Samim, Mashhadi and ICTRC corpora with each other [figure]

Corpora Evaluation
PlagDet performance of some submitted approaches on the submitted corpora. The objective is to assess how difficult it is to detect plagiarism within these corpora.

Corpora (columns): Niknam, Samim, Mashhadirajab, ICTRC, Abnar
Teams (rows): Gharavi, Momtaz, Minaei, Esteki, Ehsan, Mansourizadeh

PlagDet values for Gharavi, Momtaz, Minaei, Esteki and Ehsan, in slide order (empty cells are not preserved in this text, so the values cannot be matched to individual cells):
0.8657 0.8161 0.9042 0.5758 0.7196 0.7386 0.6585 0.5367 0.5784 0.3877 0.4014 0.9253 0.8924 0.8633 0.7104 0.3927 0.7218 0.3830 0.5890

Mansourizadeh: Niknam 0.2984, Samim -, Mashhadirajab 0.1286, ICTRC -, Abnar 0.2687

Notebook Papers
1. Ehsan, N., Shakery, A. A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection.
2. Esteki, F., Safi Esfahani, F. A Plagiarism Detection Approach Based on SVM for Persian Texts.
3. Gharavi, E., Bijari, K., Zahirnia, K., Veisi, H. A Deep Learning Approach to Persian Plagiarism Detection.
4. Gillam, L., Vartapetiance, A. From English to Persian: Conversion of Text Alignment for Plagiarism Detection.
5. Mansoorizadeh, M., Rahgooy, T. Persian Plagiarism Detection Using Sentence Correlations.
6. Mashhadirajab, F., Shamsfard, M. A Text Alignment Algorithm Based on Prediction of Obfuscation Types Using SVM Neural Network.
7. Mashhadirajab, F., Shamsfard, M., Adelkhah, R., Shafiee, F., Saedi, S. A Text Alignment Corpus for Persian Plagiarism Detection.
8. Minaei, B., Niknam, M. An n-gram based Method for Nearly Copy Detection in Plagiarism Systems.
9. Momtaz, M., Bijari, K., Salehi, M., Veisi, H. Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents.
10. Rezaei Sharifabadi, M., Eftekhari, S. A. Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems.
11. Talebpour, A., Shirzadi, M., Aminolroaya, Z. Plagiarism Detection based on a Novel Trie-based Approach.

Thank you for your attention
