Gender Classification of Japanese Authors
David Edwards & Cybelle Smith

Gendered Speech in Japanese
Gender of speaker may be overtly marked:
  • Gender-specific first-person pronouns: boku (male); ore (male); watashi (female or neutral)
Question: Does gender have less overt effects on Japanese texts as well? Can word choice, morphology, and writing style indicate gender, even in noisy environments like fiction writing?

Corpora
Peace Corpus
  • 29 personal essays by middle school students
  • Topic: Peace
  • 29 authors: 22 female, 7 male
Bookstudio Corpus
  • 485 installments of online novels
  • Genre: Fantasy
  • 40 authors: 20 female, 20 male
  • Also collected ~181 installments from authors of unknown gender (for future research)

Our Baseline - The Boku Test

Corpus       Male Accuracy   Female Accuracy   Overall Accuracy
Peace        .71             1.0               .93
Bookstudio   .91             .43               .67

Classifiers Used
Naive Bayes:
  • Build conditional probabilities of features given gender
  • Calculate probability of test data given a particular gender
  • Select highest-probability gender
SVM:
  • Used the free LIBSVM classification tool
  • Find dividing hyperplane in num-feature-dimensional space (requires problem-specific parameters chosen via cross-validation)
  • Apply hyperplane to test data
Also attempted Logistic Regression

ChaSen: Segmenter and POS-tagger
Output fields: Stem, Pronunciation, Lemma, Part of Speech

Features
Stem     Pron     Lemma   POS
KURAki   kuraki   KURAi   adjective - independent
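The three Naive Bayes steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each text has already been segmented into tokens (for example, the stems a tagger such as ChaSen would emit), uses add-one smoothing (an assumption, since the slides do not say how unseen features were handled), and runs on a made-up four-document toy set. The names `train_naive_bayes` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Step 1: estimate smoothed log P(feature | gender) and log P(gender)."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        feature_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(feature_counts[label].values())
        # The extra +1 slot reserves probability mass for unseen tokens.
        denom = total + alpha * (len(vocab) + 1)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                t: math.log((feature_counts[label][t] + alpha) / denom)
                for t in vocab
            },
            "log_unseen": math.log(alpha / denom),
        }
    return model

def classify(model, tokens):
    """Steps 2 and 3: score the text under each gender, pick the best."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            score += params["log_likelihood"].get(t, params["log_unseen"])
        scores[label] = score
    return max(scores, key=scores.get)

# Toy data (invented for illustration; romanized tokens stand in for stems).
docs = [["boku", "iku"], ["ore", "iku"], ["watashi", "iku"], ["watashi", "taberu"]]
labels = ["male", "male", "female", "female"]
model = train_naive_bayes(docs, labels)
prediction = classify(model, ["ore", "iku"])
```

Working in log space avoids numeric underflow when a text contains thousands of tokens, which matters for novel installments far more than for this toy set.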

Wordshape (script used to write a word):
  • Kanji (Chinese character)
  • Hiragana (phonetic)
  • Katakana (phonetic, like italics)

Single-feature performance on Naive Bayes:
Feature   Male Accuracy   Female Accuracy   Overall Accuracy
Indic     .29             .51               .40
Stem      .67             .77               .72
Lem       .68             .78               .73
Pron      .70             .74               .72
POS       .80             .45               .63
Quot      .23             .33               .28
WS        .66             .85               .76
SPDWS1    .49             .81               .66
SPDWS2    .87             .68               .77

Multi-feature performance on Naive Bayes:
Seven trials, each marking (X) a subset of the features Stem, Lem, Pron, POS, Quot, WS, SPDWS1, SPDWS2. The per-trial feature marks are not recoverable here; the reported accuracies, in order of appearance, are:
Male Acc.   Female Acc.   Overall Acc.
.63         .73           .68
.81         .73           .77
.70         .76           .73
.68         .76           .72
.68         .78           .73
.70         .70           .70
.70         .73           .71
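To make the wordshape idea concrete, the sketch below maps each token to the sequence of scripts it is written in, using Unicode block ranges. It is an assumption about how such a feature could be computed, not the authors' exact WS definition (the SPDWS variants are not defined in the slides and are not attempted here). The function names and the example spellings are illustrative.

```python
from collections import Counter

def script_of(ch):
    """Classify one character by its Unicode block."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    return "other"

def wordshape(token):
    """Collapse a token into its run of scripts, e.g. 'kanji+hiragana'."""
    shape = []
    for ch in token:
        script = script_of(ch)
        if not shape or shape[-1] != script:
            shape.append(script)
    return "+".join(shape)

def wordshape_counts(tokens):
    """Count how often each wordshape occurs in a segmented text."""
    return Counter(wordshape(t) for t in tokens)

# One word spelled three ways (illustrative spellings of "kuraki").
shapes = [wordshape(t) for t in ["暗き", "くらき", "クラキ"]]
counts = wordshape_counts(["暗き", "くらき", "クラキ", "暗い"])
```

Because the same word can be written in more than one script, these counts capture a choice made by the writer that the stem, lemma, and pronunciation features cannot see.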

SVM Performance
Optimizations:
  • Scaling counts to avoid swamping low-frequency features
  • Selecting optimal error rate and kernel parameters

Accuracy:
Features                           No Scaling   Scaling   Cross Validation (Training Set)   Cross Validation (Test Set)
All features (except quotations)   50.6%        48.5%     79.7%                             50.0%
Part of Speech                     50.9%        53.0%     68.0%                             47.3%
Wordshape                          50.6%        63.3%     75.2%                             50.6%
Pronunciation                      50.6%        64%       77.8%                             51.8%

Conclusion
  • Without considering gendered pronouns, we achieved performance similar to the Boku Test baseline
  • Most-indicative feature: wordshape (use of kanji vs. hiragana vs. katakana, etc.), especially where multiple options exist
  • Point of interest: male and female Japanese authors differ not just in the words they use, but in how they choose to write those words
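The two SVM optimizations named above (scaling counts, then choosing parameters by cross-validation) can be illustrated without any SVM library. The sketch below fits a linear min-max scaler on training rows and reuses the same ranges on test rows, then enumerates an exponentially spaced grid of candidate parameters. It rests on assumptions: the slides do not state the scaling range or the kernel, so a [0, 1] range and an RBF-style (C, gamma) grid are used only as common defaults, and all names and numbers are hypothetical.

```python
def fit_scaler(train_rows):
    """Record the per-feature minimum and maximum seen in the training rows."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def scale(rows, ranges, lower=0.0, upper=1.0):
    """Map each feature linearly so the training min/max land on [lower, upper]."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(lower)  # constant feature: no information to keep
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled

# Toy counts (invented): one high-frequency and one low-frequency feature.
train = [[1200.0, 3.0], [800.0, 1.0], [1000.0, 2.0]]
test = [[900.0, 4.0]]

ranges = fit_scaler(train)           # learned from training data only
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)    # may fall outside [0, 1]; that is expected

# Candidate (C, gamma) pairs to score by cross-validation (assumed RBF kernel).
grid = [(2.0 ** c, 2.0 ** g) for c in range(-5, 16, 2) for g in range(-15, 4, 2)]
```

Fitting the ranges on the training set alone keeps test information out of the model. Each grid pair would then be scored by cross-validated accuracy on the training set, and the best pair retrained on all training data before touching the test set.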
