Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature

Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Bioinformatics Centre, School of Biotechnology
Madurai Kamaraj University, Madurai 625021, INDIA

Purpose & Goals

- Extract gene-specific functional keywords from the biological literature
    - from full abstracts
    - from gene-specific sentences
- Augment the extracted keywords with MeSH and GO keywords related to each gene
- Compare the accuracy of the following keyword extraction methods on a test data set
    - Full abstracts
    - Gene-specific sentences
    - Gene-specific sentences + MeSH keywords
    - Gene-specific sentences + MeSH and GO keywords
- Use the keyword extraction method to cluster the differentially expressed genes from a microarray experiment

Outline

Two parts, I and II
- Part I: Text mining and keyword extraction from literature
    - Our text mining methodology
- Part II: Applications to microarrays
    - Functional keyword clustering of microarray data

Part I: Text Mining

Text Mining: Introduction and Overview

Text mining
- aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, association rules, hypotheses)
- includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM)
- is relevant to bioinformatics because of
    - the explosive growth of the biomedical literature (e.g. MEDLINE, 15 million records)
    - the availability of some information in textual form only, e.g. clinical records

Text Mining: System Architecture

[Figure: system architecture diagram. Labels recoverable from the diagram: MeSH / Gene Ontology, Microarray Experiment, MEDLINE Abstracts, Filtering, MeSH/GO Keyword Extraction, Gene List, Gene/Protein Dictionary, Set of Abstracts, Sentence Extraction, Annotation Keyword Extraction, Patterns, Visualization, Feature Vector Generation, Clustering.]
Experimental design of gene clustering with sentence-level, MeSH and GO keywords

Text Mining: Keyword Extraction from Biomedical Literature

Steps to extract sentence-level keywords

- Gene-synonym dictionary. A special dictionary of gene names and synonyms was created for human genes using Entrez Gene.
- Gene-name normalization. This step replaces every gene name in an abstract with its unique canonical identifier (Entrez Gene ID), using the gene-synonym dictionary constructed for this study.
- Sentence filtering. Sentences are filtered with corpus-specific regular expressions, as in the following example:

    ($gene @{0,6} $action (of|with) @{0,2} $gene)

The expression above extracts sentences that match the structure shown below. The notational construct A B ... is interpreted as A followed by B followed by ....

    [gene name] [0-6 words] [action verb] [of | with] [0-2 words] [gene name]

- Keyword extraction. Described on a later slide.

Text Mining: Keyword Extraction from Biomedical Literature

Table 1. An example set of regular expressions for nouns describing agents and actions, and for passive and active verbs

Nouns describing agents
    Expression pattern: ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
    Sentence output:    IL6, a known mediator of STAT3 response

Nouns describing actions
    Expression pattern: ($gene @{0,6} $action (of|with) @{0,1} $gene)
    Sentence output:    abi5 domains required for interaction with abi3

Passive verbs
    Expression pattern: ($gene @{0,6} (is|was|be|are|were) @{0,1} $action $(by|via|through) @{0,3} $gene)
    Sentence output:    Protein kinase c (PKC) has been shown to be activated by parathyroid hormone

Active verbs
    Expression pattern: ($gene $sub-action @{0,1} $action @{0,2} $gene)
    Sentence output:    Insulin mediated inhibition of hormone sensitivity lipase activity

Text Mining: Keyword Extraction from Biomedical Literature

Keyword extraction
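Before keyword extraction, sentences are filtered with patterns such as those in Table 1. A minimal sketch of how the "nouns describing actions" pattern might be expressed as an ordinary regular expression is given here. It rests on assumptions not stated in the slides: normalized gene identifiers are taken to look like `GENE` followed by digits, the list of action terms is a small hypothetical sample, and `@{m,n}` is read as "m to n arbitrary words".

```python
import re

# Hypothetical token format for a normalized gene identifier (assumption).
GENE = r"GENE\d+"
# Hypothetical sample of action terms; the study's own list is not given.
ACTION = r"(?:interaction|activation|inhibition|binding)"


def words(lo, hi):
    """Regex fragment for lo..hi whitespace-delimited words, i.e. @{lo,hi}."""
    return rf"(?:\S+\s+){{{lo},{hi}}}"


# Python rendering of: ($gene @{0,6} $action (of|with) @{0,2} $gene)
NOUN_ACTION = re.compile(
    rf"\b{GENE}\s+{words(0, 6)}{ACTION}\s+(?:of|with)\s+{words(0, 2)}{GENE}\b"
)


def filter_sentences(sentences):
    """Keep only the sentences that match the gene-action-gene pattern."""
    return [s for s in sentences if NOUN_ACTION.search(s)]


sentences = [
    "GENE1 domains required for interaction with GENE2",
    "GENE1 was cloned from liver tissue",
]
print(filter_sentences(sentences))
```

Only the first sentence has the gene, action, gene structure, so it is the only one kept. The other patterns in Table 1 could be written the same way by changing the fixed words and the word-gap limits.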

Example sentence:
    BRCA1 physically associates with p53 and stimulates its transcriptional activity.
Brill-POS-tagged sentence:
    BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.
Sentence keywords:
    associates, stimulates, transcription activity
Sentence keywords after manual curation:
    transcription activity

Text Mining: MeSH Keyword Extraction

MeSH keywords
    MeSH keywords are subject index terms assigned to each article by the National Library of Medicine (NLM) for subject indexing and for searching journal articles via PubMed.
MeSH keyword extraction
    Extracted directly from gene-specific abstracts via Perl scripts.
MeSH keyword curation
    Using a dictionary of MeSH stop words (e.g. human, DNA, animal, Support U.S. Govt).

For example, the MeSH keywords associated with the gene FOS in our gene list are oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity.

Text Mining: GO Keyword Extraction

GO keywords
    Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels of the hierarchy are more general than those further down.
GO keyword extraction
    Of the three GO annotation categories, we included only molecular function and biological process and left out cellular component, as it is less important for characterizing gene function.
    Further, because of the hierarchical nature of GO and the multiple inheritance in the GO structure, we also consider every ancestor up to level 2 in the GO tree.

For example, the GO keywords associated with the gene FOS in our gene list are protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response.

Text Mining: Keyword Representation and Calculation of Numeric Vectors

This step computes a numeric weight, w_ij, for each gene-keyword pair (g_i, t_j) (i = 1, 2, ..., n and j = 1, 2, ..., k) to represent each gene's characteristics in terms of its associated keywords. Common techniques for such numeric encoding include:

- Binary. The presence or absence of a keyword for a gene.
- Term frequency. The frequency of occurrence of a keyword with a gene.
- Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes.

Text Mining: TF*IDF Weighting

A widely used weighting scheme in information retrieval and text classification is the TF*IDF (term frequency / inverse document frequency) weighting scheme.
- TF(w, d) (term frequency) is the number of times word w occurs in document d.
- DF(w) (document frequency) is the number of documents in which word w occurs at least once.
- The inverse document frequency is calculated as

    $$ IDF(w) = \log\left( \frac{|D|}{DF(w)} \right) $$

  where |D| is the total number of documents in the corpus.

Text Mining: Keyword Representation and Calculation of Numeric Vectors

In our study, because the keywords are extracted from gene-specific sentences rather than from full abstracts, the number of keywords associated with each gene is small. Further, the frequency of occurrence of most keywords tended to be one. Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.

Table 2. Binary representation of genes * keywords

Genes / Terms    g1         g2         ...    gn
t1               w11 = 0    w12 = 1    ...    w1n = 0
t2               w21 = 1    w22 = 1    ...    w2n = 0
...              ...        ...        ...    ...
tk               wk1 = 1    wk2 = 0    ...    wkn = 1

Text Mining: Gene Clustering
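The input to clustering is the binary gene-by-keyword encoding of Table 2. A minimal sketch of that encoding follows, with the IDF formula from the TF*IDF slide included for comparison. The gene labels and keyword sets are made up for illustration, rows here are genes (the transpose of the Table 2 layout), and the natural logarithm is an assumption because the slides do not state the base.

```python
import math


def binary_matrix(gene_keywords):
    """Return (genes, terms, matrix) where matrix[i][j] is 1 if gene i has term j."""
    genes = sorted(gene_keywords)
    terms = sorted(set().union(*gene_keywords.values()))
    matrix = [[1 if t in gene_keywords[g] else 0 for t in terms] for g in genes]
    return genes, terms, matrix


def idf(term, documents):
    """IDF(w) = log(|D| / DF(w)), with documents given as collections of terms."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / df)


# Hypothetical genes and keywords, for illustration only.
gene_keywords = {
    "g1": {"apoptosis", "kinase"},
    "g2": {"kinase"},
    "g3": {"transcription"},
}

genes, terms, matrix = binary_matrix(gene_keywords)
for gene, row in zip(genes, matrix):
    print(gene, row)

# Treat each gene's keyword set as one "document" to get a keyword's IDF.
print(idf("kinase", list(gene_keywords.values())))
```

With binary weights every keyword counts equally. With TF*IDF, a keyword shared by many genes (here `kinase`) would receive a lower weight than one specific to a single gene.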

The binary coding scheme adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).

Clustering can produce useful and specific information about the biological characteristics of sets of genes.

Clustering: partition unlabeled examples into disjoint subsets (clusters), such that
- examples within a cluster are very similar
- examples in different clusters are very different
The aim is to discover new categories in an unsupervised manner.

Text Mining: Test Set and Evaluation

A test set of 20 genes with 10 abstracts per gene, giving a total of 200 abstracts in two cancer categories (Table 3), was used to evaluate the usefulness of our keyword extraction method.

Table 3. Test set of 20 human genes manually grouped into two cancer categories

Category         Genes
Brain Tumor      ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1
Breast Cancer    AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3

Text Mining: Evaluation
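The methods in this section are compared by precision, recall and F-measure. The F-measure values reported in the results table are consistent with the balanced harmonic mean of precision and recall, F = 2PR / (P + R), expressed as a percentage. A minimal sketch of that calculation is shown here; the function name `f_measure` is illustrative, and how precision and recall themselves were counted is not stated in these slides.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall), in percent."""
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the four keyword extraction methods.
reported = {
    "Abstract keywords (baseline)": (0.31, 0.24),
    "Sentence keywords only": (0.57, 0.38),
    "Sentence + MeSH keywords": (0.64, 0.47),
    "Sentence + MeSH + GO keywords": (0.78, 0.72),
}

for method, (p, r) in reported.items():
    print(f"{method}: F = {f_measure(p, r):.2f}%")
```

Small differences in the last digit against the reported values are expected, since the precision and recall figures in the table are themselves rounded to two decimals.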

Four keyword extraction methods were compared:

1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.
2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.
3. Sentence + MeSH keywords. As in (2) above, plus MeSH terms (see Section MeSH keyword extraction).
4. Sentence + MeSH + GO keywords. As in (2) above, plus MeSH terms (see Section MeSH keyword extraction) and GO terms (see Section GO keyword extraction).

Text Mining: Evaluation

Results of various keyword extraction methods

Keyword Extraction Method        Precision   Recall   F-measure (%)
Abstract keywords (baseline)     0.31        0.24     27.05
Sentence keywords only           0.57        0.38     45.60
Sentence + MeSH keywords         0.64        0.47     54.19
Sentence + MeSH + GO keywords    0.78        0.72     74.88

Part II: Applications to Microarrays

Functional keyword clustering of genes resulting from a microarray experiment

Applications to Microarrays: Data and Analysis

As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.

Compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e. the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).

Applications to Microarrays: Data and Analysis

Table 4. List of Differentially Expressed Genes

G(EGF) (19 genes)
    HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1
G(S1P) (35 genes)
    F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU
G(COM) (30 genes)
    MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA

Applications to Microarrays: Data and Analysis

- Querying MEDLINE with the three gene lists obtained from the microarray experiment (Table 4) returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).
- The abstracts were processed with the sentence-level keyword extraction method augmented with MeSH and GO keywords.
- The resulting keywords were encoded with the binary weighting scheme.
- The resulting representations were clustered using the average linkage hierarchical clustering algorithm.

Applications to Microarrays: Data and Analysis

Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study

Gene List   # of Genes in List   Retrieved Abstract Set   # of Abstracts in Set
G(EGF)      19                   A(EGF)                   28 913
G(S1P)      35                   A(S1P)                   19 705
G(COM)      30                   A(COM)                   39 890

Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm

Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

    $$ sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} sim(x, y) $$

- A compromise between single and complete link.
- Averaged across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters.
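The merge criterion above can be sketched in a few lines. This is an illustrative, naive implementation rather than the code used in the study: the Jaccard coefficient is assumed as the pairwise similarity between binary keyword sets (the slides do not name one), and the gene labels and keyword sets are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two keyword sets (assumed pairwise measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_average_sim(ci, cj, keywords):
    """Average pairwise similarity over all pairs in the merged cluster."""
    merged = list(ci | cj)
    m = len(merged)
    # Averaging over unordered pairs equals the ordered-pair average in the
    # formula because the pairwise similarity is symmetric.
    total = sum(jaccard(keywords[x], keywords[y])
                for x, y in combinations(merged, 2))
    return total / (m * (m - 1) / 2)


def cluster(keywords, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the best-scoring pair."""
    clusters = [frozenset([g]) for g in sorted(keywords)]
    while len(clusters) > n_clusters:
        ci, cj = max(
            combinations(clusters, 2),
            key=lambda pair: group_average_sim(pair[0], pair[1], keywords),
        )
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
    return clusters


# Hypothetical genes and keyword sets, for illustration only.
keywords = {
    "g1": {"transcription factor", "DNA-binding"},
    "g2": {"transcription factor", "DNA-binding", "zinc fingers"},
    "g3": {"angiogenesis", "inflammation"},
    "g4": {"angiogenesis", "cell-adhesion"},
}

result = cluster(keywords, n_clusters=2)
print([sorted(c) for c in result])
```

Note that this group-average criterion scores the whole merged cluster, including pairs that were already together before the merge. Library routines labelled "average linkage" often average only the pairs between the two clusters, so results may differ.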

Applications to Microarrays: Results

Summary of analysis of EGF cluster
[Figure: clustering diagram with G(EGF) gene labels and functional keyword labels. Keyword labels: antibiosis, osteoblasts, v-fos, fusion, sensation, immuno-reactivity, recombination, thermo-receptors, transition, clustering, intracellular, atherogeneses, glutamine-transport, DNA-methylation, felypressin, relaxation, tumorigenesis, desaturases, shape-regulation, assemble, secretion, biosynthesis, regulation, glycoprotein, androgens, odontogenesis, calmodulin-binding, trans-activators, zinc fingers, mitogenesis, inhibition, embryonic development, neural tube defects, transcription factor, cell death, embryogenesis, ion binding, angiogenesis.]

Applications to Microarrays: Results

Summary of analysis of S1P cluster
[Figure: clustering diagram with G(S1P) gene labels and functional keyword labels. Keyword labels: atherogenesis, mitogenesis, assemble, inflammation, angiogenesis, endocytosis, lymphocytes, pathogenesis, immune-response, DNA-dependent, focal-contact, DNA-damage, splicing, G1 phase, extracellular, motility, protein-binding, cos-cells, myosin, RNA localization, dose-response, anticodon, cytotoxicity, parasitophorous, G protein, demyelination, cytolysis, Ca release, locomotion, homeostasis, circulation, phosphorylation, synthesis, repair, protein kinase, endothelialization, organogenesis, cell-adhesion, mutagenesis.]

Applications to Microarrays: Results

Summary of analysis of COM cluster
[Figure: clustering diagram with G(COM) gene labels and functional keyword labels. Keyword labels: DNA modification, DNA methylation, jun genes, G2-M transition, mRNA splicing, immortality, DNA recombination, microtubule, gene silencing, helix-loop-helix motif, transcription factor, seizures, genome instability, oxidative stress, proto-oncogene, cell survival, signal transduction, maturation, endocytosis, differentiation, mitogenesis, mitosis, G2 phase, chemosensitivity, mutagenesis, lymphangiogenesis, ion binding, RNA processing, DNA-binding, zinc fingers, repressor proteins, DNA-dependent, nucleus, transactivation, leucine zippers, transcription, gene expression, regulation.]

Conclusions

- An important topic in microarray data mining is linking transcriptionally modulated genes to functional pathways, or determining how transcriptional modulation is associated with specific biological events such as genetic disease phenotypes or cell differentiation.
- However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.
- Further, Jenssen et al. (2001) compiled a network of human gene relationships from MEDLINE abstracts and compared these relationships with gene expression clustering results. This approach gave a very interesting result: functionally related genes can show totally different expression patterns and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).

Conclusions (continued)

- Our gene functional keyword clustering/grouping makes it possible to select functionally informative genes from among the differentially expressed genes for further investigation.
- Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.
- The current trend in text mining is toward full-text mining. Because full text contains a large number of irrelevant sentences compared to abstracts, this approach is more appropriate for full-text studies, as it filters out irrelevant sentences before clustering.

Acknowledgments

- Eric G. Bremer, Brain Tumor Research Program, Children's Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set
- Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK
- Members of the Bioinformatics Centre, Madurai Kamaraj University, India
- Dept. of Biotechnology, Govt. of India, for Bioinformatics facilities

THANK YOU
