Overview of Chemical Informatics and Cyberinfrastructure Collaboratory March

Overview of Chemical Informatics and Cyberinfrastructure Collaboratory
March 15 2007
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
[email protected]
http://www.infomall.org
http://www.chembiogrid.org

Indiana University Summary

Indiana University is focusing on two major areas:
- Creating a comprehensive, easily accessible infrastructure for chemoinformatics tools and data sources, linked with PubChem and made available as web services, and partnering with screening centers and other users to demonstrate how this infrastructure can be usefully applied
  - Infrastructure can include any tools, not just ours (commercial/open source, chemoinformatics, bioinformatics, and so on)
  - New, custom applications can be built quickly using existing services, in a similar way to Google Maps and other Web 2.0 resources
- Being a central hub of chemoinformatics education, including offering distance courses on chemoinformatics theory and techniques, practical workshops on using chemoinformatics resources, and freely available web-based educational resources
  - We currently offer a Ph.D., an M.S., and a graduate certificate (distance) in chemical informatics
  - The distance education program allows you to pick and choose courses to meet educational needs; the certificate is awarded on completion of four courses

CICC Senior Personnel

Geoffrey C. Fox, Mu-Hyun (Mookie) Baik, Dennis B. Gannon, Kevin E. Gilbert, Rajarshi Guha, Marlon Pierce, Beth A. Plale, Gary D. Wiggins, David J. Wild, Yuqing (Melanie) Wu

From Biology, Chemistry, Computer Science, and Informatics at IU Bloomington and IUPUI (Indianapolis)

Peter T. Cherbas, Mehmet M. Dalkilic, Charles H. Davis, A. Keith Dunker, Kelsey M. Forsythe, John C. Huffman, Malika Mahoui, Daniel J. Mindiola, Santiago D. Schnell, William Scott, Craig A. Stewart, David R. Williams

CICC: Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org

CICC Combines Grid Computing with Chemical Informatics

Large Scale Computing Challenges
- Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.
- Chemical informatics text analysis programs can process 100,000s of abstracts of online journal articles to extract chemical signatures of potential drugs.
- OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic pleasingly parallel tasks. Top-ranking docked molecules can be further examined for drug potential.
- Big Red (and the TeraGrid) will also enable us to perform time-consuming, multi-step quantum chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.

Science and Cyberinfrastructure
- CICC is an NIH-funded project to support the chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.
- CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with:
  - World-class high performance computing
  - National-scale computing resources (TeraGrid)
  - Internet-standard web services
  - International activities for service orchestration
  - Open distributed computing infrastructure for scientists worldwide

Pipeline diagram labels: NIH PubMed DataBase; OSCAR Text Analysis; Initial 3D Structure Calculation; Molecular Mechanics Calculations; Cluster Grouping; Toxicity Filtering; Docking; Quantum Mechanics Calculations; NIH PubChem DataBase; POVRay Parallel Rendering; IU's Varuna DataBase

Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

CICC Web Service Infrastructure

Cheminformatics Services
- Core functionality: Fingerprints; Similarity; Descriptors; 2D diagrams; File format conversion

Statistics Services
- Computation functionality: Regression; Classification; Clustering; Sampling distributions

Database Services
- 3D structures by CID, SMARTS, 3D similarity
- Docking scores/poses by CID, SMARTS, protein, docking scores

Applications (Cheminformatics Services): Docking; Filtering; Druglikeness; Toxicity predictions; Mutagenicity predictions; Anti-cancer activity predictions; Pharmacokinetic parameters
Applications (Statistics Services): Predictive models; Feature selection; 2D plots; Arbitrary R code (PkCell)
Database Services (continued): PubChem related data by CID, SMARTS
Other services: OSCAR Document Analysis; InChI Generation/Search; Computational Chemistry (Gamess, Jaguar, etc.)
Grid Services: Varuna.net Quantum Chemistry; Service Registry; Job Submission and Management; Local Clusters; IU Big Red; TeraGrid; Open Science Grid
Portal Services: RSS Feeds; User Profiles; Collaboration as in Sakai

Web Service Locations

- Cambridge University: InChI generation/search, CMLRSS, OpenBabel
- Indiana University: Clustering, VOTables, OSCAR3, Toxicity classification, Database services, Statistics services
- VCC Laboratory: ALogPS
- NCI: CSLS
- University of Cologne: NMRShiftDB

Where Does The Functionality Come From?
- University of Michigan: PkCell
- gNova Consulting
- Digital Chemistry: BCI fingerprints, DivKMeans
- Cambridge University: InChI generation/search, OSCAR
- NIH: PubChem, PubMed
- CDK: Cheminformatics
- European Chemicals Bureau: ToxTree toxicity predictions
- OpenEye: Docking
- Indiana University: VOTables, NCI DTP predictions, Database services
- R Foundation: R package

CICC Infrastructure Vision
- Drug discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology
- ChemBioGrid is set up as a distributed cyberinfrastructure in the eScience model
- ChemBioGrid will provide portals (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations, and other analyses
- ChemBioGrid will provide services to manipulate this data and combine them in workflows; it will have convenient ways to submit and manage multiple jobs
- ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar
- The services include open-source software like CDK, commercial code from vendors such as BCI, OpenEye, Gaussian and Google, and any user-contributed programs
- ChemBioGrid will define open interfaces to use for a particular type of service, allowing plug-and-play choice between different implementations
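
The plug-and-play idea behind open interfaces can be illustrated with a minimal Python sketch. Everything named here (`SimilarityService`, `TanimotoService`, `DiceService`, `rank_by_similarity`) is hypothetical and not part of ChemBioGrid; fingerprints are modeled simply as sets of "on" bit positions. The point is only that client code written against one interface can switch implementations without change.

```python
from typing import Dict, FrozenSet, List, Protocol, Tuple


class SimilarityService(Protocol):
    """Hypothetical open interface: any implementation must offer this method."""

    def similarity(self, a: FrozenSet[int], b: FrozenSet[int]) -> float: ...


class TanimotoService:
    """One implementation: Tanimoto coefficient on fingerprint bit sets."""

    def similarity(self, a: FrozenSet[int], b: FrozenSet[int]) -> float:
        union = len(a | b)
        return len(a & b) / union if union else 1.0


class DiceService:
    """A second implementation of the same interface: Dice coefficient."""

    def similarity(self, a: FrozenSet[int], b: FrozenSet[int]) -> float:
        total = len(a) + len(b)
        return 2 * len(a & b) / total if total else 1.0


def rank_by_similarity(
    service: SimilarityService,
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
) -> List[Tuple[str, float]]:
    """Client code depends only on the interface, not on the implementation."""
    scored = [(name, service.similarity(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up fingerprints, for illustration only.
library = {
    "m1": frozenset({1, 2, 3}),
    "m2": frozenset({1, 9}),
    "m3": frozenset({7, 8}),
}
query = frozenset({1, 2, 3})

# Swapping the implementation requires no change to the client function.
for service in (TanimotoService(), DiceService()):
    print(type(service).__name__, rank_by_similarity(service, query, library))
```

In a real deployment the two classes would be remote services behind the same published interface description rather than local classes.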

Cheminformatics Education at IU
- Linked to bioinformatics in Indiana University's School of Informatics
- School of Informatics degree programs: BS, MS, PhD
- Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses
  - Bioinformatics MS and track on PhD
  - Chemical Informatics MS and track on PhD
- Informatics undergraduates can choose a chemistry cognate (change to Life Sciences)
- PhD in Informatics started in August 2005 and offers tracks in bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!
- Good employer interest but modest student understanding of the value of a Cheminformatics degree
- 3 core courses in Cheminformatics plus seminar/independent studies
- Significant interest in a distance education version of the introductory Cheminformatics course (enrollment promising in the Distance Graduate Certificate in Chemical Informatics)

Example: Spreading chemoinformatics education with CIC courseshare

- We have partnered with the University of Michigan to offer our introductory chemoinformatics course (I571) concurrently at Indiana University and the University of Michigan as a CIC courseshare, so UM pharmacy, chemistry and engineering students can be trained in chemoinformatics techniques for course credit at UM
- In addition, individual students in academia, government, and small and large life science companies have taken the class remotely from all over the country for credit towards the graduate certificate
- Uses a mixture of web conferencing (Breeze), videoconferencing, and online resources for maximum flexibility
- At minimum, all that is required is a telephone and an internet-connected PC
- Students can replay any of the classes using just a regular PC
- Most recent course wiki is available at http://cheminfo.informatics.indiana.edu/dj wild/I571_2006_wiki
- Photo caption: Giving a class remotely to UM students with video and web conferencing

MLSCN Post-HTS Biology Decision Support
- Percent inhibition or IC50 data is retrieved from HTS
- Question: Was this screen successful?
- Question: What should the active/inactive cutoffs be?
- Question: What can we learn about the target protein or cell line from this screen?
- Compounds submitted to PubChem
- Workflows encoding plate and control well statistics, distribution analysis, etc.
- Workflows encoding distribution analysis of screening results
- Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
- Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us), and enhance lead ID and SAR analysis
- A Grid of Grids linking collections of services at PubChem, ECCR centers, and MLSCN centers
- Diagram stage labels: PROCESS; CHEMINFORMATICS; GRIDS

Example HTS workflow: finding cell-protein relationships
- A protein implicated in tumor growth with a known ligand is selected (in this case HSP90, taken from the PDB 1Y4 complex)
- The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand
- Similar structures to the ligand can be browsed using client portlets
- Similar structures are filtered for drugability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein
- Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet
- Docking results and activity patterns are fed into R services for building of activity models and correlations (Least Squares Regression, Random Forests, Neural Nets)

Example: PubDock

- Database of approximately 1 million PubChem structures (the most druglike) docked into proteins taken from the PDB
- Available as a web service, so structures can be accessed in your own programs or with workflow tools like Pipeline Pilot
- Several interfaces developed, including one based on Chimera, which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target
- Can be used as a tool to help understand the molecular basis of activity in cellular or image-based assays

Example: R Statistics applied to PubChem data
- By exposing the R statistical package and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data
- Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications
- The example uses DTP Tumor Cell Line screens: a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines (available at http://rguha.ath.cx/~rguha/ncidtp/dtp)

Varuna environment for molecular modeling (Baik, IU)
Diagram labels: Chemical Concepts; Researcher; Papers etc.; ChemBioGrid; Experiments; Reaction DB; QM Database; PubChem, PDB, NCI, etc.; DB Service (Queries, Clustering, Curation, etc.); QM/MM Database; Simulation Service (FORTRAN Code, Scripts); Condor; TeraGrid; Supercomputers; Flocks

Methods Development at the CICC

- Tagging methods for web-based annotation exploiting del.icio.us and Connotea
- Development of QSAR model interpretability and applicability methods
- RNN-Profiles for exploration of chemical spaces
- VisualiSAR: SAR through visual analysis (see http://www.daylight.com/meetings/mug99/Wild/Mug99.html)
- Visual Similarity Matrices for High Volume Datasets (see http://www.osl.iu.edu/~chemuell/new/bioinformatics.php)
- Fast, accurate clustering using parallel Divisive K-means
- Mapping of natural language queries to use cases and workflows
- Advanced data mining models for drug discovery information
- Physics-based scoring algorithms

What Do You Get in a Web Service?
- WSDL for all services available
  - Collected on a web page
  - Available in a UDDI repository
- Javadocs or plain text descriptions
- Source code and associated unit tests
- Various client examples
  - Web pages (via PHP)
  - Python
  - Chimera

Web Service Vision
- Web services provide a neutral approach to exposing functionality
- You can utilize them in
  - Workflow tools: Pipeline Pilot, Taverna, XBaya
  - Desktop clients: Chimera, custom
  - Web pages
- They can be located anywhere
  - On your desktop
  - Intranet
  - Internet

Web Service Vision (continued)
- Literally anything can be made into a web service
  - Libraries
  - Standalone programs
  - Commercial code
  - Open-source code
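
As a rough illustration of turning a library function into a web service, the following Python sketch wraps a small molecular-weight function behind an HTTP endpoint using only the standard library. This is an assumption-laden simplification: the CICC services are described with WSDL, whereas this sketch uses plain JSON over HTTP, and the names `molecular_weight`, `handle_json`, `WeightHandler`, and `serve`, as well as port 8080, are hypothetical.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Standard atomic masses for a few elements (enough for the demo).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """The 'library' being exposed: weight of a simple formula such as C2H6O."""
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject empty input and anything the simple grammar does not fully cover.
    if not tokens or "".join(sym + num for sym, num in tokens) != formula:
        raise ValueError(f"cannot parse formula: {formula!r}")
    return sum(ATOMIC_MASS[sym] * int(num or 1) for sym, num in tokens)


def handle_json(body: bytes) -> bytes:
    """Transport-neutral service logic: JSON request in, JSON response out."""
    try:
        formula = json.loads(body)["formula"]
        result = {"formula": formula, "weight": round(molecular_weight(formula), 3)}
    except (ValueError, KeyError, TypeError) as exc:
        # Covers malformed JSON, a missing field, and unknown elements.
        result = {"error": str(exc)}
    return json.dumps(result).encode("utf-8")


class WeightHandler(BaseHTTPRequestHandler):
    """Thin HTTP layer: all real work happens in handle_json."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = handle_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the service locally until interrupted (port is an arbitrary choice)."""
    HTTPServer(("localhost", port), WeightHandler).serve_forever()


# The service logic can be exercised without starting a server.
print(handle_json(b'{"formula": "C2H6O"}').decode("utf-8"))
```

Calling `serve()` would start the endpoint; keeping the logic in `handle_json` means the same function could sit behind a different transport, which is one way to read the "neutral approach" point above.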

RSS Feeds
- Provide access to DBs via RSS feeds
- Feeds include 2D/3D structures in CML
- Viewable in Bioclipse, Jmol, as well as Sage etc.
- Two feeds currently available
  - SynSearch: get structures based on full or partial chemical names
  - DockSearch: get best N structures for a target

R, CDK & PubChem
- Goals
  - Access cheminformatics from within R
  - Access PubChem data from within R
- The rcdk package allows cheminformatics to be done within R using CDK functionality
- rpubchem provides access to PubChem compound data and bioassay data
  - Searchable via assay ID, keywords
- J. Stat. Soft., 2007, 18(6)

Databases
- Most of our databases aim to add value to PubChem or link into PubChem
- 3D structures (MMFF94)
- We maintain a local mirror for testing, data mining
- Searchable by CID, SMARTS, 3D similarity
- Docked ligands (FRED)
  - 906K drug-like compounds into 7 ligands
  - Will eventually cover ~2000 targets

(Cheminformatics) Algorithm Development
- Goals
  - Focus on interpretability and applicability
  - Devise novel approaches to clustering problems
  - Investigate the utility of low-dimensional representations for a variety of problems
- Examples
  - Ensemble feature selection (JCIM, in press)
  - Cluster counting with R-NN curves (in revision)

Chemical Data Mining

- Working on screening data with Scripps, FL
  - Random forests (modeling & feature selection)
  - Naive Bayes (modeling)
  - Identifying features indicative of toxicity
  - Domain applicability
- NCI DTP cell line activity predictions
  - Random forest models for 60 cell lines
- All available as
  - downloadable R models
  - web services (supply SMILES, get prediction) with web page clients
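
To make the idea of per-cell-line random forest predictions concrete, here is a toy Python sketch. It assumes one common convention, that a forest's probability of activity is the fraction of its trees voting "active"; the slides do not say how the CICC models compute their outputs. The trees are hand-written one-split stumps rather than trained models, and all names (`stump`, `forest_probability`, `predict_cell_lines`, `line_A`, `line_B`) and descriptor values are made up for illustration.

```python
from typing import Callable, Dict, List, Sequence

# A "tree" here is any function mapping a descriptor vector to 0 (inactive) or 1 (active).
Tree = Callable[[Sequence[float]], int]


def stump(feature: int, threshold: float) -> Tree:
    """Build a one-split decision tree (a stump); a toy stand-in for a trained tree."""
    return lambda x: 1 if x[feature] > threshold else 0


def forest_probability(trees: List[Tree], x: Sequence[float]) -> float:
    """Probability of activity taken as the fraction of trees voting 'active'."""
    return sum(tree(x) for tree in trees) / len(trees)


def predict_cell_lines(
    models: Dict[str, List[Tree]], x: Sequence[float]
) -> Dict[str, float]:
    """Apply one forest per cell line to the same descriptor vector."""
    return {line: forest_probability(trees, x) for line, trees in models.items()}


# Hypothetical forests for two cell lines; real models would be trained on screening data.
models = {
    "line_A": [stump(0, 0.5), stump(1, 2.0), stump(0, 1.5), stump(1, 0.1)],
    "line_B": [stump(0, 3.0), stump(1, 5.0)],
}
descriptors = [1.0, 1.0]  # made-up descriptor values for one compound

print(predict_cell_lines(models, descriptors))
```

A service of the "supply SMILES, get prediction" kind would add a descriptor-calculation step in front of this, turning the SMILES string into the numeric vector the forests consume.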
