EFFECTIVE BIO-EVENT EXTRACTION USING TRIGGER WORDS AND SYNTACTIC DEPENDENCIES

Article; Open access; Peer reviewed

2011; Wiley; Volume: 27; Issue: 4; Language: English

10.1111/j.1467-8640.2011.00401.x

ISSN

1467-8640

Authors

Halil Kilicoglu, Sabine Bergler

Topic(s)

Natural Language Processing Techniques

Abstract

Computational Intelligence, Volume 27, Issue 4, pp. 583-609

Halil Kilicoglu and Sabine Bergler, Department of Computer Science and Software Engineering, Concordia University, Montréal, Canada

First published: 27 November 2011

Correspondence: Halil Kilicoglu, Concordia University, Department of Computer Science and Software Engineering, 1455 de Maisonneuve Blvd West, Montréal, QC H3G 1M8, Canada

The scientific literature is the main source for comprehensive, up-to-date biological knowledge. Automatic extraction of this knowledge facilitates core biological tasks, such as database curation and knowledge discovery. We present here a linguistically inspired, rule-based, and syntax-driven methodology for biological event extraction. We rely on a dictionary of trigger words to detect and characterize event expressions, and on heuristics based on syntactic dependencies to extract their event arguments. We refine and extend our prior work to recognize speculated and negated events. We show that heuristics based on syntactic dependencies, used to identify event arguments, extend naturally to also identify speculation and negation scope.

In the BioNLP'09 Shared Task on Event Extraction, our system placed third in the Core Event Extraction Task (F-score of 0.4462) and first in the Speculation and Negation Task (F-score of 0.4252). Of particular interest is the extraction of complex regulatory events, where it placed second. Our system significantly outperformed other participating systems in detecting speculation and negation. These results demonstrate the utility of a syntax-driven approach. In this article, we also report on our more recent work on supervised learning of event trigger expressions and discuss event annotation issues, based on our corpus analysis.
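The pipeline outlined in the abstract (trigger dictionary lookup, dependency-based argument extraction, and reuse of the same dependencies for speculation and negation) can be illustrated with a minimal sketch. This is not the authors' system: the trigger entries, the relation sets, the cue list, the `Token` and `Event` types, and the hand-written parse are all assumptions made for illustration, and only simple events with protein Themes are handled.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical mini-dictionary: trigger lemma -> event type (illustrative only).
TRIGGERS = {
    "expression": "Gene_expression",
    "phosphorylation": "Phosphorylation",
    "induce": "Positive_regulation",
}
# Assumed dependency relations that link a trigger to its Theme argument.
THEME_RELS = {"dobj", "prep_of", "nsubjpass"}
# Assumed speculation cues and the clausal relations through which they scope.
SPECULATION_CUES = {"suggest", "appear", "may"}
CLAUSAL_RELS = {"ccomp", "xcomp"}


@dataclass
class Token:
    text: str
    lemma: str
    is_protein: bool = False


@dataclass
class Event:
    event_type: str
    trigger: str
    themes: list
    negated: bool
    speculated: bool


def extract_events(tokens, edges):
    """Extract simple events from one sentence.

    tokens: list of Token; edges: (head_index, relation, dependent_index) triples.
    """
    children = defaultdict(list)
    parents = defaultdict(list)
    for head, rel, dep in edges:
        children[head].append((rel, dep))
        parents[dep].append((rel, head))

    events = []
    for i, tok in enumerate(tokens):
        event_type = TRIGGERS.get(tok.lemma)
        if event_type is None:
            continue  # not a trigger word
        # Argument heuristic: protein dependents reached via a Theme relation.
        themes = [tokens[d].text for rel, d in children[i]
                  if rel in THEME_RELS and tokens[d].is_protein]
        if not themes:
            continue  # a trigger without a Theme yields no event
        # Negation: the trigger has a "neg" dependent.
        negated = any(rel == "neg" for rel, _ in children[i])
        # Speculation: the trigger is a clausal complement of a cue verb.
        speculated = any(rel in CLAUSAL_RELS and tokens[h].lemma in SPECULATION_CUES
                         for rel, h in parents[i])
        events.append(Event(event_type, tok.text, themes, negated, speculated))
    return events


# Hand-written parse of "Results suggest that TNF does not induce STAT5."
TOKENS = [
    Token("Results", "result"), Token("suggest", "suggest"), Token("that", "that"),
    Token("TNF", "TNF", is_protein=True), Token("does", "do"), Token("not", "not"),
    Token("induce", "induce"), Token("STAT5", "STAT5", is_protein=True),
]
EDGES = [
    (1, "nsubj", 0), (1, "ccomp", 6), (6, "mark", 2), (6, "nsubj", 3),
    (6, "aux", 4), (6, "neg", 5), (6, "dobj", 7),
]

events = extract_events(TOKENS, EDGES)
for event in events:
    print(event)
```

In this sketch the same adjacency maps serve both argument extraction and scope detection, which mirrors the abstract's point that dependency heuristics for arguments extend to speculation and negation. A real system would take its edges from a dependency parser rather than a hand-written list.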
