Biomedical Event Extraction with Machine Learning
Björne, Jari (2014-08-07)
Turku Centre for Computer Science
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2014071832493
Description
Transferred from Doria
Abstract
Biomedical natural language processing (BioNLP) is a subfield of natural
language processing, an area of computational linguistics concerned with
developing programs that work with natural language: written texts and
speech. Biomedical relation extraction concerns the detection of semantic
relations such as protein-protein interactions (PPI) from scientific texts.
The aim is to enhance information retrieval by detecting relations between
concepts, not just individual concepts as with a keyword search.
In recent years, events have been proposed as a more detailed alternative
to simple pairwise PPI relations. Events provide a systematic, structural
representation for annotating the content of natural language texts. Events
are characterized by annotated trigger words, directed and typed arguments
and the ability to nest other events. For example, the sentence “Protein A
causes protein B to bind protein C” can be annotated with the nested event
structure CAUSE(A, BIND(B, C)). Converted to such formal representations,
the information of natural language texts can be used by computational
applications. Biomedical event annotations were introduced by the
BioInfer and GENIA corpora, and event extraction was popularized by the
BioNLP'09 Shared Task on Event Extraction.
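
The nested structure in the example sentence can be sketched as a small
data model. This is only an illustration of the idea, not the annotation
format of any corpus or of TEES: the class names, the `render` helper and
the argument role labels ("Cause", "Theme") are assumptions made for this
sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Entity:
    """A named entity such as a protein (hypothetical class)."""
    name: str


@dataclass
class Event:
    """An event with a type, a trigger word and typed arguments.

    Each argument is a (role, target) pair; the target may be an
    Entity or another Event, which is what allows nesting.
    """
    type: str
    trigger: str
    arguments: List[Tuple[str, Union["Entity", "Event"]]] = field(
        default_factory=list
    )


def render(node: Union[Entity, Event]) -> str:
    """Print an entity or event in the TYPE(arg, ...) notation."""
    if isinstance(node, Entity):
        return node.name
    inner = ", ".join(render(target) for _, target in node.arguments)
    return f"{node.type}({inner})"


# "Protein A causes protein B to bind protein C"
# Role labels below are illustrative assumptions.
bind = Event("BIND", "bind", [("Theme", Entity("B")), ("Theme", Entity("C"))])
cause = Event("CAUSE", "causes", [("Cause", Entity("A")), ("Theme", bind)])

print(render(cause))  # prints CAUSE(A, BIND(B, C))
```

Because an argument target can itself be an event, the same structure
covers both flat pairwise relations and arbitrarily nested events.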
In this thesis we present a method for automated event extraction, implemented
as the Turku Event Extraction System (TEES). A unified graph
format is defined for representing event annotations and the problem of
extracting complex event structures is decomposed into a number of independent
classification tasks. These classification tasks are solved using support
vector machine (SVM) and regularized least-squares (RLS) classifiers,
utilizing rich feature representations built from full dependency
parsing. Building on earlier work on pairwise relation extraction
and using a generalized graph representation, the resulting system is
capable of detecting binary relations as well as complex event structures.
We show that this event extraction system performs well, achieving
first place in the BioNLP'09 Shared Task on Event Extraction.
TEES subsequently achieved several first-place rankings in the BioNLP'11
and BioNLP'13 Shared Tasks and showed competitive performance in the
binary-relation Drug-Drug Interaction Extraction 2011 and 2013 shared
tasks.
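
The graph view and the decomposition into independent classification
tasks can be illustrated with a toy example. This is a minimal sketch
under stated assumptions and does not reproduce the TEES implementation:
the node and edge dictionaries, the role labels, the "neg" label and both
helper functions are hypothetical. It only shows how an annotated event
graph can be turned into one labeled example per candidate edge, which a
classifier such as an SVM could then be trained on.

```python
from itertools import product

# Nodes of the graph for "Protein A causes protein B to bind protein C":
# named entities and event trigger words, each with a type.
nodes = {
    "A": "Protein",
    "B": "Protein",
    "C": "Protein",
    "causes": "CAUSE",
    "bind": "BIND",
}

# Annotated (gold) directed, typed edges: trigger -> argument.
# Role labels are illustrative assumptions.
edges = {
    ("causes", "A"): "Cause",
    ("causes", "bind"): "Theme",
    ("bind", "B"): "Theme",
    ("bind", "C"): "Theme",
}


def candidate_edges(nodes):
    """Every ordered pair from a trigger to any other node."""
    triggers = [name for name, kind in nodes.items() if kind != "Protein"]
    return [(src, dst) for src, dst in product(triggers, nodes) if src != dst]


def edge_examples(nodes, edges):
    """One independent classification example per candidate edge.

    Pairs without an annotated edge receive the negative label "neg".
    """
    return [(pair, edges.get(pair, "neg")) for pair in candidate_edges(nodes)]


for pair, label in edge_examples(nodes, edges):
    print(pair, label)
```

In this sketch, trigger detection would be an analogous per-token
classification step; here the triggers are simply given. In a real
system, each candidate would be described by features, for example ones
derived from the dependency parse, rather than by its node names alone.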
The Turku Event Extraction System is published as a freely available
open-source project, documenting the research in detail as well as making
the method available for practical applications. In particular, in this thesis
we describe the application of the event extraction method to PubMed-scale
text mining, demonstrating that the developed approach not only performs
well but also generalizes and is applicable to large-scale, real-world
text mining projects.
Finally, we discuss related literature, summarize the contributions of the
work and present some thoughts on future directions for biomedical event
extraction. This thesis includes and builds on six original research publications.
The first of these introduces the analysis of dependency parses that
led to the development of TEES. The entries in the three BioNLP Shared
Tasks, as well as in the DDIExtraction 2011 task, are covered in four
publications, and the sixth one demonstrates the application of the system to
PubMed-scale text mining.
Collections
- Doctoral dissertations [2918]