A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia

Freitas, Andre; Carvalho, Danilo; Silva, João Carlos; O'Riain, Sean; Curry, Edward

Abstract:

Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts, however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous. These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a semantic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency, context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information extraction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.

Reference:

Andre Freitas, Danilo Carvalho, João Carlos Silva, Sean O'Riain, Edward Curry, "A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia", In 1st Workshop on the Web of Linked Entities (WoLE 2012), Boston, MA, pp. 70-81, 2012.

Bibtex Entry:

@inproceedings{Freitas,
abstract = {Most information extraction approaches available today have either focused on the extraction of simple relations or in scenarios where data extracted from texts should be normalized into a database schema or ontology. Some relevant information present in natural language texts, however, can be irregular, highly contextualized, with complex semantic dependency relations, poorly structured, and intrinsically ambiguous. These characteristics should also be supported by an information extraction approach. To cope with this scenario, this work introduces a semantic best-effort information extraction approach, which targets an information extraction scenario where text information is extracted under a pay-as-you-go data quality perspective, trading high-accuracy, schema consistency and terminological normalization for domain-independency, context capture, wider extraction scope and maximization of the text semantics extraction and representation. A semantic information extraction framework (Graphia) is implemented and evaluated over the Wikipedia corpus.},
address = {Boston, MA},
annote = {<a href="http://www.slideshare.net/andrenfreitas/wo-le-freitascarvalhopereiraoriaincurrytopdf">[slides]</a>},
author = {Freitas, Andre and Carvalho, Danilo and Silva, Jo{\~{a}}o Carlos and O'Riain, Sean and Curry, Edward},
booktitle = {1st Workshop on the Web of Linked Entities (WoLE 2012)},
keywords = {Information Extraction,Linked Data,RDF,Semantic Best-effort extraction,Semantic Networks,Semantic Web,Treo},
mendeley-tags = {Treo},
pages = {70--81},
title = {{A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs from Wikipedia}},
url = {http://www.edwardcurry.org/publications/wole2012.pdf},
year = {2012}
}