Knowledge extraction
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.
The RDB2RDF W3C group [1] has standardized R2RML, a language for extracting Resource Description Framework (RDF) data from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).
Overview
After the standardization of knowledge representation languages such as RDF and OWL, much research has been conducted in the area, especially regarding transforming relational databases into RDF, identity resolution, knowledge discovery and ontology learning. The general process uses traditional methods from information extraction and extract, transform, and load (ETL), which transform the data from the sources into structured formats. It is therefore useful to understand how these methods interact and learn from each other.
The following criteria can be used to categorize approaches in this topic (some of them only account for extraction from relational databases):[2]
| Criterion | Description |
|---|---|
| Source | Which data sources are covered: text, relational databases, XML, CSV |
| Exposition | How is the extracted knowledge made explicit (ontology file, semantic database)? How can it be queried? |
| Synchronization | Is the knowledge extraction process executed once to produce a dump, or is the result synchronized with the source (static or dynamic)? Are changes to the result written back (bi-directional)? |
| Reuse of vocabularies | Is the tool able to reuse existing vocabularies in the extraction? For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping vocabularies. |
| Automatization | The degree to which the extraction is assisted or automated: manual, GUI, semi-automatic, automatic. |
| Requires a domain ontology | Is a pre-existing ontology needed to map to? Either a mapping is created or a schema is learned from the source (ontology learning). |
Examples
Entity linking
- DBpedia Spotlight, OpenCalais, Dandelion dataTXT, the Zemanta API, Extractiv and PoolParty Extractor analyze free text via named-entity recognition, then disambiguate candidates via name resolution and link the found entities to the DBpedia knowledge repository[3] (see the Dandelion dataTXT demo, the DBpedia Spotlight web demo or the PoolParty Extractor demo).
President Obama called Wednesday on Congress to extend a tax break for students included in last year's economic stimulus package, arguing that the policy provides more generous assistance.
- As President Obama is linked to a DBpedia Linked Data resource, further information can be retrieved automatically, and a semantic reasoner can, for example, infer that the mentioned entity is of the type Person (using FOAF) and of the type Presidents of the United States (using YAGO). Counterexamples: methods that only recognize entities or link to Wikipedia articles and other targets that do not provide further retrieval of structured data and formal knowledge.
Relational databases to RDF
- Triplify, D2R Server, Ultrawrap, and Virtuoso RDF Views are tools that transform relational databases to RDF. They allow existing vocabularies and ontologies to be reused during the conversion. When transforming a typical relational table named users, one column (e.g. name) or an aggregation of columns (e.g. first_name and last_name) has to provide the URI of the created entity. Normally the primary key is used. Every other column can be extracted as a relation with this entity.[4] Properties with formally defined semantics are then used (and reused) to interpret the information. For example, a column in a user table called marriedTo can be defined as a symmetric relation, and a column homepage can be converted to the property foaf:homepage from the FOAF vocabulary, thus qualifying it as an inverse functional property. Each entry of the user table can then be made an instance of the class foaf:Person (ontology population). Additionally, domain knowledge (in the form of an ontology) could be created from the status_id, either by manually created rules (if status_id is 2, the entry belongs to the class Teacher) or by (semi-)automated methods (ontology learning). Here is an example transformation:
| Name | marriedTo | homepage | status_id |
|---|---|---|---|
| Peter | Mary | https://example.org/Peters_page | 1 |
| Claus | Eva | https://example.org/Claus_page | 2 |
:Peter :marriedTo :Mary .
:marriedTo a owl:SymmetricProperty .
:Peter foaf:homepage <https://example.org/Peters_page> .
:Peter a foaf:Person .
:Peter a :Student .
:Claus a :Teacher .
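The transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any of the tools named above: the in-memory `USERS` list, the helper names `row_to_triples` and `table_to_turtle`, and the rule that `status_id` 1 means `:Student` (read off the example output) are assumptions made for the sketch, and prefix declarations are omitted as in the example.

```python
# Hypothetical in-memory copy of the `users` table from the example above.
USERS = [
    {"Name": "Peter", "marriedTo": "Mary",
     "homepage": "https://example.org/Peters_page", "status_id": 1},
    {"Name": "Claus", "marriedTo": "Eva",
     "homepage": "https://example.org/Claus_page", "status_id": 2},
]

# Manually created rules: status_id -> class.
# 2 -> Teacher comes from the text; 1 -> Student is an assumption
# inferred from the example output.
STATUS_CLASSES = {1: ":Student", 2: ":Teacher"}


def row_to_triples(row):
    """Return (subject, predicate, object) strings for one table row."""
    subject = f":{row['Name']}"  # the Name column provides the entity URI
    triples = [
        (subject, "a", "foaf:Person"),                      # ontology population
        (subject, ":marriedTo", f":{row['marriedTo']}"),    # plain relation
        (subject, "foaf:homepage", f"<{row['homepage']}>"),  # reused vocabulary
    ]
    status_class = STATUS_CLASSES.get(row["status_id"])
    if status_class is not None:
        triples.append((subject, "a", status_class))
    return triples


def table_to_turtle(rows):
    """Serialize all rows as Turtle-like statements, one per line."""
    lines = [":marriedTo a owl:SymmetricProperty ."]
    for row in rows:
        lines.extend(f"{s} {p} {o} ." for s, p, o in row_to_triples(row))
    return "\n".join(lines)


print(table_to_turtle(USERS))
```

Keeping the status rules in a dictionary separates the manually created domain knowledge from the generic column-to-property conversion, so the rules could later be replaced by learned ones.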
Extraction from structured sources to RDF
1:1 Mapping from RDB Tables/Views to RDF Entities/Attributes/Values
When building an RDB representation of a problem domain, the starting point is frequently an entity-relationship diagram (ERD). Typically, each entity is represented as a database table, each attribute of the entity becomes a column in that table, and relationships between entities are indicated by foreign keys. Each table typically defines a particular class of entity, each column one of its attributes. Each row in the table describes an entity instance, uniquely identified by a primary key. The table rows collectively describe an entity set. In an equivalent RDF representation of the same entity set:
- Each column in the table is an attribute (i.e., predicate)
- Each column value is an attribute value (i.e., object)
- Each row key represents an entity ID (i.e., subject)
- Each row represents an entity instance
- Each row (entity instance) is represented in RDF by a collection of triples with a common subject (entity ID).
So, to render an equivalent view based on RDF semantics, the basic mapping algorithm would be as follows:
- create an RDFS class for each table
- convert all primary keys and foreign keys into IRIs
- assign a predicate IRI to each column
- assign an rdf:type predicate for each row, linking it to an RDFS class IRI corresponding to the table
- for each column that is part of neither a primary nor a foreign key, construct a triple containing the primary key IRI as the subject, the column IRI as the predicate and the column's value as the object.
An early mention of this basic or direct mapping can be found in Tim Berners-Lee's comparison of the ER model to the RDF model.[4]
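The basic mapping algorithm listed above can be sketched with Python's built-in `sqlite3` module. The base IRI `http://example.org/`, the IRI layout (`table/key` for rows, `table#column` for predicates) and the function name `direct_mapping` are assumptions chosen for this illustration, not part of any standard. The sketch also emits a triple for each foreign key column, pointing at the referenced row's IRI, and assumes that foreign keys reference single-column primary keys.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical base IRI


def direct_mapping(conn, table):
    """Return (subject, predicate, object) triples for every row of `table`.

    `table` is interpolated into SQL, so it must come from a trusted source.
    """
    cur = conn.cursor()
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = cur.execute(f"PRAGMA table_info({table})").fetchall()
    names = [c[1] for c in columns]
    pk_cols = [c[1] for c in sorted(columns, key=lambda c: c[5]) if c[5] > 0]
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    fk_rows = cur.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    fks = {fk[3]: fk[2] for fk in fk_rows}

    class_iri = f"<{BASE}{table}>"
    triples = [(class_iri, "rdf:type", "rdfs:Class")]  # one class per table
    for row in cur.execute(f"SELECT * FROM {table}").fetchall():
        record = dict(zip(names, row))
        key = "/".join(str(record[c]) for c in pk_cols)
        subject = f"<{BASE}{table}/{key}>"              # primary key -> IRI
        triples.append((subject, "rdf:type", class_iri))
        for col in names:
            value = record[col]
            if col in pk_cols or value is None:
                continue
            predicate = f"<{BASE}{table}#{col}>"        # one predicate per column
            if col in fks:
                obj = f"<{BASE}{fks[col]}/{value}>"     # foreign key -> IRI
            else:
                obj = f'"{value}"'                      # plain literal
            triples.append((subject, predicate, obj))
    return triples


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT,
                      dept_id INTEGER REFERENCES dept(id));
    INSERT INTO dept VALUES (10, 'Research');
    INSERT INTO emp VALUES (1, 'Peter', 10);
""")
for triple in direct_mapping(conn, "emp"):
    print(*triple, ".")
```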
Complex mappings of relational databases to RDF
The 1:1 mapping mentioned above exposes the legacy data as RDF in a straightforward way; additional refinements can be employed to improve the usefulness of the RDF output with respect to the given use cases. Normally, information is lost during the transformation of an entity-relationship diagram (ERD) to relational tables (details can be found in object-relational impedance mismatch) and has to be reverse engineered. From a conceptual view, approaches for extraction can come from two directions. The first direction tries to extract or learn an OWL schema from the given database schema. Early approaches used a fixed number of manually created mapping rules to refine the 1:1 mapping.[5][6][7] More elaborate methods employ heuristics or learning algorithms to induce schematic information (these methods overlap with ontology learning). While some approaches try to extract the information from the structure inherent in the SQL schema[8] (analysing e.g. foreign keys), others analyse the content and the values in the tables to create conceptual hierarchies[9] (e.g. columns with few values are candidates for becoming categories). The second direction tries to map the schema and its contents to a pre-existing domain ontology (see also: ontology alignment). Often, however, a suitable domain ontology does not exist and has to be created first.
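The content-based heuristic mentioned above (columns with few values are candidates for becoming categories) can be illustrated as follows. This is only a sketch of the idea, not the method of the cited work; the function name `category_candidates`, the thresholds `max_distinct` and `min_rows`, and the sample rows are all hypothetical.

```python
def category_candidates(rows, max_distinct=3, min_rows=4):
    """Return {column: sorted distinct values} for low-cardinality columns.

    A column qualifies if it has more than one but at most `max_distinct`
    distinct values, and fewer distinct values than there are rows.
    """
    if len(rows) < min_rows:
        return {}  # too little data to judge cardinality
    candidates = {}
    for column in rows[0]:
        values = {row[column] for row in rows}
        if 1 < len(values) <= max_distinct and len(values) < len(rows):
            candidates[column] = sorted(values)
    return candidates


# Hypothetical table content: `name` is unique per row, `status_id` repeats.
ROWS = [
    {"name": "Peter", "status_id": 1},
    {"name": "Claus", "status_id": 2},
    {"name": "Eva", "status_id": 2},
    {"name": "Mary", "status_id": 1},
]
print(category_candidates(ROWS))  # {'status_id': [1, 2]}
```

Each distinct value of a candidate column could then be proposed as a subclass (for example one class per `status_id`), leaving the naming of those classes to a human or to a further learning step.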
XML
As XML is structured as a tree, any data can be easily represented in RDF, which is structured as a graph. XML2RDF is one example of an approach that uses RDF blank nodes and transforms XML elements and attributes to RDF properties. The topic, however, is more complex than in the case of relational databases. In a relational table the primary key is an ideal candidate for becoming the subject of the extracted triples. An XML element, however, can be transformed, depending on the context, into a subject, a predicate or an object of a triple. XSLT can be used as a standard transformation language to manually convert XML to RDF.
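One possible reading of the blank-node approach can be sketched with Python's standard `xml.etree.ElementTree` module. This is not the algorithm of the XML2RDF tool; the function name `xml_to_triples`, the `:`-prefixed property names and the rule for deciding when an element becomes a literal are assumptions made for the illustration.

```python
import itertools
import xml.etree.ElementTree as ET


def xml_to_triples(xml_text):
    """Map an XML document to triples, one blank node per complex element."""
    counter = itertools.count(1)
    triples = []

    def visit(element):
        node = f"_:b{next(counter)}"  # the element itself becomes a blank node
        # Attributes become literal-valued properties of that node.
        for name, value in element.attrib.items():
            triples.append((node, f":{name}", f'"{value}"'))
        for child in element:
            if len(child) == 0 and not child.attrib:
                # Text-only leaf element: its tag is the predicate,
                # its text the literal object.
                text = (child.text or "").strip()
                triples.append((node, f":{child.tag}", f'"{text}"'))
            else:
                # Complex child: its tag is the predicate,
                # a new blank node the object.
                triples.append((node, f":{child.tag}", visit(child)))
        return node

    visit(ET.fromstring(xml_text))
    return triples


DOC = '<person id="p1"><name>Peter</name><address city="Berlin"/></person>'
for triple in xml_to_triples(DOC):
    print(*triple, ".")
```

The same element tag thus appears as a predicate, while the element's content becomes either an object literal or a new subject, which reflects the context dependence described above.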
Survey of methods / tools
| Name | Data Source | Data Exposition | Data Synchronisation | Mapping Language | Vocabulary Reuse | Mapping Automat. | Req. Domain Ontology | Uses GUI |
|---|---|---|---|---|---|---|---|---|
| A Direct Mapping of Relational Data to RDF | Relational Data | SPARQL/ETL | dynamic | — | false | automatic | false | false |
| CSV2RDF4LOD | CSV | ETL | static | RDF | true | manual | false | false |
| CoNLL-RDF | TSV, CoNLL | SPARQL/ RDF stream | static | none | true | automatic (domain-specific, for use cases in language technology, preserves relations between rows) | false | false |
| Convert2RDF | Delimited text file | ETL | static | RDF/DAML | true | manual | false | true |
| D2R Server | RDB | SPARQL | bi-directional | D2R Map | true | manual | false | false |
| DartGrid | RDB | own query language | dynamic | Visual Tool | true | manual | false | true |
| DataMaster | RDB | ETL | static | proprietary | true | manual | true | true |
| Google Refine's RDF Extension | CSV, XML | ETL | static | none | | semi-automatic | false | true |
| Krextor | XML | ETL | static | xslt | true | manual | true | false |
| MAPONTO | RDB | ETL | static | proprietary | true | manual | true | false |
| METAmorphoses | RDB | ETL | static | proprietary xml based mapping language | true | manual | false | true |
| MappingMaster | CSV | ETL | static | MappingMaster | true | GUI | false | true |
| ODEMapster | RDB | ETL | static | proprietary | true | manual | true | true |
| OntoWiki CSV Importer Plug-in - DataCube & Tabular | CSV | ETL | static | The RDF Data Cube Vocabulary | true | semi-automatic | false | true |
| Poolparty Extraktor (PPX) | XML, Text | LinkedData | dynamic | RDF (SKOS) | true | semi-automatic | true | false |
| RDBToOnto | RDB | ETL | static | none | false | automatic, the user furthermore has the chance to fine-tune results | false | true |
| RDF 123 | CSV | ETL | static | false | false | manual | false | true |
| RDOTE | RDB | ETL | static | SQL | true | manual | true | true |
| Relational.OWL | RDB | ETL | static | none | false | automatic | false | false |
| T2LD | CSV | ETL | static | false | false | automatic | false | false |
| The RDF Data Cube Vocabulary | Multidimensional statistical data in spreadsheets | | | Data Cube Vocabulary | true | manual | false | |
| TopBraid Composer | CSV | ETL | static | SKOS | false | semi-automatic | false | true |
| Triplify | RDB | LinkedData | dynamic | SQL | true | manual | false | false |
| Ultrawrap | RDB | SPARQL/ETL | dynamic | R2RML | true | semi-automatic | false | true |
| Virtuoso RDF Views | RDB | SPARQL | dynamic | Meta Schema Language | true | semi-automatic | false | true |
| Virtuoso Sponger | structured and semi-structured data sources | SPARQL | dynamic | Virtuoso PL & XSLT | true | semi-automatic | false | false |
| VisAVis | RDB | RDQL | dynamic | SQL | true | manual | true | true |
| XLWrap: Spreadsheet to RDF | CSV | ETL | static | TriG Syntax | true | manual | false | false |
| XML to RDF | XML | ETL | static | false | false | automatic | false | false |
Extraction from natural language sources
The largest portion of information contained in business documents (about 80%[10]) is encoded in natural language and is therefore unstructured. Because unstructured data is a greater challenge for knowledge extraction, more sophisticated methods are required, and these generally yield worse results than extraction from structured data. The potential for a massive acquisition of extracted knowledge, however, should compensate for the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information in which the data is given in an unstructured fashion as plain text. If the given text is additionally embedded in a markup document (e.g. an HTML document), the mentioned systems normally remove the markup elements automatically.
Linguistic annotation / natural language processing (NLP)
As a preprocessing step to knowledge extraction, it can be necessary to perform linguistic annotation with one or more NLP tools. Individual modules in an NLP workflow normally build on tool-specific formats for input and output, but in the context of knowledge extraction, structured formats for representing linguistic annotations have been applied.
Typical NLP tasks relevant to knowledge extraction include:
- part-of-speech (POS) tagging
- lemmatization (LEMMA) or stemming (STEM)
- word sense disambiguation (WSD, related to semantic annotation below)
- named entity recognition (NER, also see IE below)
- syntactic parsing, often adopting syntactic dependencies (DEP)
- shallow syntactic parsing (CHUNK): if performance is an issue, chunking yields a fast extraction of nominal and other phrases
- anaphor resolution (see coreference resolution in IE below, but seen here as the task to create links between textual mentions rather than between the mention of an entity and an abstract representation of the entity)
- semantic role labelling (SRL, related to relation extraction; not to be confused with semantic annotation as described below)
- discourse parsing (relations between different sentences, rarely used in real-world applications)
In NLP, such data is typically represented in TSV formats (CSV formats with TAB as separators), often referred to as CoNLL formats. For knowledge extraction workflows, RDF views on such data have been created in accordance with the following community standards:
- NLP Interchange Format (NIF, for many frequent types of annotation)[11][12]
- Web Annotation (WA, often used for entity linking)[13]
- CoNLL-RDF (for annotations originally represented in TSV formats)[14][15]
Other, platform-specific formats include
- LAPPS Interchange Format (LIF, used in the LAPPS Grid)[16][17]
- NLP Annotation Format (NAF, used in the NewsReader workflow management system)[18][19]
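To illustrate what an RDF view on TSV-style annotations can look like, the following sketch turns a small CoNLL-like sample into triples. It does not reproduce the actual CoNLL-RDF, NIF or Web Annotation data models: the column layout in `COLUMNS`, the token identifiers of the form `:s1_1`, and the use of `conll:` and `nif:nextWord` as property names are assumptions made for the example.

```python
COLUMNS = ["ID", "WORD", "LEMMA", "POS"]  # assumed column layout

# Two sentences separated by a blank line, one token per row, TAB-separated.
SAMPLE = (
    "1\tObama\tObama\tPROPN\n"
    "2\tcalled\tcall\tVERB\n"
    "\n"
    "1\tHe\the\tPRON\n"
)


def conll_to_triples(text, columns=COLUMNS, base=":s"):
    """Return triples for each token row, keeping the order of rows."""
    triples = []
    sentence = 1
    previous = None  # previous token in the current sentence
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if previous is not None:
                sentence += 1
                previous = None
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        row = dict(zip(columns, line.split("\t")))
        token = f"{base}{sentence}_{row['ID']}"
        # One literal-valued property per annotation column.
        for column in columns[1:]:
            triples.append((token, f"conll:{column}", f'"{row[column]}"'))
        # Link consecutive tokens so the relation between rows is preserved.
        if previous is not None:
            triples.append((previous, "nif:nextWord", token))
        previous = token
    return triples


for triple in conll_to_triples(SAMPLE):
    print(*triple, ".")
```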
Traditional information extraction (IE)
Traditional information extraction[20] is a technology of natural language processing that extracts information from (typically) natural language texts and structures it in a suitable manner. The kinds of information to be identified must be specified in a model before beginning the process, which is why the whole process of traditional information extraction is domain dependent. IE is split into the following five subtasks:
- Named entity recognition (NER)
- Coreference resolution (CO)
- Template element construction (TE)
- Template relation construction (TR)
- Template scenario production (ST)
The task of named entity recognition is to recognize and categorize all named entities contained in a text (assignment of a named entity to a predefined category). This works by applying grammar-based methods or statistical models.
Coreference resolution identifies equivalent entities, which were recognized by NER, within a text. There are two relevant kinds of equivalence relationship. The first relates two differently represented entities (e.g. IBM Europe and IBM), and the second relates an entity to its anaphoric references (e.g. it and IBM). Both kinds can be recognized by coreference resolution.
During template element construction, the IE system identifies descriptive properties of the entities recognized by NER and CO. These properties correspond to ordinary qualities like red or big.
Template relation construction identifies relations that exist between the template elements. These relations can be of several kinds, such as works-for or located-in, with the restriction that both domain and range correspond to entities.
In template scenario production, events described in the text are identified and structured with respect to the entities recognized by NER and CO and the relations identified by TR.
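A toy version of two of these subtasks, NER and TR, can be sketched as follows. It uses a dictionary lookup in place of the grammar-based or statistical models mentioned above and hand-written sentence patterns in place of a real relation extractor; the `GAZETTEER` entries, the `RELATION_PATTERNS` and both function names are hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> predefined category.
GAZETTEER = {
    "IBM": "Organization",
    "IBM Europe": "Organization",
    "Peter": "Person",
    "Berlin": "Location",
}

# Hypothetical surface patterns for two relation types.
RELATION_PATTERNS = {
    "works-for": "{a} works for {b}",
    "located-in": "{a} is located in {b}",
}


def recognize_entities(text):
    """NER by dictionary lookup, preferring the longest name at each position."""
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return [(m.group(), GAZETTEER[m.group()]) for m in pattern.finditer(text)]


def extract_relations(text):
    """TR: relate pairs of recognized entities that fill a known pattern."""
    surfaces = sorted({surface for surface, _ in recognize_entities(text)})
    relations = []
    for a in surfaces:
        for b in surfaces:
            if a == b:
                continue
            for name, template in RELATION_PATTERNS.items():
                regex = template.format(a=re.escape(a), b=re.escape(b))
                if re.search(regex, text):
                    relations.append((a, name, b))
    return sorted(relations)


TEXT = "Peter works for IBM Europe. IBM Europe is located in Berlin."
print(recognize_entities(TEXT))
print(extract_relations(TEXT))
```

Because only recognized entities are tried as arguments, both the domain and the range of every extracted relation correspond to entities, as required for TR.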
Ontology-based information extraction (OBIE)
Ontology-based information extraction[10] is a subfield of information extraction in which at least one ontology is used to guide the process of information extraction from natural language text. The OBIE system uses methods of traditional information extraction to identify concepts, instances and relations of the used ontologies in the text, which are structured into an ontology after the process. Thus, the input ontologies constitute the model of information to be extracted.[21]
Ontology learning (OL)
Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms from natural language text. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
Semantic annotation (SA)
During semantic annotation,[22] natural language text is augmented with metadata (often represented in RDFa) that should make the semantics of the contained terms machine-understandable. In this process, which is generally semi-automatic, knowledge is extracted in the sense that a link is established between lexical terms and, for example, concepts from ontologies. Thus, knowledge is gained about which meaning of a term was intended in the processed context, and the meaning of the text is therefore grounded in machine-readable data with the ability to draw inferences. Semantic annotation is typically split into the following two subtasks.
At the terminology extraction level, lexical terms are extracted from the text. For this purpose, a tokenizer first determines the word boundaries and resolves abbreviations. Afterwards, terms that correspond to a concept are extracted from the text with the help of a domain-specific lexicon, so that they can be linked during entity linking.
In entity linking,[23] a link is established between the extracted lexical terms from the source text and the concepts from an ontology or knowledge base such as DBpedia. For this, candidate concepts matching the several meanings of a term are detected with the help of a lexicon. Finally, the context of the terms is analyzed to determine the most appropriate disambiguation and to assign the term to the correct concept.
Note that "semantic annotation" in the context of knowledge extraction is not to be confused with semantic parsing as understood in natural language processing (also referred to as "semantic annotation"): semantic parsing aims at a complete, machine-readable representation of natural language, whereas semantic annotation in the sense of knowledge extraction tackles only a very elementary aspect of that.
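The two entity-linking steps described above (candidate detection via a lexicon, then disambiguation by context) can be sketched with a simple word-overlap score. This is a deliberately naive illustration, not how DBpedia Spotlight or any other listed tool works; the `LEXICON` contents, the context word sets and the concept identifiers are made up for the example.

```python
import re

# Hypothetical lexicon: term -> candidate concepts, each with typical
# context words. The identifiers are illustrative only.
LEXICON = {
    "jaguar": {
        "dbr:Jaguar": {"animal", "cat", "jungle", "prey"},
        "dbr:Jaguar_Cars": {"car", "engine", "drive", "british"},
    },
}


def link_entity(term, context, lexicon=LEXICON):
    """Return the candidate concept whose context words best match `context`.

    Returns None if the lexicon has no candidates for the term.
    """
    candidates = lexicon.get(term.lower(), {})
    if not candidates:
        return None
    words = set(re.findall(r"\w+", context.lower()))
    # Score each candidate by the number of shared context words;
    # sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=lambda c: len(candidates[c] & words))


print(link_entity("Jaguar", "The jaguar stalked its prey in the jungle"))
print(link_entity("Jaguar", "He likes to drive a British car"))
```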
Tools
The following criteria can be used to categorize tools that extract knowledge from natural language text.
| Criterion | Description |
|---|---|
| Source | Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)? |
| Access Paradigm | Can the tool query the data source, or does it require a whole dump for the extraction process? |
| Data Synchronization | Is the result of the extraction process synchronized with the source? |
| Uses Output Ontology | Does the tool link the result with an ontology? |
| Mapping Automation | How automated is the extraction process (manual, semi-automatic or automatic)? |
| Requires Ontology | Does the tool need an ontology for the extraction? |
| Uses GUI | Does the tool offer a graphical user interface? |
| Approach | Which approach (IE, OBIE, OL or SA) is used by the tool? |
| Extracted Entities | Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool? |
| Applied Techniques | Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)? |
| Output Model | Which model is used to represent the result of the tool (e.g. RDF or OWL)? |
| Supported Domains | Which domains are supported (e.g. economy or biology)? |
| Supported Languages | Which languages can be processed (e.g. English or German)? |
The following table characterizes some tools for knowledge extraction from natural language sources.
| Name | Source | Access Paradigm | Data Synchronization | Uses Output Ontology | Mapping Automation | Requires Ontology | Uses GUI | Approach | Extracted Entities | Applied Techniques | Output Model | Supported Domains | Supported Languages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [1] [24] | plain text, HTML, XML, SGML | dump | no | yes | automatic | yes | yes | IE | named entities, relationships, events | linguistic rules | proprietary | domain-independent | English, Spanish, Arabic, Chinese, Indonesian |
| AlchemyAPI [25] | plain text, HTML | automatic | yes | SA | multilingual | ||||||||
| ANNIE [26] | plain text | dump | yes | yes | IE | finite state algorithms | multilingual | ||||||
| ASIUM [27] | plain text | dump | semi-automatic | yes | OL | concepts, concept hierarchy | NLP, clustering | ||||||
| Attensity Exhaustive Extraction [28] | | | | | automatic | | | IE | named entities, relationships, events | NLP | | | |
| Dandelion API | plain text, HTML, URL | REST | no | no | automatic | no | yes | SA | named entities, concepts | statistical methods | JSON | domain-independent | multilingual |
| DBpedia Spotlight [29] | plain text, HTML | dump, SPARQL | yes | yes | automatic | no | yes | SA | annotation to each word, annotation to non-stopwords | NLP, statistical methods, machine learning | RDFa | domain-independent | English |
| EntityClassifier.eu | plain text, HTML | dump | yes | yes | automatic | no | yes | IE, OL, SA | annotation to each word, annotation to non-stopwords | rule-based grammar | XML | domain-independent | English, German, Dutch |
| FRED [30] | plain text | dump, REST API | yes | yes | automatic | no | yes | IE, OL, SA, ontology design patterns, frame semantics | (multi-)word NIF or EarMark annotation, predicates, instances, compositional semantics, concept taxonomies, frames, semantic roles, periphrastic relations, events, modality, tense, entity linking, event linking, sentiment | NLP, machine learning, heuristic rules | RDF/OWL | domain-independent | English, other languages via translation |
| iDocument [31] | HTML, PDF, DOC | SPARQL | yes | yes | OBIE | instances, property values | NLP | personal, business | |||||
| NetOwl Extractor [32] | plain text, HTML, XML, SGML, PDF, MS Office | dump | no | yes | automatic | yes | yes | IE | named entities, relationships, events | NLP | XML, JSON, RDF-OWL, others | multiple domains | English, Arabic, Chinese (Simplified and Traditional), French, Korean, Persian (Farsi and Dari), Russian, Spanish |
| OntoGen [33] | semi-automatic | yes | OL | concepts, concept hierarchy, non-taxonomic relations, instances | NLP, machine learning, clustering | ||||||||
| OntoLearn [34] | plain text, HTML | dump | no | yes | automatic | yes | no | OL | concepts, concept hierarchy, instances | NLP, statistical methods | proprietary | domain-independent | English |
| OntoLearn Reloaded | plain text, HTML | dump | no | yes | automatic | yes | no | OL | concepts, concept hierarchy, instances | NLP, statistical methods | proprietary | domain-independent | English |
| OntoSyphon [35] | HTML, PDF, DOC | dump, search engine queries | no | yes | automatic | yes | no | OBIE | concepts, relations, instances | NLP, statistical methods | RDF | domain-independent | English |
| ontoX [36] | plain text | dump | no | yes | semi-automatic | yes | no | OBIE | instances, datatype property values | heuristic-based methods | proprietary | domain-independent | language-independent |
| OpenCalais | plain text, HTML, XML | dump | no | yes | automatic | yes | no | SA | annotation to entities, annotation to events, annotation to facts | NLP, machine learning | RDF | domain-independent | English, French, Spanish |
| PoolParty Extractor [37] | plain text, HTML, DOC, ODT | dump | no | yes | automatic | yes | yes | OBIE | named entities, concepts, relations, concepts that categorize the text, enrichments | NLP, machine learning, statistical methods | RDF, OWL | domain-independent | English, German, Spanish, French |
| Rosoka | plain text, HTML, XML, SGML, PDF, MS Office | dump | Yes | Yes | Automatic | no | Yes | IE | named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification | NLP, machine learning | XML, JSON, POJO, RDF | multiple domains | Multilingual 200+ Languages |
| SCOOBIE | plain text, HTML | dump | no | yes | automatic | no | no | OBIE | instances, property values, RDFS types | NLP, machine learning | RDF, RDFa | domain-independent | English, German |
| SemTag [38][39] | HTML | dump | no | yes | automatic | yes | no | SA | | machine learning | database record | domain-independent | language-independent |
| smart FIX | plain text, HTML, PDF, DOC, e-Mail | dump | yes | no | automatic | no | yes | OBIE | named entities | NLP, machine learning | proprietary | domain-independent | English, German, French, Dutch, Polish |
| Text2Onto [40] | plain text, HTML, PDF | dump | yes | no | semi-automatic | yes | yes | OL | concepts, concept hierarchy, non-taxonomic relations, instances, axioms | NLP, statistical methods, machine learning, rule-based methods | OWL | domain-independent | English, German, Spanish |
| Text-To-Onto [41] | plain text, HTML, PDF, PostScript | dump | semi-automatic | yes | yes | OL | concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations | NLP, machine learning, clustering, statistical methods | German | ||||
| ThatNeedle | Plain Text | dump | automatic | no | concepts, relations, hierarchy | NLP, proprietary | JSON | multiple domains | English | ||||
| The Wiki Machine [42] | plain text, HTML, PDF, DOC | dump | no | yes | automatic | yes | yes | SA | annotation to proper nouns, annotation to common nouns | machine learning | RDFa | domain-independent | English, German, Spanish, French, Portuguese, Italian, Russian |
| ThingFinder [43] | | | | | | | | IE | named entities, relationships, events | | | | multilingual |
Knowledge discovery
Knowledge discovery describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data.[44] It is often described as deriving knowledge from the input data. Knowledge discovery developed out of the data mining domain, and is closely related to it both in terms of methodology and terminology.[45]
The most well-known branch of data mining is knowledge discovery, also known as knowledge discovery in databases (KDD). Like many other forms of knowledge discovery, it creates abstractions of the input data. The knowledge obtained through the process may become additional data that can be used for further discovery. Often the outcomes of knowledge discovery are not actionable; techniques like domain-driven data mining[46] aim to discover and deliver actionable knowledge and insights.
Another promising application of knowledge discovery is in the area of software modernization, weakness discovery and compliance, which involves understanding existing software artifacts. This process is related to the concept of reverse engineering. Usually the knowledge obtained from existing software is presented in the form of models to which specific queries can be made when necessary. An entity relationship is a frequent format for representing knowledge obtained from existing software. The Object Management Group (OMG) developed the Knowledge Discovery Metamodel (KDM) specification, which defines an ontology for software assets and their relationships for the purpose of performing knowledge discovery in existing code. Knowledge discovery from existing software systems, also known as software mining, is closely related to data mining, since existing software artifacts contain enormous value for risk management and business value, which is key to the evaluation and evolution of software systems. Instead of mining individual data sets, software mining focuses on metadata, such as process flows (e.g. data flows, control flows, and call maps), architecture, database schemas, and business rules, terms and processes.
Input data
Output formats
See also
Further reading
- Chicco, D; Masseroli, M (2016). "Ontology-based prediction and prioritization of gene functional annotations". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 13 (2): 248–260. doi:10.1109/TCBB.2015.2459694. PMID 27045825. S2CID 2795344.
References
- RDB2RDF Working Group, Website: http://www.w3.org/2001/sw/rdb2rdf/, charter: http://www.w3.org/2009/08/rdb2rdf-charter, R2RML: RDB to RDF Mapping Language: http://www.w3.org/TR/r2rml/
- LOD2 EU Deliverable 3.1.1 Knowledge Extraction from Structured Sources http://static.lod2.eu/Deliverables/deliverable-3.1.1.pdf Archived 2011-08-27 at the Wayback Machine
- "Life in the Linked Data Cloud". www.opencalais.com. Archived from the original on 2009-11-24. Retrieved 2009-11-10.
  "Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format."
- Tim Berners-Lee (1998), "Relational Databases on the Semantic Web". Retrieved: February 20, 2011.
- Hu et al. (2007), "Discovering Simple Mappings Between Relational Database Schemas and Ontologies", In Proc. of 6th International Semantic Web Conference (ISWC 2007), 2nd Asian Semantic Web Conference (ASWC 2007), LNCS 4825, pages 225‐238, Busan, Korea, 11‐15 November 2007. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.6934&rep=rep1&type=pdf
- R. Ghawi and N. Cullot (2007), "Database-to-Ontology Mapping Generation for Semantic Interoperability". In Third International Workshop on Database Interoperability (InterDB 2007). http://le2i.cnrs.fr/IMG/publications/InterDB07-Ghawi.pdf
- Li et al. (2005) "A Semi-automatic Ontology Acquisition Method for the Semantic Web", WAIM, volume 3739 of Lecture Notes in Computer Science, page 209-220. Springer. doi:10.1007/11563952_19
- Tirmizi et al. (2008), "Translating SQL Applications to the Semantic Web", Lecture Notes in Computer Science, Volume 5181/2008 (Database and Expert Systems Applications). http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=15E8AB2A37BD06DAE59255A1AC3095F0?doi=10.1.1.140.3169&rep=rep1&type=pdf
- Farid Cerbah (2008). "Learning Highly Structured Semantic Repositories from Relational Databases", The Semantic Web: Research and Applications, volume 5021 of Lecture Notes in Computer Science, Springer, Berlin / Heidelberg http://www.tao-project.eu/resources/publications/cerbah-learning-highly-structured-semantic-repositories-from-relational-databases.pdf Archived 2011-07-20 at the Wayback Machine
- Wimalasuriya, Daya C.; Dou, Dejing (2010). "Ontology-based information extraction: An introduction and a survey of current approaches", Journal of Information Science, 36(3), p. 306 - 323, http://ix.cs.uoregon.edu/~dou/research/papers/jis09.pdf (retrieved: 18.06.2012).
- "NLP Interchange Format (NIF) 2.0 - Overview and Documentation". persistence.uni-leipzig.org. Retrieved 2020-06-05.
- Hellmann, Sebastian; Lehmann, Jens; Auer, Sören; Brümmer, Martin (2013). "Integrating NLP Using Linked Data". In Alani, Harith; Kagal, Lalana; Fokoue, Achille; Groth, Paul; Biemann, Chris; Parreira, Josiane Xavier; Aroyo, Lora; Noy, Natasha; Welty, Chris (eds.). The Semantic Web – ISWC 2013. Lecture Notes in Computer Science. Vol. 7908. Berlin, Heidelberg: Springer. pp. 98–113. doi:10.1007/978-3-642-41338-4_7. ISBN 978-3-642-41338-4.
- Verspoor, Karin; Livingston, Kevin (July 2012). "Towards Adaptation of Linguistic Annotations to Scholarly Annotation Formalisms on the Semantic Web". Proceedings of the Sixth Linguistic Annotation Workshop. Jeju, Republic of Korea: Association for Computational Linguistics: 75–84.
- acoli-repo/conll-rdf, ACoLi, 2020-05-27, retrieved 2020-06-05
- Chiarcos, Christian; Fäth, Christian (2017). "CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way". In Gracia, Jorge; Bond, Francis; McCrae, John P.; Buitelaar, Paul; Chiarcos, Christian; Hellmann, Sebastian (eds.). Language, Data, and Knowledge. Lecture Notes in Computer Science. Vol. 10318. Cham: Springer International Publishing. pp. 74–88. doi:10.1007/978-3-319-59888-8_6. ISBN 978-3-319-59888-8.
- Verhagen, Marc; Suderman, Keith; Wang, Di; Ide, Nancy; Shi, Chunqi; Wright, Jonathan; Pustejovsky, James (2016). "The LAPPS Interchange Format". In Murakami, Yohei; Lin, Donghui (eds.). Worldwide Language Service Infrastructure. Lecture Notes in Computer Science. Vol. 9442. Cham: Springer International Publishing. pp. 33–47. doi:10.1007/978-3-319-31468-6_3. ISBN 978-3-319-31468-6.
- "The Language Application Grid | A web service platform for natural language processing development and research". Retrieved 2020-06-05.
- ^ newsreader/NAF, NewsReader, 2020-05-25, retrieved 2020-06-05
- ^ Vossen, Piek; Agerri, Rodrigo; Aldabe, Itziar; Cybulska, Agata; van Erp, Marieke; Fokkens, Antske; Laparra, Egoitz; Minard, Anne-Lyse; Palmero Aprosio, Alessio; Rigau, German; Rospocher, Marco (2016-10-15). "NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news". Knowledge-Based Systems. 110: 60–85. doi:10.1016/j.knosys.2016.07.013. ISSN 0950-7051.
- ^ Cunningham, Hamish (2005). "Information Extraction, Automatic", Encyclopedia of Language and Linguistics, 2, p. 665 - 677, http://gate.ac.uk/sale/ell2/ie/main.pdf (retrieved: 18.06.2012).
- ^ Chicco, D; Masseroli, M (2016). "Ontology-based prediction and prioritization of gene functional annotations". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 13 (2): 248–260. doi:10.1109/TCBB.2015.2459694. PMID 27045825. S2CID 2795344.
- ^ Erdmann, M.; Maedche, Alexander; Schnurr, H.-P.; Staab, Steffen (2000). "From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools", Proceedings of the COLING, http://www.ida.liu.se/ext/epa/cis/2001/002/paper.pdf (retrieved: 18.06.2012).
- ^ Rao, Delip; McNamee, Paul; Dredze, Mark (2011). "Entity Linking: Finding Extracted Entities in a Knowledge Base", Multi-source, Multi-lingual Information Extraction and Summarization, http://www.cs.jhu.edu/~delip/entity-linking.pdf[permanent dead link] (retrieved: 18.06.2012).
- ^ Rocket Software, Inc. (2012). "technology for extracting intelligence from text", http://www.rocketsoftware.com/products/aerotext Archived 2013-06-21 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Orchestr8 (2012): "AlchemyAPI Overview", http://www.alchemyapi.com/api Archived 2016-05-13 at the Wayback Machine (retrieved: 18.06.2012).
- ^ The University of Sheffield (2011). "ANNIE: a Nearly-New Information Extraction System", http://gate.ac.uk/sale/tao/splitch6.html#chap:annie (retrieved: 18.06.2012).
- ^ ILP Network of Excellence. "ASIUM (LRI)", http://www-ai.ijs.si/~ilpnet2/systems/asium.html (retrieved: 18.06.2012).
- ^ Attensity (2012). "Exhaustive Extraction", http://www.attensity.com/products/technology/semantic-server/exhaustive-extraction/ Archived 2012-07-11 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Mendes, Pablo N.; Jakob, Max; Garcia-Sílva, Andrés; Bizer; Christian (2011). "DBpedia Spotlight: Shedding Light on the Web of Documents", Proceedings of the 7th International Conference on Semantic Systems, p. 1 - 8, http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-GarciaSilva-Bizer-DBpediaSpotlight-ISEM2011.pdf Archived 2012-04-05 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Gangemi, Aldo; Presutti, Valentina; Reforgiato Recupero, Diego; Nuzzolese, Andrea Giovanni; Draicchio, Francesco; Mongiovì, Misael (2016). "Semantic Web Machine Reading with FRED", Semantic Web Journal, doi:10.3233/SW-160240, http://www.semantic-web-journal.net/system/files/swj1379.pdf
- ^ Adrian, Benjamin; Maus, Heiko; Dengel, Andreas (2009). "iDocument: Using Ontologies for Extracting Information from Text", http://www.dfki.uni-kl.de/~maus/dok/AdrianMausDengel09.pdf (retrieved: 18.06.2012).
- ^ SRA International, Inc. (2012). "NetOwl Extractor", http://www.sra.com/netowl/entity-extraction/ Archived 2012-09-24 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Fortuna, Blaz; Grobelnik, Marko; Mladenic, Dunja (2007). "OntoGen: Semi-automatic Ontology Editor", Proceedings of the 2007 conference on Human interface, Part 2, p. 309 - 318, http://analytics.ijs.si/~blazf/papers/OntoGen2_HCII2007.pdf Archived 2013-09-18 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Missikoff, Michele; Navigli, Roberto; Velardi, Paola (2002). "Integrated Approach to Web Ontology Learning and Engineering", Computer, 35(11), p. 60 - 63, http://wwwusers.di.uniroma1.it/~velardi/IEEE_C.pdf Archived 2017-05-19 at the Wayback Machine (retrieved: 18.06.2012).
- ^ McDowell, Luke K.; Cafarella, Michael (2006). "Ontology-driven Information Extraction with OntoSyphon", Proceedings of the 5th international conference on The Semantic Web, p. 428 - 444, http://turing.cs.washington.edu/papers/iswc2006McDowell-final.pdf (retrieved: 18.06.2012).
- ^ Yildiz, Burcu; Miksch, Silvia (2007). "ontoX - A Method for Ontology-Driven Information Extraction", Proceedings of the 2007 international conference on Computational science and its applications, 3, p. 660 - 673, http://publik.tuwien.ac.at/files/pub-inf_4769.pdf Archived 2017-07-05 at the Wayback Machine (retrieved: 18.06.2012).
- ^ semanticweb.org (2011). "PoolParty Extractor", http://semanticweb.org/wiki/PoolParty_Extractor Archived 2016-03-04 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Dill, Stephen; Eiron, Nadav; Gibson, David; Gruhl, Daniel; Guha, R.; Jhingran, Anant; Kanungo, Tapas; Rajagopalan, Sridhar; Tomkins, Andrew; Tomlin, John A.; Zien, Jason Y. (2003). "SemTag and Seeker: Bootstraping the Semantic Web via Automated Semantic Annotation", Proceedings of the 12th international conference on World Wide Web, p. 178 - 186, http://www2003.org/cdrom/papers/refereed/p831/p831-dill.html (retrieved: 18.06.2012).
- ^ Uren, Victoria; Cimiano, Philipp; Iria, José; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio (2006). "Semantic annotation for knowledge management: Requirements and a survey of the state of the art", Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), p. 14 - 28, http://staffwww.dcs.shef.ac.uk/people/J.Iria/iria_jws06.pdf[permanent dead link], (retrieved: 18.06.2012).
- ^ Cimiano, Philipp; Völker, Johanna (2005). "Text2Onto - A Framework for Ontology Learning and Data-Driven Change Discovery", Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, 3513, p. 227 - 238, http://www.cimiano.de/Publications/2005/nldb05/nldb05.pdf (retrieved: 18.06.2012).
- ^ Maedche, Alexander; Volz, Raphael (2001). "The Ontology Extraction & Maintenance Framework Text-To-Onto", Proceedings of the IEEE International Conference on Data Mining, http://users.csc.calpoly.edu/~fkurfess/Events/DM-KM-01/Volz.pdf (retrieved: 18.06.2012).
- ^ Machine Linking. "We connect to the Linked Open Data cloud", http://thewikimachine.fbk.eu/html/index.html Archived 2012-07-19 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Inxight Federal Systems (2008). "Inxight ThingFinder and ThingFinder Professional", http://inxightfedsys.com/products/sdks/tf/ Archived 2012-06-29 at the Wayback Machine (retrieved: 18.06.2012).
- ^ Frawley William. F. et al. (1992), "Knowledge Discovery in Databases: An Overview", AI Magazine (Vol 13, No 3), 57-70 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1011 Archived 2016-03-04 at the Wayback Machine)
- ^ Fayyad U. et al. (1996), "From Data Mining to Knowledge Discovery in Databases", AI Magazine (Vol 17, No 3), 37-54 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1230 Archived 2016-05-04 at the Wayback Machine
- ^ Cao, L. (2010). "Domain driven data mining: challenges and prospects". IEEE Transactions on Knowledge and Data Engineering. 22 (6): 755–769. CiteSeerX 10.1.1.190.8427. doi:10.1109/tkde.2010.32. S2CID 17904603.
Knowledge extraction
Introduction
Definition and Scope
Knowledge extraction is the process of identifying, retrieving, and structuring implicit or explicit knowledge from diverse data sources to produce usable, machine-readable representations such as knowledge graphs or ontologies. This involves transforming raw data into semantically meaningful forms that capture entities, relationships, and facts, facilitating advanced reasoning and application integration.[5] The key objectives of knowledge extraction include automating the acquisition of domain-specific knowledge from vast datasets, thereby reducing manual annotation efforts; enabling semantic interoperability by standardizing representations across heterogeneous systems; and supporting informed decision-making in artificial intelligence systems through enhanced contextual understanding and inference capabilities. These goals address the challenges of scaling knowledge representation in data-intensive environments, such as enabling AI models to leverage structured insights for tasks like question answering and recommendation.[5][6][7] While data mining focuses on discovering patterns and associations in data, knowledge extraction often emphasizes the creation of structured, semantically rich representations suitable for logical inference and interoperability.[1] It also differs from information retrieval, which focuses on identifying and ranking relevant documents or data snippets in response to user queries based on similarity measures, typically returning unstructured or semi-structured results. Knowledge extraction, however, actively parses and organizes content to generate structured outputs like entity-relation triples, moving beyond mere retrieval to knowledge synthesis.[8][9] The scope of knowledge extraction spans structured sources like databases, semi-structured formats such as XML or JSON, and unstructured data including text corpora and multimedia, aiming to bridge the gap between raw information and actionable knowledge. 
It excludes foundational data preprocessing steps like cleaning or normalization, as well as passive storage mechanisms, concentrating instead on the interpretive and representational transformation of content.[5][7]
Historical Development
The roots of knowledge extraction trace back to the 1970s and 1980s, when artificial intelligence research emphasized expert systems that required manual knowledge acquisition from domain experts to encode rules and facts into computable forms.[10] A seminal example is the MYCIN system, developed at Stanford University in 1976, which used backward-chaining inference to diagnose bacterial infections and recommend antibiotics based on approximately 450 production rules derived from medical expertise.[11] This era highlighted the "knowledge bottleneck," where acquiring and structuring human expertise proved labor-intensive, laying foundational concepts for later automated extraction techniques from diverse data sources.[12]
The 1990s marked a pivotal shift toward automated information extraction from text, driven by the need to process unstructured natural language data at scale. The Message Understanding Conferences (MUC), initiated in 1987 under DARPA sponsorship, standardized evaluation benchmarks for extracting entities, relations, and events from text; the 1991 edition focused on news reports of terrorist incidents in Latin America.[13] MUC-3 in 1991 introduced template-filling tasks with metrics like recall and precision, fostering rule-based and early machine learning approaches that achieved modest performance, such as 50-60% F1 scores on coreference resolution.[14] These conferences evolved through MUC-7 in 1998, influencing the broader field by emphasizing scalable extraction pipelines.[15]
In the 2000s, the semantic web paradigm propelled knowledge extraction toward structured, interoperable representations, with the World Wide Web Consortium (W3C) standardizing RDF in 1999 and OWL in 2004 to enable ontology-based knowledge modeling and inference.[16] The Semantic Web Challenge, launched in 2003 alongside the International Semantic Web Conference, encouraged innovative applications integrating extracted knowledge, such as querying distributed RDF data for tourism
recommendations.[17] A landmark milestone was the DBpedia project in 2007, which automatically extracted over 2 million RDF triples from Wikipedia infoboxes, creating the first large-scale, multilingual knowledge base accessible via SPARQL queries and serving as a hub for linked open data.[18]
The 2010s saw knowledge extraction integrate with big data ecosystems and advanced natural language processing, culminating in the widespread adoption of knowledge graphs for search and recommendation systems. Google's Knowledge Graph, announced in 2012, incorporated billions of facts from sources like Freebase and Wikipedia to disambiguate queries and provide entity-based answers, improving search relevance by connecting over 500 million objects and 3.5 billion facts.[19] This era emphasized hybrid extraction methods combining rule-based parsing with statistical models, scaling to web-scale data.
Post-2020, the AI boom, particularly large language models (LLMs), has revolutionized extraction by enabling zero-shot entity and relation identification from unstructured text, with surveys highlighting LLM-empowered knowledge graph construction that reduces manual annotation needs and enhances factual accuracy in domains like biomedicine. For instance, in biomedical knowledge mining, a retrieval-augmented method using LLMs improved document retrieval F1 score by 20% and answer generation accuracy by 25% over baselines, bridging semantic web foundations with generative AI for dynamic knowledge updates.[20]
Extraction from Structured Sources
Relational Databases to Semantic Models
Knowledge extraction from relational databases to semantic models involves transforming structured tabular data into RDF triples or knowledge graphs, enabling semantic querying and interoperability. This process typically employs direct mapping techniques that convert database schemas and instances into RDF representations without extensive restructuring. In a basic 1:1 transformation, each row in a relational table is mapped to an RDF instance (subject), while columns define properties (predicates) linked to cell values as objects.[21][22]
Direct mapping approaches, such as those defined in the W3C's RDB Direct Mapping specification, automate this conversion by treating tables as classes and attributes as predicates, generating RDF from the database schema and content on-the-fly. For instance, a table named "Customers" with columns "ID", "Name", and "Email" would produce triples where each customer row becomes a subject URI like <http://example.com/customer/{ID}>, with predicates such as ex:name and ex:email pointing to the respective values. These mappings preserve the relational structure while exposing it semantically, facilitating integration with ontologies.[21][22]
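The row-to-subject idea can be illustrated with a minimal Python sketch over an in-memory SQLite table. The `direct_map` helper, the `ex:` prefix, and the sample row are illustrative assumptions. The IRI shape follows the example in the text, not the exact IRI scheme of the W3C Direct Mapping specification.

```python
import sqlite3

def direct_map(conn, table, key, subject_prefix):
    """Map each row to one subject and each non-key column to one predicate."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{subject_prefix}{record[key]}>"
        # The table name stands in for the class of every row.
        triples.append((subject, "rdf:type", f"ex:{table}"))
        for column, value in record.items():
            if column == key or value is None:
                continue  # NULL cells produce no triple
            triples.append((subject, f"ex:{column.lower()}", f'"{value}"'))
    return triples

# Hypothetical sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (ID INTEGER PRIMARY KEY, Name TEXT, Email TEXT)")
conn.execute("INSERT INTO Customers VALUES (1, 'Alice Smith', 'alice@example.com')")

triples = direct_map(conn, "Customers", "ID", "http://example.com/customer/")
for subject, predicate, obj in triples:
    print(subject, predicate, obj, ".")
```

The sketch emits plain string literals for every column; typed literals and foreign-key links need the additional rules discussed in the following paragraphs.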
Schema alignment addresses relationships across tables, particularly foreign keys, which are interpreted as RDF links between instances. Tools like D2RQ enable virtual mappings by defining correspondences between relational schemas and RDF vocabularies, rewriting SPARQL queries to SQL without data replication. Similarly, the R2RML standard supports customized triples maps with referencing object maps to join tables via foreign keys, using conditions like rr:joinCondition to link child and parent columns. This allows, for example, an "Orders" table foreign key to "Customers.ID" to generate triples connecting order instances to customer subjects.[23][21]
Challenges in this conversion include handling database normalization, where denormalized views may be needed to avoid fragmented RDF graphs from vertically partitioned relations, and data type mismatches, such as converting SQL DATE to RDF xsd:date or xsd:dateTime via explicit mappings. Solutions involve declarative rules in R2RML to override defaults, ensuring literals match XML Schema datatypes, and tools like D2RQ's generate-mapping utility to produce initial alignments that can be refined manually. Normalization issues are mitigated by creating R2RML views that denormalize data through SQL joins before RDF generation.[22][21][23]
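The datatype side of this problem can be sketched as a lookup from declared SQL types to XML Schema datatypes. The table below covers only a handful of types and the `to_literal` helper is a hypothetical name; a real R2RML processor applies the fuller natural mapping defined in the specification and also escapes special characters in string literals, which this sketch does not.

```python
from datetime import date, datetime
from decimal import Decimal

# Partial, illustrative mapping from SQL type names to XSD datatypes.
XSD_BY_SQL_TYPE = {
    "INTEGER": "xsd:integer",
    "DECIMAL": "xsd:decimal",
    "BOOLEAN": "xsd:boolean",
    "DATE": "xsd:date",
    "TIMESTAMP": "xsd:dateTime",
    "VARCHAR": "xsd:string",
}

def to_literal(value, sql_type):
    """Render a Python value as a typed RDF literal for the given SQL type."""
    base_type = sql_type.split("(")[0].strip().upper()  # DECIMAL(10,2) -> DECIMAL
    datatype = XSD_BY_SQL_TYPE.get(base_type, "xsd:string")
    if isinstance(value, bool):
        lexical = "true" if value else "false"
    elif isinstance(value, (date, datetime)):
        lexical = value.isoformat()  # ISO 8601, as expected by xsd:date/xsd:dateTime
    else:
        lexical = str(value)
    return f'"{lexical}"^^{datatype}'

print(to_literal(date(2024, 3, 1), "DATE"))
print(to_literal(datetime(2024, 3, 1, 12, 30), "TIMESTAMP"))
print(to_literal(Decimal("999.99"), "DECIMAL(10,2)"))
```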
A representative example is mapping a customer-order database. Consider two tables: "Customers" (columns: CustID [INTEGER PRIMARY KEY], Name [VARCHAR], Email [VARCHAR]) and "Orders" (columns: OrderID [INTEGER PRIMARY KEY], CustID [INTEGER FOREIGN KEY], Product [VARCHAR], Amount [DECIMAL]).
Step-by-step mapping rules using R2RML:
- Triples Map for Customers: Define a logical table as rr:tableName "Customers". Set subject map: rr:template "http://example.com/customer/{CustID}", rr:class ex:Customer. Add predicate-object maps: rr:predicate ex:name, rr:objectMap [ rr:column "Name" ]; and rr:predicate ex:email, rr:objectMap [ rr:column "Email"; rr:datatype xsd:string ]. This generates triples like <http://example.com/customer/101> rdf:type ex:Customer . <http://example.com/customer/101> ex:name "Alice Smith" . <http://example.com/customer/101> ex:email "[email protected]" .[21]
- Triples Map for Orders: Logical table: rr:tableName "Orders". Subject map: rr:template "http://example.com/order/{OrderID}", rr:class ex:Order. Predicate-object maps: rr:predicate ex:product, rr:objectMap [ rr:column "Product" ]; rr:predicate ex:amount, rr:objectMap [ rr:column "Amount"; rr:datatype xsd:decimal ]. Include a referencing object map for the foreign key: rr:predicate ex:customer, rr:parentTriplesMap <#CustomerMap>, rr:joinCondition [ rr:child "CustID"; rr:parent "CustID" ]. For a row with OrderID=201, CustID=101, Product="Laptop", Amount=999.99, this yields <http://example.com/order/201> rdf:type ex:Order . <http://example.com/order/201> ex:product "Laptop" . <http://example.com/order/201> ex:amount "999.99"^^xsd:decimal . <http://example.com/order/201> ex:customer <http://example.com/customer/101> .[21]
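The effect of the two triples maps above can be imitated in a few lines of Python, which may help when checking what an R2RML processor is expected to emit. This is a hand-written emulation and not an R2RML engine: the SQL join stands in for the rr:joinCondition, the function names are hypothetical, and the email value is a placeholder chosen for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
-- Amount is stored as text so the decimal lexical form is preserved exactly.
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                     CustID INTEGER REFERENCES Customers(CustID),
                     Product TEXT, Amount TEXT);
INSERT INTO Customers VALUES (101, 'Alice Smith', 'alice@example.com');
INSERT INTO Orders VALUES (201, 101, 'Laptop', '999.99');
""")

def customer_iri(cust_id):
    return f"<http://example.com/customer/{cust_id}>"

def order_iri(order_id):
    return f"<http://example.com/order/{order_id}>"

def order_triples(conn):
    """Emit the triples of the Orders map, including the foreign-key link."""
    # The join plays the role of rr:joinCondition (child CustID = parent CustID).
    query = """SELECT o.OrderID, o.Product, o.Amount, c.CustID
               FROM Orders AS o JOIN Customers AS c ON o.CustID = c.CustID"""
    lines = []
    for order_id, product, amount, cust_id in conn.execute(query):
        subject = order_iri(order_id)
        lines.append(f"{subject} rdf:type ex:Order .")
        lines.append(f'{subject} ex:product "{product}" .')
        lines.append(f'{subject} ex:amount "{amount}"^^xsd:decimal .')
        lines.append(f"{subject} ex:customer {customer_iri(cust_id)} .")
    return lines

order_lines = order_triples(conn)
print("\n".join(order_lines))
```

Because the link is produced from a join, an order whose CustID has no matching customer row yields no ex:customer triple, which mirrors how a referencing object map behaves.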
XML and Other Markup Languages
Knowledge extraction from XML documents leverages the hierarchical, tag-based structure of markup languages to identify and transform data into structured representations, such as semantic models like RDF. XML, designed for encoding documents with explicit tags denoting content meaning, facilitates precise querying and mapping of elements to knowledge entities, enabling the conversion of raw markup into ontologies or triple stores. This process is particularly effective for sources like configuration files, data exchanges, and publishing formats where schema information guides extraction.
XML parsing techniques form the foundation of extraction, utilizing standards like XPath for navigating document trees, XQuery for declarative querying, and XSLT for stylesheet-based transformations. XPath allows selection of nodes via path expressions, such as /product/category[name='electronics']/item, to isolate relevant elements for knowledge representation. XQuery extends this by supporting functional queries that aggregate and filter data, often outputting results in formats amenable to semantic processing. For instance, XQuery can join multiple XML documents and project attributes into RDF triples, streamlining the extraction of relationships like product hierarchies. XSLT, in turn, applies rules to transform XML into RDF/XML, using templates to map tags to predicates and attributes to objects; a seminal approach embeds XPath within XSLT to generate triples dynamically, as demonstrated in streaming transformations for large-scale data. These tools ensure efficient, schema-aware parsing without full document loading, crucial for knowledge extraction pipelines.[24]
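As a small illustration of path-based selection, the sketch below evaluates the path expression from the text with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The catalog document, the `ex:` prefix, and the `ex:hasName`/`ex:hasPrice` predicates are assumptions made for the example. Since the root element is `<product>`, the absolute path is written relative to the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only for illustration.
CATALOG = """<product>
  <category><name>electronics</name>
    <item id="p1"><name>Laptop</name><price>999</price></item>
    <item id="p2"><name>Phone</name><price>499</price></item>
  </category>
  <category><name>books</name>
    <item id="p3"><name>Atlas</name><price>30</price></item>
  </category>
</product>"""

root = ET.fromstring(CATALOG)

# Equivalent of /product/category[name='electronics']/item, relative to the root.
items = root.findall("./category[name='electronics']/item")

triples = []
for item in items:
    subject = "ex:" + item.get("id")
    name = item.findtext("name")
    price = item.findtext("price")
    triples.append((subject, "rdf:type", "ex:Item"))
    triples.append((subject, "ex:hasName", f'"{name}"'))
    triples.append((subject, "ex:hasPrice", price))

for triple in triples:
    print(*triple, ".")
```

ElementTree loads the whole document into memory; for the streaming scenarios mentioned above, an incremental parser such as `ET.iterparse` would be a more fitting starting point.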
Schema-driven extraction enhances accuracy by inferring ontologies from XML Schema Definition (XSD) files, which define element types, constraints, and hierarchies. XSD complex types can be mapped to ontology classes, with attributes becoming properties and nesting indicating subclass relations; for example, an XSD element <product> with sub-elements like <price> and <description> infers a Product class with price and description predicates. Automated tools mine these schemas to generate OWL ontologies, preserving cardinality and data types while resolving ambiguities through pattern recognition. This method has been formalized in approaches that construct deep semantics, such as identifying inheritance via extension/restriction in XSD, yielding reusable knowledge bases from schema repositories. By grounding extraction in XSD, the process minimizes manual annotation and supports validation during transformation.[25]
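The `<product>` example can be sketched in Python by reading an XSD file and treating each top-level element with a complex type as a class. The schema text, the `infer_classes` helper, and the output layout are hypothetical. The sketch ignores named types, extension and restriction, and attributes, all of which a full XSD-to-OWL tool would need to handle.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment matching the <product> example in the text.
XSD_TEXT = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="product">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="price" type="xs:decimal"/>
        <xs:element name="description" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def infer_classes(xsd_text):
    """Return {ClassName: {property: {"range": type, "optional": bool}}}."""
    schema = ET.fromstring(xsd_text)
    classes = {}
    for element in schema.findall(f"{XS}element"):
        complex_type = element.find(f"{XS}complexType")
        if complex_type is None:
            continue  # simple-typed elements are not treated as classes here
        properties = {}
        for child in complex_type.iter(f"{XS}element"):
            properties[child.get("name")] = {
                "range": child.get("type"),
                "optional": child.get("minOccurs") == "0",
            }
        classes[element.get("name").capitalize()] = properties
    return classes

classes = infer_classes(XSD_TEXT)
print(classes)
```

Keeping the `minOccurs` information alongside each property is one simple way to carry cardinality over into the generated ontology.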
XML builds on predecessors like SGML, the Standard Generalized Markup Language, which introduced generalized tagging for document interchange in the 1980s, influencing XML's design for portability and extensibility. Modern publishing formats, such as DocBook—an XML vocabulary for technical documentation—extend this legacy by embedding semantic markup that aids extraction; for instance, DocBook's <book> and <chapter> elements can be transformed via XSLT to RDF, capturing structural knowledge like authorship and sections for knowledge graphs. These evolutions emphasize markup's role in facilitating semantic interoperability.
A representative case study involves extracting product catalogs from XML feeds, common in e-commerce platforms like Amazon's data feeds, into knowledge bases. In one implementation, XPath queries target elements such as <item><name> and <price>, while XSLT maps them to RDF triples (e.g., ex:product rdf:type ex:Item; ex:hasName "Laptop"; ex:hasPrice 999), integrating with SPARQL endpoints for querying. This approach, tested on feeds with thousands of entries, achieves high precision in entity resolution and relation extraction, enabling applications like recommendation systems; GRDDL profiles further standardize such transformations by associating XSLT scripts with XML via profiles, as used in syndication scenarios.[26]
Tools and Direct Mapping Techniques
One prominent tool for direct mapping relational databases to RDF is the D2RQ Platform, developed since 2004 at Freie Universität Berlin.[23][27] D2RQ enables access to relational databases as virtual, read-only RDF graphs by using a declarative mapping language that relates database schemas to RDF vocabularies or OWL ontologies.[28] This approach allows for on-the-fly translation of SPARQL queries into SQL without materializing the RDF data, facilitating integration of legacy databases into semantic web applications.[29]
Building on such efforts, the W3C standardized R2RML (RDB to RDF Mapping Language) in September 2012 as a recommendation for expressing customized mappings from relational databases to RDF datasets.[21][30] R2RML defines mappings through triples maps, which associate logical tables (such as SQL queries or base tables) with RDF triples, enabling tailored views of the data while preserving relational integrity.[21] Unlike earlier tools, R2RML's standardization promotes interoperability across processors, with implementations supporting both virtual and materialized RDF views.[31]
At the core of these tools are rule-based mappers that generate RDF terms deterministically from database rows. For instance, subject maps and predicate-object maps in R2RML use template maps to construct IRIs for entities, such as http://example.com/Person/{id} where {id} is a placeholder for a column value like a primary key.[21] Similarly, D2RQ employs property bridges and class maps to define IRI patterns based on column values, ensuring that entities and relations are linked without custom scripting.[28] These rules are compiled into SQL views at runtime, translating SPARQL patterns into efficient relational queries.[30]
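Template-based IRI generation can be sketched in Python as a string substitution with percent-encoding of column values. The `expand_template` helper is a hypothetical name. `urllib.parse.quote` only approximates the IRI-safe encoding that R2RML defines, and returning `None` for NULL values mirrors the rule that a NULL column produces no term.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")

def expand_template(template, row):
    """Fill {column} placeholders with percent-encoded values from one row."""
    columns = PLACEHOLDER.findall(template)
    if any(row.get(column) is None for column in columns):
        return None  # a NULL (or missing) column yields no IRI
    return PLACEHOLDER.sub(
        lambda match: quote(str(row[match.group(1)]), safe=""), template
    )

print(expand_template("http://example.com/Person/{id}", {"id": 42}))
print(expand_template("http://example.com/Person/{name}", {"name": "Alice Smith"}))
print(expand_template("http://example.com/Person/{id}", {"id": None}))
```

Because the same row always produces the same IRI, two mappings that share a template refer to the same entity, which is what allows foreign-key references to line up with subject IRIs.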
Performance in these systems often revolves around query federation through SPARQL endpoints, as provided by the D2R Server component of D2RQ.[32] Simple triple pattern queries can achieve performance comparable to hand-optimized SQL, but complexity increases with joins or filters, potentially leading to exponential SQL generation due to the mapping's declarative nature.[27] R2RML processors similarly expose endpoints for federated queries, though optimization relies on database indexes and primary keys to mitigate translation overhead.[32]
Direct mapping techniques, however, have limitations when applied to non-ideal schemas, such as denormalized data where redundancy violates normalization principles. In these cases, automated IRI generation may produce duplicate entities or incorrect relations, as the mapping assumes one-to-one correspondences that do not hold in denormalized tables.[33][34] For example, a denormalized table repeating customer details across orders could yield multiple identical RDF subjects, necessitating advanced customization or preprocessing to maintain semantic accuracy.[35] Such shortcomings often require shifting to more sophisticated methods for complex schemas.
Extraction from Semi-Structured Sources
JSON and NoSQL Databases
Knowledge extraction from JSON documents leverages the format's hierarchical and flexible structure to identify entities, properties, and relationships that can be mapped to semantic representations such as RDF triples. JSONPath serves as a query language analogous to XPath for XML, enabling precise navigation and extraction of data from JSON structures without requiring custom scripting. For instance, expressions like $.store.book[0].title allow traversal of nested objects and arrays to retrieve specific values, facilitating the isolation of potential knowledge elements like entities or attributes.[36]
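To show what such a path expression does, here is a toy Python evaluator for a very small JSONPath subset (dotted member names and numeric array indexes only). The `select` function and the sample document are assumptions for illustration. Wildcards, filters, and recursive descent are not handled, so a dedicated JSONPath library would be the practical choice.

```python
import json
import re

# One step of the path: either .member or [index].
STEP = re.compile(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]")

def select(document, path):
    """Follow a path such as $.store.book[0].title through parsed JSON."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = document
    position = 1
    while position < len(path):
        match = STEP.match(path, position)
        if match is None:
            raise ValueError(f"unsupported path syntax at offset {position}")
        member, index = match.groups()
        current = current[member] if member is not None else current[int(index)]
        position = match.end()
    return current

# Hypothetical sample document.
data = json.loads("""
{"store": {"book": [{"title": "Linked Data Basics", "price": 8.95},
                    {"title": "Graphs in Practice", "price": 12.5}]}}
""")

print(select(data, "$.store.book[0].title"))
print(select(data, "$.store.book[1].price"))
```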
Transformation of extracted JSON data into RDF is standardized through JSON-LD, a W3C Recommendation from 2014 that embeds contextual mappings within JSON to serialize Linked Data. JSON-LD uses a @context to map JSON keys to IRIs from ontologies, enabling automatic conversion of documents into RDF graphs where nested structures represent classes and properties; for example, a JSON object { "name": "Alice", "friend": { "name": "Bob" } } with appropriate context can yield triples like <Alice> <foaf:knows> <Bob> . This approach supports schema flexibility in semi-structured data, allowing knowledge extraction without rigid predefined schemas.[37]
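The role of the @context can be illustrated with a hand-rolled Python sketch that resolves keys to IRIs and walks nested objects. It is deliberately not the JSON-LD processing algorithm: it assumes a flat context, assumes every node carries an explicit `@id` (added here for the example), and ignores blank nodes, typed values, and lists. A conforming JSON-LD library should be used for real data.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical document; the @id values are assumptions added for the sketch.
document = {
    "@context": {"name": FOAF + "name", "friend": FOAF + "knows"},
    "@id": "http://example.com/Alice",
    "name": "Alice",
    "friend": {"@id": "http://example.com/Bob", "name": "Bob"},
}

def to_triples(node, context):
    """Turn one node object (and nested node objects) into triples."""
    subject = node["@id"]
    triples = []
    for key, value in node.items():
        if key.startswith("@"):
            continue  # keywords such as @id and @context are not properties
        predicate = context[key]
        if isinstance(value, dict):
            # A nested object is another resource: link to it, then describe it.
            triples.append((subject, predicate, value["@id"]))
            triples.extend(to_triples(value, context))
        else:
            triples.append((subject, predicate, f'"{value}"'))
    return triples

triples = to_triples(document, document["@context"])
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```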
NoSQL databases amplify these techniques due to their schema-less nature, which mirrors JSON's variability but scales to distributed environments. In document-oriented stores like MongoDB, extraction involves querying collections of JSON-like BSON documents and mapping them to RDF via formal definitions of document structure; one method parses nested fields into subject-predicate-object triples, constructing knowledge graphs by inferring relations from embedded arrays and objects.[38] Graph databases such as Neo4j, using Cypher query language, handle inherently relational data; the Neosemantics plugin exports Cypher results directly to RDF formats like Turtle or JSON-LD, preserving graph traversals as semantic edges without loss of connectivity.[39]
Schema inference automates the discovery of implicit structures in JSON and NoSQL data, treating nested objects as potential classes and their keys as properties to generate ontologies dynamically. Algorithms process datasets in parallel, inferring types for values (e.g., strings, numbers, arrays) and fusing them across documents to mark optional fields or unions, as in approaches using MapReduce-like steps on tools like Apache Spark; this detects hierarchies where, for example, repeated nested objects indicate class instances with inherited properties.[40]
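The infer-then-fuse pattern described above can be sketched in plain Python, with `map` and `reduce` standing in for the distributed steps. The function names and the flat path-based schema layout are assumptions made for brevity; the sketch does not descend into arrays and is not the algorithm of any specific published tool.

```python
from functools import reduce

def type_name(value):
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int, since bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def infer_document(doc, prefix=""):
    """Map step: describe one document as {path: {"types", "optional"}}."""
    schema = {}
    for key, value in doc.items():
        path = prefix + key
        schema[path] = {"types": {type_name(value)}, "optional": False}
        if isinstance(value, dict):
            schema.update(infer_document(value, path + "."))
    return schema

def fuse(left, right):
    """Reduce step: union the types per path; paths missing on one side become optional."""
    fused = {}
    for path in left.keys() | right.keys():
        a, b = left.get(path), right.get(path)
        fused[path] = {
            "types": (a["types"] if a else set()) | (b["types"] if b else set()),
            "optional": a is None or b is None or a["optional"] or b["optional"],
        }
    return fused

# Hypothetical sample documents.
documents = [
    {"id": 1, "name": "Alice", "address": {"city": "Oslo"}},
    {"id": "2", "name": "Bob"},
]
schema = reduce(fuse, map(infer_document, documents))
for path in sorted(schema):
    print(path, schema[path])
```

Since `fuse` is associative and commutative, partial schemas can be combined in any order, which is what makes this style of inference easy to parallelize.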
A representative example is extracting knowledge from social media feeds stored in JSON format, such as Twitter data. Processing tweet JSON objects—containing fields like user, text, and timestamps—applies named entity recognition and relation extraction to generate RDF triples; for instance, from a tweet like "Norway bans petrol cars," tools identify entities (Norway as Location, petrol as Fuel) and relations (ban), yielding triples such as <Norway> <bans> <petrol> . enriched with Schema.org vocabulary, forming a dataset queryable via SPARQL for insights like pollution policies.[41]
Web Data and APIs
Knowledge extraction from web data and APIs involves retrieving and structuring semi-structured information from online sources, such as RESTful endpoints and HTML-embedded markup, to populate knowledge graphs or semantic models. REST APIs typically return data in JSON or XML formats, which can be parsed to identify entities, attributes, and relationships. For instance, JSON responses from APIs are processed using schema inference tools to generate RDF triples, enabling integration with ontologies like schema.org, a collaborative vocabulary for marking up web content with structured data.[42] Schema.org provides extensible schemas that map API outputs to semantic concepts, such as products or events, facilitating automated extraction without custom parsers in many cases.[43] Web scraping techniques target semi-structured elements embedded in HTML, including Microdata and RDFa, which encode metadata directly within page content. Microdata uses HTML attributes like itemscope and itemprop to denote structured items, while RDFa extends XHTML with RDF syntax for richer semantics. Tools like the Any23 library parse these formats to extract RDF quads from web corpora, as demonstrated by the Web Data Commons project, which has processed billions of pages from the Common Crawl to yield datasets of over 70 billion triples.[44] This approach allows extraction of schema.org-compliant data, such as organization details or reviews, directly from webpages, converting them into knowledge graph nodes and edges.[45] Ethical and legal considerations are paramount in web data extraction to ensure compliance and sustainability. 
Practitioners must respect robots.txt files, a standard protocol that instructs crawlers on permissible site access, preventing overload or unauthorized scraping.[46] Additionally, under the EU's General Data Protection Regulation (GDPR), extracting personal data, such as user identifiers from API responses, requires lawful basis and consent, with non-compliance risking fines up to 4% of global turnover.[47] Rate limiting, typically implemented via delays between requests, mitigates server strain and aligns with terms of service, promoting responsible data acquisition.[48]
A representative case is the extraction of e-commerce product data via Amazon's APIs, which provide JSON endpoints for item attributes like price, description, and reviews. Amazon has leveraged such data in constructing commonsense knowledge graphs to enhance product recommendations, encoding relationships between items (e.g., "compatible with") using graph databases like Neptune. This process involves parsing API responses with schema.org vocabularies to infer entities and relations, yielding graphs that support real-time querying for over a billion products.[49][50]
Parsing and Schema Inference Methods
Parsing and schema inference methods address the challenge of deriving structured representations from semi-structured data, such as JSON or XML, where explicit schemas are absent or inconsistent. These methods involve analyzing the data's internal structure, identifying recurring patterns in fields, types, and relationships, and generating a schema that captures the underlying organization without requiring predefined mappings. Unlike direct mapping techniques from structured sources, which rely on rigid predefined schemas, inference approaches handle variability by clustering similar elements and resolving ambiguities algorithmically.[51]
Inference techniques often employ record linkage to identify and group similar fields across records, treating field names and values as entities to be matched despite variations in naming or format. For instance, edit distance metrics, such as Levenshtein distance, measure the similarity between field names by calculating the minimum number of single-character edits needed to transform one string into another, enabling the merging of semantically equivalent fields like "user_name" and "username." This process facilitates schema normalization by linking disparate representations into unified attributes, improving data integration in semi-structured datasets.[52][53]
Tools like OpenRefine support schema inference through data cleaning and transformation workflows, allowing users to cluster similar values, facet data by types, and export reconciled structures to formats such as JSON Schema or RDF. OpenRefine processes semi-structured inputs by iteratively refining clusters based on user-guided or automated similarity thresholds, enabling the detection of field types and hierarchies without manual schema design.
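The edit-distance matching of field names described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the greedy clustering strategy, the function names, the sample field list, and the threshold of one edit are all hypothetical choices, not the method of any particular tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def cluster_field_names(names, max_distance=1):
    """Greedily attach each name to the first cluster whose representative is close enough."""
    clusters = []
    for name in names:
        for cluster in clusters:
            # The first member of each cluster acts as its representative (an assumption).
            if levenshtein(name.lower(), cluster[0].lower()) <= max_distance:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical field names collected from heterogeneous JSON records.
fields = ["user_name", "username", "email", "e_mail", "created_at"]
print(cluster_field_names(fields))
# [['user_name', 'username'], ['email', 'e_mail'], ['created_at']]
```

A small threshold keeps unrelated short names from merging; in practice the threshold would likely be scaled to the length of the names being compared.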
Additionally, specialized JSON Schema inference libraries, such as those implementing algorithms from the EDBT'17 framework, automate the generation of schemas from sample JSON instances by analyzing type distributions and nesting patterns across records.[40]
Probabilistic models enhance schema inference by estimating field types under uncertainty, particularly in datasets with mixed or evolving formats. Basic Bayesian approaches compute the posterior probability of a type given observed values, using Bayes' theorem as $ P(\text{type} \mid \text{value}) = \frac{P(\text{value} \mid \text{type}) \cdot P(\text{type})}{P(\text{value})} $, where priors reflect common data patterns (e.g., strings for names) and likelihoods are derived from value characteristics like length or format. This enables robust type prediction for fields exhibiting variability, such as numeric identifiers that may appear as strings.[54]
A typical workflow for schema inference begins with parsing raw JSON to extract key-value pairs and nested objects, followed by applying linkage and probabilistic techniques to cluster fields and infer types. The resulting schema is then mapped to an ontology by translating JSON structures into classes and properties, often using rule-based transformations to align with standards like OWL. Validation steps involve sampling additional records against the inferred schema to measure coverage and accuracy, iterating refinements if discrepancies exceed thresholds, ensuring the ontology supports downstream knowledge extraction tasks.[55]
Extraction from Unstructured Sources
Natural Language Processing Foundations
Natural language processing (NLP) forms the bedrock for extracting knowledge from unstructured textual sources by enabling the systematic analysis of linguistic structures. At its core, the NLP pipeline begins with tokenization, which breaks down raw text into smaller units such as words, subwords, or characters, facilitating subsequent processing steps. This initial phase addresses challenges like handling punctuation, contractions, and language-specific orthographic rules, ensuring that text is segmented into meaningful tokens for further analysis. For instance, in English, tokenization typically splits sentences on whitespace while resolving ambiguities like "don't" into "do" and "n't".[56]
Following tokenization, part-of-speech (POS) tagging assigns grammatical categories, such as nouns, verbs, and adjectives, to each token based on its syntactic role and context. This step relies on probabilistic models trained on annotated corpora to disambiguate words with multiple possible tags, like "run" as a verb or noun. A seminal advancement in POS tagging came in the 1990s with the adoption of statistical models, particularly Hidden Markov Models (HMMs), which model sequences of tags as hidden states emitting observed words, achieving accuracies exceeding 95% on standard benchmarks.[57]
Dependency parsing extends this by constructing a tree representation of syntactic relationships between words, identifying heads (governors) and dependents to reveal phrase structures and grammatical dependencies. Tools like the Stanford Parser employ unlexicalized probabilistic context-free grammars to produce dependency trees with high precision, often around 90% unlabeled attachment score on Wall Street Journal data. These parses are crucial for understanding sentence semantics, such as subject-verb-object relations, without relying on full constituency trees.[58]
Linguistic resources underpin these techniques by providing annotated data and lexical knowledge.
The Penn Treebank, a large corpus of over 4.5 million words from diverse sources like news articles, offers bracketed syntactic parses and POS tags, serving as a primary training dataset for statistical parsers since its release. Complementing this, WordNet (1995) organizes English words into synsets (groups of synonyms linked by semantic relations like hypernymy), enabling inference of word meanings and relations for tasks like disambiguation.[59][60]
As a key preprocessing step in the pipeline, Named Entity Recognition (NER) identifies and classifies entities such as persons, organizations, and locations within text, typically using rule-based patterns or statistical classifiers trained on annotated examples. Early NER efforts, formalized during the Sixth Message Understanding Conference (MUC-6) in 1995, focused on extracting entities from news texts with F1 scores around 90% for core types, laying the groundwork for scalable entity detection without domain-specific tuning.[61]
The evolution of these NLP foundations traces from rule-based systems in the 1960s, exemplified by ELIZA, which used hand-crafted pattern matching to simulate dialogue, to statistical paradigms in the 1990s that leveraged probabilistic models like HMMs for robust handling of ambiguity and variability in natural language.[62] This shift enabled more data-driven approaches, improving accuracy and scalability for knowledge extraction pipelines.
Traditional Information Extraction
Traditional information extraction encompasses rule-based methods that employ predefined patterns and heuristics to identify and extract entities, such as names, dates, and organizations, as well as relations between them from unstructured text. These approaches originated in the 1990s through initiatives like the Message Understanding Conferences (MUC), where systems competed to process natural language documents into structured templates using cascading rule sets.[13] Unlike later data-driven techniques, traditional methods prioritize explicit linguistic rules derived from domain expertise, often applied after basic NLP preprocessing like tokenization and part-of-speech tagging to segment and annotate text.[13]
A core technique in traditional information extraction is pattern matching, frequently implemented via regular expressions to capture syntactic structures indicative of target information. For instance, a regular expression such as \b[A-Z][a-z]+ [A-Z][a-z]+\b can match person names by targeting capitalized word sequences, while patterns like \d{1,2}/\d{1,2}/\d{4} extract dates in MM/DD/YYYY format.[63] More sophisticated systems extend this to relational patterns, such as proximity-based rules that link entities (e.g., "CEO of [Organization]") to infer roles without deep semantic analysis. The GATE framework, released in 1996, exemplifies this by enabling developers to build modular pipelines of processing resources, including finite-state transducers and cascades for sequential entity recognition followed by relation extraction and co-reference resolution.[64] In GATE, rules are often specified in JAPE (Java Annotation Pattern Engine), allowing patterns like {Token.kind == uppercase, Token.string == "Inc."} to tag corporate entities, which then feed into higher-level relation cascades.[64]
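The regular-expression patterns quoted above can be tried directly with Python's standard re module. The sketch below is illustrative only: the person-name and date patterns are the ones given in the text, while the ROLE pattern, the extract function, and the sample sentence are hypothetical additions meant to show what a simple "CEO of [Organization]" rule might look like.

```python
import re

# Patterns taken from the text above.
PERSON = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

# Hypothetical relational pattern: a role title followed by "of" and a
# capitalized organization name of one or more words.
ROLE = re.compile(r"\b(CEO|CFO|CTO) of ([A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)")


def extract(text):
    """Apply each pattern independently and collect the raw matches."""
    return {
        "persons": PERSON.findall(text),
        "dates": DATE.findall(text),
        "roles": [(m.group(1), m.group(2)) for m in ROLE.finditer(text)],
    }


sample = "Jane Smith became CEO of Acme Corp on 3/14/2021."
result = extract(sample)
print(result["persons"])  # ['Jane Smith', 'Acme Corp']
print(result["dates"])    # ['3/14/2021']
print(result["roles"])    # [('CEO', 'Acme Corp')]
```

Note that the capitalized-word pattern also matches "Acme Corp", a false positive for the person type. Errors of this kind are one reason rule-based systems layer patterns in cascades, so that later rules can use earlier annotations to filter or reclassify matches.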
Evaluation of traditional information extraction systems typically employs precision, recall, and their harmonic mean, the F1-score, metrics standardized in the MUC evaluations to measure extracted items against gold-standard annotations. Precision (P) is the ratio of correctly extracted items to total extracted items, recall (R) is the ratio of correctly extracted items to total relevant items in the text, and the F1-score balances them as follows: $ F_1 = \frac{2 \cdot P \cdot R}{P + R} $
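The precision, recall, and F1 definitions given here translate directly into code. The following is a minimal sketch that assumes exact-match scoring over sets of (text, type) pairs; the function name and the sample annotations are hypothetical, and real evaluation schemes may additionally award partial credit.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted items against gold-standard items using exact matches."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return precision, recall, f1


# Hypothetical gold annotations and system output for one sentence.
gold = {("Jane Smith", "PERSON"), ("Acme Corp", "ORG"), ("3/14/2021", "DATE")}
extracted = {
    ("Jane Smith", "PERSON"),
    ("Acme Corp", "PERSON"),  # wrong type, so it counts as an error
    ("Corp", "ORG"),          # wrong span, so it counts as an error
    ("3/14/2021", "DATE"),
}

p, r, f1 = precision_recall_f1(extracted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Two of the four extracted items are correct (P = 2/4) and two of the three gold items are found (R = 2/3), giving F1 = 4/7.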
Ontology-Based and Semantic Extraction
Ontology-based information extraction (OBIE) leverages predefined ontologies to guide the identification and structuring of entities, relations, and events from text, ensuring extracted knowledge aligns with a formal semantic model.[66] Unlike traditional information extraction, which relies on general patterns, OBIE maps text spans (such as named entities or phrases) to specific ontology classes and properties, often using rule-based systems or machine learning models trained on ontology schemas.[67] This process typically involves three stages: recognizing relevant text elements, classifying them according to ontology concepts, and populating the ontology with instances and relations.[66]
In the OBIE workflow, rules or classifiers disambiguate and categorize extracted elements by referencing the ontology's hierarchical structure and constraints. For instance, a rule-based approach might use lexical patterns combined with ontology axioms to link a mention like "Paris" to the class City rather than a person's name, while machine learning methods employ supervised classifiers fine-tuned on annotated corpora aligned with the ontology.[67] Tools such as Semantic Web Company's PoolParty facilitate this by integrating ontology management with extraction pipelines; for example, PoolParty can import the DBpedia ontology to automatically tag entities in text, extracting instances of classes like Person or Organization and linking them to DBpedia URIs for semantic enrichment.[68][69]
Semantic annotation standards further support OBIE by enabling the markup of text with RDF triples that conform to the ontology.
The Evaluation and Report Language (EARL) 1.0, a W3C Working Draft schema from the early to mid-2000s, provides a framework for representing annotations as RDF statements, allowing tools to assert properties like dc:subject or foaf:depicts directly on text fragments.[70] This RDF-based markup ensures interoperability, as annotations can be queried and integrated into larger knowledge bases using SPARQL.[71]
A key advantage of ontology-based methods is their ability to enforce consistency and resolve ambiguities in extracted knowledge. For example, in processing the term "Apple," contextual analysis guided by an ontology like DBpedia can distinguish between the Fruit class (e.g., in a recipe) and the Company class (e.g., in a business report), preventing erroneous linkages and improving downstream applications such as question answering.[67][72] This structured guidance reduces errors compared to pattern-only approaches.[66]
Advanced Techniques
Machine Learning and AI-Driven Extraction
Machine learning approaches to knowledge extraction have evolved from traditional supervised techniques to advanced deep learning and large language models, enabling automated identification and structuring of entities, relations, and concepts from diverse data sources. Supervised methods, particularly Conditional Random Fields (CRFs), have been foundational for tasks like Named Entity Recognition (NER), where models are trained to assign labels to sequences of tokens representing entities such as persons, organizations, and locations. Introduced as probabilistic models for segmenting and labeling sequence data, CRFs address limitations of earlier approaches like Hidden Markov Models by directly modeling conditional probabilities and avoiding label bias issues. These models are typically trained on annotated corpora, with the CoNLL-2003 dataset serving as a benchmark for English NER, containing over 200,000 tokens from Reuters news articles labeled for four entity types. Early applications demonstrated CRFs achieving F1 scores around 88% on this dataset, establishing their efficacy for structured extraction in knowledge bases.[73][74]
Deep learning has advanced these capabilities through transformer architectures, which leverage self-attention mechanisms to capture long-range dependencies in text far more effectively than recurrent models. The transformer model, introduced in 2017, forms the backbone of modern systems for both NER and relation extraction by processing entire sequences in parallel. BERT (Bidirectional Encoder Representations from Transformers), released in 2018, exemplifies this shift; its pre-trained encoder is fine-tuned on task-specific data to excel in relation extraction, where it identifies semantic links between entities, such as "located_in" or "works_for," by treating the task as a classification over sentence spans.
Fine-tuned BERT models have set state-of-the-art benchmarks, achieving F1 scores exceeding 90% on datasets like SemEval-2010 Task 8 for relation classification, outperforming prior methods by integrating contextual embeddings.[75] This fine-tuning process adapts the model's bidirectional understanding of context, making it particularly suited for extracting relational knowledge from unstructured text.
Unsupervised methods complement supervised ones by discovering patterns without labeled data, often through clustering techniques that group similar textual elements to infer entities or topics. Latent Dirichlet Allocation (LDA), a generative probabilistic model from 2003, enables topic-based extraction by representing documents as mixtures of latent topics, where each topic is a distribution over words; this uncovers thematic structures that can reveal implicit entities or relations in corpora.[76] For instance, LDA has been applied to cluster news articles into topics like "politics" or "technology," facilitating entity discovery without annotations, as demonstrated in aspect extraction from reviews where it identifies opinion targets with coherence scores above 0.5 on benchmark sets. These approaches are valuable for scaling extraction to large, unlabeled datasets, though they require post-processing to map topics to structured knowledge.
Recent advances in large language models (LLMs) have introduced zero-shot extraction, allowing models to perform knowledge extraction without task-specific training by leveraging emergent capabilities from vast pre-training.
GPT-4, released in 2023, supports zero-shot relation and entity extraction through prompt engineering, achieving competitive F1 scores ranging from 67% to 98% on radiological reports for extracting clinical findings, rivaling supervised models in low-resource settings.[77] This extends to multimodal data, where models like GPT-4 process text-image pairs for integrated extraction; for example, systems using GPT-3.5 in zero-shot mode extract tags from images and captions, outperforming human annotations in precision and recall on datasets like Kuaishou.[78] As of 2025, subsequent models like GPT-4o have further improved zero-shot performance in such tasks. These developments shift knowledge extraction toward more flexible, generalizable AI systems, though challenges like hallucination persist.
Knowledge Graph Construction
Knowledge graph construction involves assembling extracted entities and relations from various sources into a structured graph representation, typically through a series of interconnected steps that ensure coherence and usability. The process begins with entity linking, where identified entities from text or data are mapped to existing nodes in a knowledge base or new nodes are created if no matches exist. This step is crucial for avoiding duplicates and maintaining graph integrity, often employing similarity metrics such as the Jaccard index, which measures the overlap between sets of attributes or neighbors of candidate entities to determine matches. For instance, in embedding-assisted approaches like EAGER, Jaccard similarity is combined with graph embeddings to resolve entities across knowledge graphs by comparing neighborhood structures.[79]
Following entity linking, relation inference identifies and extracts connections between entities, generating triples in the form of subject-predicate-object. These triples form the fundamental units of RDF graphs, as defined by the W3C RDF standard, where subjects and objects are resources (IRIs or blank nodes) and predicates denote relationships.[80] Models like REBEL, a sequence-to-sequence architecture based on BART, facilitate end-to-end relation extraction by linearizing triples into text sequences, enabling the population of graphs with over 200 relation types from unstructured input.[81] Graph population then integrates these triples into a cohesive structure, often adhering to vocabularies such as schema.org, which provides extensible schemas for entities like Dataset and Observation to enhance interoperability in knowledge graphs.[82]
A key challenge in knowledge graph construction is scalability, particularly when handling billions of triples across massive datasets.
For example, Wikidata grew to over 100 million entities by 2023, necessitating efficient algorithms for inference and resolution to manage exponential growth without compromising query performance.[83] Recent advancements, including large language models for joint entity-relation extraction, apply machine learning techniques to automate these steps while addressing noise in extracted data.[84]
Integration and Fusion Methods
Integration and fusion methods in knowledge extraction involve combining facts and entities derived from multiple heterogeneous sources to create a coherent, unified knowledge representation. These methods address challenges such as schema mismatches, redundant information, and inconsistencies by aligning structures, merging similar entities, and resolving discrepancies. The process ensures that the resulting knowledge base maintains high accuracy and completeness, often leveraging probabilistic models or rule-based approaches to weigh evidence from different extractors.[85]
Fusion techniques commonly include ontology alignment, which matches concepts and relations across ontologies to enable interoperability. For instance, tools like OWL-Lite Alignment (OLA) compute similarities between OWL entities based on linguistic and structural features to generate mappings.[86] Probabilistic merging extends this by treating knowledge as uncertain triples and fusing them using statistical models, such as supervised latent Dirichlet allocation, to estimate the probability of truth for each fact across sources.[87] These approaches prioritize high-confidence alignments, reducing errors in cross-ontology integration.
Conflict resolution during fusion relies on mechanisms like voting and confidence scoring to reconcile differing extractions.
Majority voting aggregates predictions from multiple extractors, selecting the most frequent assertion for a given fact, while weighted voting incorporates confidence scores (probabilities output by extraction models) to favor reliable sources.[88] For example, in knowledge graph construction, facts with conflicting attributes are resolved by thresholding low-confidence scores or applying source-specific weights derived from historical accuracy.[89]
Standards such as the Linked Data principles, outlined by Tim Berners-Lee in 2006, guide fusion by emphasizing the use of URIs for entity identification, dereferenceable HTTP access, and RDF-based descriptions to facilitate linking across datasets. The SILK framework implements these principles through a declarative link discovery language, enabling scalable matching of entities based on similarity metrics like string distance and data type comparisons.[90]
A prominent example is Google's Knowledge Vault project from 2014, which fused probabilistic extractions from web content with prior knowledge from structured bases like Freebase to construct a web-scale knowledge repository containing 1.6 billion facts, of which 271 million were rated as confident.[91] This system applied machine learning to propagate confidence across sources, achieving a 30% improvement in precision over single-source baselines by resolving conflicts through probabilistic inference.[92]
Applications and Examples
Entity Linking and Resolution
Entity linking and resolution is a critical step in knowledge extraction that connects entity mentions identified in text, such as person names, locations, or organizations, to their corresponding entries in a structured knowledge base, like Wikipedia or YAGO, while resolving ambiguities arising from multiple possible referents for the same mention.[93] This process typically follows named entity recognition (NER) from traditional information extraction methods and enhances the semantic understanding of unstructured text by grounding it in a verifiable knowledge source.[94]
The process begins with candidate generation, where potential knowledge base entities are retrieved for each mention using techniques such as surface form matching against Wikipedia titles, redirects, and anchor texts to create a shortlist of plausible candidates, often limited to the top-k most relevant ones to manage computational efficiency.[93] Disambiguation then resolves the correct entity by comparing the local context around the mention (such as surrounding words or keyphrases) with entity descriptions, commonly via vector representations like bag-of-words or cosine similarity, and incorporating global coherence across all mentions in the document to ensure consistency, for instance, by modeling entity relatedness through shared links in the knowledge base.[94]
Key algorithms include AIDA, introduced in 2011, which employs a graph-based approach for news articles by constructing a mention-entity bipartite graph weighted by popularity priors, contextual similarity (using keyphrase overlap), and collective coherence (via in-link overlap in Wikipedia), then applying a greedy dense subgraph extraction for joint disambiguation to achieve global consistency.[94] Collective classification methods, such as those in AIDA, extend local decisions by propagating information across mentions, outperforming independent ranking in ambiguous contexts through techniques like probabilistic graphical models or 
iterative optimization.[93]
Evaluation metrics for entity linking emphasize linking accuracy, with micro-F1 scores commonly reported on benchmarks like the AIDA-YAGO dataset, where AIDA achieves approximately 82% micro precision at rank 1, reflecting strong performance in disambiguating mentions from CoNLL-2003 news texts linked to YAGO entities.[94] These metrics account for both correct links and handling of unlinkable mentions (NILs), providing a balanced measure of precision and recall in real-world scenarios.[93]
In applications, entity linking enhances search engines by enabling semantic retrieval, where disambiguated entities improve query understanding and result relevance, as demonstrated in systems that integrate linking with entity retrieval to support entity-oriented search over large document collections.[95]
Domain-Specific Use Cases
In healthcare, knowledge extraction plays a pivotal role in processing electronic health records (EHRs) to identify and standardize medical entities, enabling better clinical decision-making and research. The Unified Medical Language System (UMLS) ontology is widely employed to map unstructured clinical text from EHRs to standardized concepts, facilitating the integration of diverse data sources into relational databases for analysis.[96] For instance, UMLS-based methods extract and categorize signs and symptoms from clinical narratives, linking them to anatomical locations to support diagnostic applications.[97] During the 2020s, AI-driven extraction techniques were extensively applied to COVID-19 literature, where models annotated mechanisms and relations from scientific papers to build knowledge bases that accelerated vaccine development and treatment insights.[98] These efforts, often leveraging entity linking to connect extracted terms to established biomedical ontologies, have demonstrated substantial efficiency gains, such as reducing manual annotation workloads by approximately 80% in collaborative human-LLM frameworks for screening biomedical texts.[99]
In the finance sector, knowledge extraction from regulatory reports like SEC 10-K filings involves sentiment analysis to detect linguistic indicators of deception, such as overly positive or evasive language, which aids in identifying potential fraud.[100] Relation extraction further enhances this by constructing graphs that model connections between financial entities, such as supplier-customer relationships or anomalous transaction patterns, to flag fraudulent activities in financial statements.[101] For example, contextual language models applied to textual disclosures in annual reports have achieved high accuracy in fraud detection by quantifying sentiment shifts and relational inconsistencies, improving regulatory oversight and risk assessment.[102] Such applications yield significant ROI, as automated 
extraction reduces the time and cost associated with manual audits, enabling proactive fraud prevention in large-scale financial datasets.
E-commerce platforms utilize knowledge extraction to derive product insights from customer reviews, constructing knowledge graphs that capture attributes, sentiments, and relations for enhanced recommendations. Amazon's approaches in the early 2020s, for instance, embed review texts into knowledge graphs using techniques like knowledge graph embedding and sentiment analysis, allowing the system to infer commonsense relationships between products and user preferences.[103] By 2023, review-enhanced knowledge graphs integrated multimodal data from Amazon datasets, improving recommendation accuracy by incorporating fine-grained features like aspect-based sentiments from user feedback.[104] This results in more personalized suggestions, boosting customer engagement and sales conversion rates through scalable, automated knowledge fusion from unstructured review corpora.
Evaluation Metrics and Challenges
Evaluation of knowledge extraction systems relies on a combination of intrinsic and extrinsic metrics to assess both the quality of extracted elements and their utility in broader applications. Intrinsic metrics focus on the direct performance of extraction components, such as precision, recall, and F1-score, which measure the accuracy of identifying entities, relations, and events against ground-truth annotations.[105] These metrics evaluate internal consistency and coverage, for instance, by calculating precision as the ratio of true positives to the sum of true and false positives, recall as true positives over true positives plus false negatives, and F1 as their harmonic mean.[105] In knowledge graph construction, additional intrinsic measures like mean reciprocal rank (MRR) and root mean square error (RMSE) assess embedding quality and prediction accuracy for links or numerical attributes.[105]
Extrinsic metrics, in contrast, gauge the effectiveness of extracted knowledge in downstream tasks, such as question answering or recommendation systems, where success is tied to overall task performance rather than isolated extraction fidelity.[105] For entity linking, common extrinsic metrics include Hits@K, which computes the fraction of correct entities ranked in the top K positions, and MRR, the average of the reciprocal ranks of true entities.[106] Hits@K is particularly useful for evaluating retrieval-based linking, as it prioritizes top-ranked results while ignoring lower ranks, with values ranging from 0 to 1 where higher indicates better performance.[106] These metrics highlight how well extracted entities integrate into knowledge bases for practical use, such as improving search relevance.[106]
Despite advances in metrics, knowledge extraction faces significant challenges, including data privacy concerns amplified by regulations like the General Data Protection Regulation (GDPR), enacted in 2018.
GDPR's principles of purpose limitation and data minimization require that personal data used in extraction processes align with initial collection purposes and be pseudonymized to reduce re-identification risks, particularly when AI infers sensitive attributes from unstructured text.[107] For instance, automated profiling in extraction can trigger Article 22 safeguards, mandating human oversight and transparency to protect data subjects' rights, though ambiguities in explaining AI logic persist.[107]
Hallucinations in large language models (LLMs) pose another critical challenge, where models generate fabricated facts during relation or entity extraction, undermining knowledge graph reliability. Studies highlight that LLMs exhibit factual inconsistencies when constructing knowledge graphs from text, often due to overgeneralization or incomplete world knowledge.[108] For example, benchmarks like HaluEval reveal response-level hallucinations in extraction tasks, prompting the use of knowledge graphs for grounding via retrieval-augmented generation to verify outputs.[108]
Bias issues further complicate extraction, stemming from underrepresentation in training datasets that skew results toward dominant demographics. In relation extraction datasets like NYT and CrossRE, women and Global South entities are underrepresented (11.8-20.0% for women), leading to allocative biases where certain relations are disproportionately assigned to overrepresented groups.[109] Representational biases manifest as stereotypical associations, such as linking women to "relationship" relations.
Mitigation strategies include curating diverse corpora for pre-training, which can reduce gender bias by 3-5% but may inadvertently amplify geographic biases if not multi-axial.[109]
Looking ahead, scalability remains a key challenge for real-time knowledge extraction, especially in resource-constrained environments, and edge computing integration is an active area of development as of 2025. Edge AI supports low-latency processing by deploying lightweight models on distributed devices, addressing bandwidth limitations in applications like autonomous systems where extraction must occur in milliseconds.[110] Advances in dynamic resource provisioning and hybrid scaling are expected to support scalable, privacy-preserving extraction at the edge, though challenges in hardware heterogeneity and model optimization persist.
Modern Tools and Developments
Survey of Established Tools
Established tools for knowledge extraction encompass a range of mature software suites that handle processing from unstructured text, semantic representations, and structured data sources. These tools, developed primarily before 2023, provide robust pipelines for tasks like entity identification, relation extraction, and data mapping, forming the backbone of many knowledge extraction workflows.[111][112][113]
In the domain of natural language processing (NLP) and information extraction (IE), the Stanford NLP suite stands as a foundational toolkit originating in the early 2000s, with its core parser released in 2002 and a unified CoreNLP package in 2010.[114] This Java-based suite includes annotators for part-of-speech tagging, named entity recognition (NER), dependency parsing, and open information extraction, enabling the derivation of structured knowledge from raw text through modular pipelines.[111] Widely adopted in academia and industry, it supports multilingual processing and integrates with Java ecosystems for scalable extraction.[115]
Complementing this, spaCy, an open-source Python library first released in 2015, emphasizes efficiency and production-ready NLP pipelines for knowledge extraction.[112] It offers pre-trained models for tokenization, NER, dependency parsing, and lemmatization, with customizable components for rule-based and statistical extraction methods.[116] spaCy's architecture allows rapid processing of large corpora, making it ideal for extracting entities and relations from documents in real-world applications.[112]
For semantic knowledge extraction, Protégé serves as a prominent ontology editor, with its modern version released in 2002 building on earlier prototypes from the 1980s and 1990s.[117] This free tool supports the development and editing of ontologies in OWL and RDF formats, facilitating the formalization of extracted knowledge into reusable schemas and taxonomies.[113] Protégé includes plugins for reasoning, visualization, and 
integration with IE outputs, aiding in the construction of domain-specific knowledge bases. Apache Jena, an open-source Java framework first released in 2000, specializes in handling RDF data for semantic extraction and storage.[118] It provides APIs for reading, writing, and querying RDF graphs using SPARQL, along with inference engines for deriving implicit knowledge from explicit extractions.[119] Jena's modular design supports triple stores and linked data applications, enabling the fusion of extracted triples into coherent knowledge graphs.[120] Addressing structured data extraction, Talend Open Studio for Data Integration, launched in 2006, functions as an ETL (extract, transform, load) platform with graphical job designers for mapping and transforming data.[121] It connects to databases, files, and APIs to extract relational data, applying transformations that can populate knowledge schemas or ontologies.[122] The tool's component-based approach supports schema inference and data quality checks, essential for integrating structured sources into broader knowledge extraction pipelines; however, the open-source version was discontinued in 2024.[123] These tools draw on established extraction methods, such as rule-based pattern matching and probabilistic models, to process diverse inputs.[111] To compare their capabilities, the following table summarizes key aspects:| Tool | Key Features | Supported Sources | Open-Source Status |
|---|---|---|---|
| Stanford NLP Suite | POS tagging, NER, dependency parsing, open IE pipelines | Unstructured text (multilingual) | Yes (GPL) |
| spaCy | Tokenization, NER, dependency parsing, lemmatization, customizable rule-based and statistical pipelines | Unstructured text (multilingual, extensible) | Yes (MIT) |
| Protégé | Ontology editing, OWL/RDF support, reasoning plugins | Ontology files, semantic schemas | Yes (BSD) |
| Apache Jena | RDF manipulation, SPARQL querying, inference engines | RDF graphs, linked data | Yes (Apache 2.0) |
| Talend Open Studio | ETL jobs, data mapping, schema inference, quality profiling | Databases, files, APIs (structured) | Yes (GPL), discontinued 2024 |
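
The structured-data path described above, in which relational rows are extracted and mapped onto a knowledge schema, can be illustrated with a minimal sketch. It uses only Python's standard library and does not represent the API of Talend, Apache Jena, or any other tool in the table. The `http://example.org/` namespace, the `person` table, and the helper names `rows_to_triples`, `escape_literal`, and `build_demo_db` are all assumptions introduced for illustration. Each row becomes a subject, each column a predicate, and each non-null cell a plain literal, serialized as N-Triples lines that an RDF store could then load.

```python
import sqlite3

# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_triples(conn, table, key_column):
    """Yield one N-Triples line per non-null, non-key cell of `table`.

    Assumes `table` and `key_column` are trusted identifiers, since they
    are interpolated into the SQL text rather than bound as parameters.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The primary key identifies the subject resource.
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue  # skip the key itself and missing values
            predicate = f"<{BASE}{table}#{column}>"
            yield f'{subject} {predicate} "{escape_literal(str(value))}" .'


def build_demo_db():
    """Create a small in-memory database with illustrative data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, "Ada", "London"), (2, "Kurt", None)],
    )
    return conn


if __name__ == "__main__":
    for line in rows_to_triples(build_demo_db(), "person", "id"):
        print(line)
```

In this sketch every value is emitted as a plain string literal and NULL cells produce no triple. A production mapping would typically add datatypes, turn foreign keys into links between resources, and reuse vocabulary from an existing ontology instead of generating predicates from column names.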