Recognition and normalization of named entities in scientific text
Highlighting in text corresponding to the entity classes.
Up-to-date information about biomedical entities such as genes, proteins, diseases or drugs is often found not in structured databases but in scientific text. For specific information retrieval or information extraction, recognizing these terms and normalizing them to database entries (e.g. gene names to ENTREZ-GENE) or structured vocabularies/ontologies (e.g. GO/MeSH/UMLS) is a prerequisite. Normalization requires dictionaries generated from these sources, including direct mappings. As databases and ontologies evolve rapidly, automated updating and processing is needed to generate comprehensive and specific dictionaries. The high ambiguity of terms and acronyms used in the life science domain further complicates precise recognition.
Challenges for named entity recognition in biomedical text
Scientific publications found in abstract databases, full-text journals or patents are the main and most up-to-date information source, but the amount of text is overwhelming for most life science areas. Recognition of life science terminology is a key prerequisite for automatic information retrieval and information extraction. Huge and complex terminologies with many synonymous expressions, ambiguous terms and a constant stream of new names and classes make named entity recognition a real challenge. ProMiner is a tool for specific terminology recognition and addresses several fundamental issues in named entity recognition in the life sciences:
- ProMiner can handle voluminous dictionaries, complex thesauri and large controlled vocabularies derived from ontologies
- regularly updated dictionaries through automatic curation followed by a manual evaluation process
- mapping of synonyms to reference names and data sources
- context-dependent disambiguation of biomedical terms and resolution of
- specific handling of common English word synonyms
- spelling variants of expressions in the source dictionary can be recognized
- high speed tagging and parallel workflow for multiple dictionaries
- incorporation of regular expressions (e.g. for the recognition of SNP rs numbers)
- full text annotation in XML, HTML or PDF format
- patent annotation
Recognized entities can be indexed, ranked and linked to other data.
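The dictionary-based part of this list (synonym mapping to reference names and data sources, tolerance of spelling variants) can be illustrated with a minimal Python sketch. It is not ProMiner code: the function names, the identifier format and the small dictionary are assumptions made for this example, and the matching is a plain longest-match lookup without any disambiguation.

```python
import re

# Illustrative dictionary: synonym -> (reference name, source identifier).
# Entries and the "ENTREZ-GENE:<id>" format are assumptions for this sketch,
# not actual ProMiner dictionary content.
DICTIONARY = {
    "TP53": ("TP53", "ENTREZ-GENE:7157"),
    "p53": ("TP53", "ENTREZ-GENE:7157"),
    "tumor protein p53": ("TP53", "ENTREZ-GENE:7157"),
    "BRCA1": ("BRCA1", "ENTREZ-GENE:672"),
    "breast cancer 1": ("BRCA1", "ENTREZ-GENE:672"),
}


def normalize(term):
    """Lower-case and drop hyphens/whitespace so simple spelling variants collide."""
    return re.sub(r"[\s\-]+", "", term.lower())


# Index the dictionary by normalized synonym.
INDEX = {normalize(synonym): entry for synonym, entry in DICTIONARY.items()}


def tag(text, index=INDEX, max_words=3):
    """Return (start, end, surface form, reference name, identifier) tuples.

    At each token position the longest span of up to max_words tokens
    that is found in the index wins.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            entry = index.get(normalize(text[start:end]))
            if entry is not None:
                hits.append((start, end, text[start:end]) + entry)
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits


for hit in tag("Mutations in p53 and BRCA-1 were studied."):
    print(hit)
```

In this sketch the variant "BRCA-1" is mapped to the same reference entry as "BRCA1" because both normalize to the same key. A production system would additionally need context-dependent disambiguation, which is deliberately left out here.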
Proof of Performance
Results in the international “critical assessment of text mining in biology” (BioCreAtIvE I and II).
The performance of ProMiner in recognizing gene and protein names was tested in
the international “Critical Assessment of Text Mining in Biology” (BioCreAtIvE I and
BioCreAtIvE II). ProMiner was benchmarked against other industrial and academic
named entity recognition tools. Updated and newly generated dictionaries
are continually evaluated in industrial applications.
Indexing machinery for fast indexing of huge document resources
Customer feedback on ProMiner:
- “We are amazed about its speed and ability to work with large input files.”
- “…impressive to get the combination of information in text with enriched background and experimental data.”
Module for named entity recognition in a larger workflow for information extraction
- Java module with defined input and output streams
- an annotator service for named entities in the Unstructured Information Management Architecture (UIMA) framework
- already integrated in the TEMIS - BER information extraction environment software
Content generation for the interpretation of large scale experimental data
- simple output file to fill/supplement database content
- linkage to other data is easily possible through the provided mapping to databases or controlled vocabularies
Relation to experimental data, interaction databases or proprietary data through the provided mapping.
- gene and protein name dictionaries for various organisms:
- on request: yeast, fly, rat,
- Gene Ontology dictionary
- MeSH disease dictionary
- organism name dictionary
- drug name/metabolite dictionary
Dictionary independent recognition
While part of the life science terminology can be found with the help of dictionaries, in some entity classes it is not possible to enumerate all names. Examples are IUPAC names or SNP rs numbers. Here, we offer other techniques integrated as plugins into ProMiner:
- machine learning based
  - IUPAC recognition
  - SNP recognition
- regular expression based
  - rs number recognition
  - chromosomal location
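As an illustration of the regular expression based approach, the following Python sketch recognizes SNP rs numbers and cytogenetic chromosomal locations. The two patterns are assumptions written for this example and are not the expressions used by ProMiner; they cover only the most common notations (e.g. "rs" followed by digits, and bands such as 17p13.1 or Xq28).

```python
import re

# rs number: "rs" followed by one or more digits (assumed notation).
RS_NUMBER = re.compile(r"\brs\d+\b")

# Cytogenetic location: chromosome 1-22, X or Y, arm p or q, band, optional sub-band.
CHROMOSOMAL_LOCATION = re.compile(
    r"\b(?:[1-9]|1\d|2[0-2]|X|Y)[pq]\d{1,2}(?:\.\d{1,2})?\b"
)


def find_patterns(text):
    """Return (start, end, entity class, matched text) tuples sorted by position."""
    hits = []
    for label, pattern in (("SNP", RS_NUMBER), ("LOCATION", CHROMOSOMAL_LOCATION)):
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


for hit in find_patterns("The SNP rs429358 lies at 19q13.32; TP53 maps to 17p13.1."):
    print(hit)
```

Pattern-based recognition of this kind needs no dictionary, but it cannot normalize a match to a database entry on its own; that step would still require a lookup against the corresponding source.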
Technical Specification and Parameter settings
- ProMiner is available for UNIX/Linux and Microsoft Windows
- a scheduler for
- ProMiner supports ASCII text, MEDLINE format, XML, HTML and PDF full text
- output formats: meta-information, XML-tagged text and HTML
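To give an idea of what XML-tagged text output can look like, the following Python sketch wraps recognized spans in inline tags. The element name `entity`, the attribute `id` and the identifier format are hypothetical choices for this illustration; the actual ProMiner output schema is not described here.

```python
from xml.sax.saxutils import escape, quoteattr


def to_tagged_xml(text, annotations):
    """Wrap annotated spans in inline <entity> tags (hypothetical tag name).

    annotations: non-overlapping (start, end, identifier) tuples.
    Text outside and inside the tags is XML-escaped.
    """
    parts = []
    position = 0
    for start, end, identifier in sorted(annotations):
        parts.append(escape(text[position:start]))
        parts.append(
            "<entity id=%s>%s</entity>" % (quoteattr(identifier), escape(text[start:end]))
        )
        position = end
    parts.append(escape(text[position:]))
    return "".join(parts)


print(to_tagged_xml(
    "p53 & BRCA1",
    [(0, 3, "ENTREZ-GENE:7157"), (6, 11, "ENTREZ-GENE:672")],
))
```

Keeping the identifier as an attribute on the tag preserves the mapping to the data source, so the tagged text can be linked to other data later.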
Annotation of PDF
The increasing number of electronically available full text publications makes it
possible to process these documents and annotate the knowledge stored in them.
We offer a special PDF plugin, integrated in the ProMiner software, for the
annotation of PDF documents. The annotations are written directly into the
PDF output.
For semantic search and visualization, we offer the semantic search engine SCAIView.
Further information about SCAIView: www.scai.fraunhofer.de/scaiview.