How to use NER and advanced NLP techniques in life sciences

Author(s): CapeStart

Originally published on Towards AI.

Overview

The field of life sciences is struggling with an explosion of data. Much of this key information, such as research articles, clinical trial reports, patient records, and even genomic sequences, exists in the form of unstructured text. Turning this vast text landscape into actionable insights poses a significant challenge. This is where natural language processing (NLP), and especially Named Entity Recognition (NER), comes into play.

Natural language processing is a field of artificial intelligence (AI) that focuses on building machines that can manipulate human language. NLP has advanced significantly in recent years – not only in understanding human language, but also in reading patterns in things like DNA and proteins, which have a structure similar to language.

Named Entity Recognition (NER)

The diagram below illustrates the NER process in detail.

Named entity recognition is an essential technique in NLP. Think of NER as a wizard that sifts through text to find and categorize specific “treasures” – named entities. This is an information extraction subtask. NER goes beyond simple word labeling and assigns contextually appropriate entity types to words or subwords.

Its main goal is to comb through unstructured text, identify specific spans as named entities, and then classify them into predefined categories. These categories typically include people's names, organizations, locations, dates, monetary values, quantities, and time expressions. In the life sciences in particular, predefined categories may also include medical codes. By converting raw text into structured information, NER facilitates tasks such as data analysis, information retrieval, and knowledge graph construction.

Consider the sentence: “J&J has received FDA approval for Janssen's Covid-19 vaccine in the United States in 2021.” The steps below walk through how an NER system would process this sentence.

How NER works: a step-by-step process

The NER process, although complex, can be divided into several key stages:

  1. Tokenization: The initial step is to divide the text into smaller units called tokens, which can be words, phrases, or even sentences. For example: “J&J”, “has”, “received”, “FDA”, “approval”, “for”, “Janssen”, “'s”, “Covid-19”, “vaccine”, “in”, “the”, “United”, “States”, “in”, “2021”, “.”
  2. Feature extraction/entity identification: For each token, linguistic features such as part-of-speech tags, word embeddings, and context are extracted. Alternatively, potential named entities are detected using linguistic rules, regular expressions, dictionaries, or statistical methods. This includes recognizing patterns such as capitalization (“Steve Jobs”) or specific formats.
  3. Entity identification and classification: The system identifies potential entities and classifies them into predefined categories. Starting from the standard entity types and extending them to the healthcare/pharmaceutical field (which often includes specific products and conditions), an NER system would likely identify:
  • “J&J” as an ORGANIZATION. This maps directly to the standard “organizations” category.
  • “FDA” (Food and Drug Administration) as another ORGANIZATION. This is also an organization that NER would classify under the same category.
  • “Covid-19” as a DISEASE or HEALTH CONDITION. Beyond “medical codes”, a system tuned for this domain would likely have a specific disease category that extends the standard list of entity types.
  • “Janssen” as a PRODUCT or DRUG, since here it names the vaccine. This would also be a domain-specific category for pharmaceutical products, extending the basic entity types to include specific items of interest in the industry, similar to identifying products in customer service analytics. (In other contexts, “Janssen” would be an ORGANIZATION, as it is also the name of a J&J subsidiary.)
  • “United States” as LOCATION. This is directly related to the “locations” category.
  • “2021” as DATE. This is directly related to the “date” category.

  4. Entity span identification: In addition to classification, NER also determines the exact beginning and end of each mention of an entity in the text. This is crucial for precise data extraction.
  5. Contextual analysis: Modern NER models are sophisticated enough to take the surrounding text into account to improve accuracy. For example, the context in “J&J released a new vaccine” helps the system recognize “J&J” as a company. Models like BERT and RoBERTa use contextual embeddings to capture the meaning of words based on context, which helps them deal with ambiguity and complex structures.
  6. Post-processing: The initial stages are followed by post-processing to refine the results. This may include resolving ambiguities, combining entities that consist of multiple tokens (such as “New York” being a single location entity), or using knowledge bases to obtain richer entity data.
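
To make the stages above more concrete, here is a minimal sketch of a dictionary- and regex-based recognizer applied to the example sentence. It is not how a production NER model works (those are typically statistical or Transformer-based); the gazetteer contents, the label names, and the helper names `tokenize` and `find_entities` are all assumptions made for illustration.

```python
import re

SENTENCE = ("J&J has received FDA approval for Janssen's Covid-19 vaccine "
            "in the United States in 2021.")

# Hypothetical gazetteer (surface form -> entity type); illustrative only.
GAZETTEER = {
    "J&J": "ORGANIZATION",
    "FDA": "ORGANIZATION",
    "Janssen": "PRODUCT",
    "Covid-19": "DISEASE",
    "United States": "LOCATION",
}
# Simplified pattern for four-digit years in the 1900s and 2000s.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def tokenize(text):
    """Step 1: split into word-like tokens, the clitic 's, and punctuation."""
    return re.findall(r"\w+(?:[&-]\w+)*|'s|[^\w\s]", text)


def find_entities(text):
    """Steps 2-4: detect candidates, classify them, and record exact spans."""
    candidates = []
    for surface, label in GAZETTEER.items():
        # Match the surface form only when it is not glued to other word characters.
        pattern = re.compile(r"(?<!\w)" + re.escape(surface) + r"(?!\w)")
        candidates += [(m.start(), m.end(), label) for m in pattern.finditer(text)]
    candidates += [(m.start(), m.end(), "DATE") for m in YEAR_PATTERN.finditer(text)]

    # Prefer longer matches and drop any candidate that overlaps an accepted span.
    accepted = []
    for start, end, label in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        if all(end <= s or start >= e for s, e, _ in accepted):
            accepted.append((start, end, label))
    return [(text[s:e], label, s, e) for s, e, label in sorted(accepted)]


if __name__ == "__main__":
    print(tokenize(SENTENCE))
    for surface, label, start, end in find_entities(SENTENCE):
        print(f"{surface!r:18} {label:13} [{start}:{end}]")
```

A lookup table like this cannot use context, which is exactly the gap that contextual models such as BERT are meant to close.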

The power of NER lies in its ability to understand and interpret unstructured text, adding structure and meaning to the vast amounts of text data we encounter.
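
One piece of that structure is the merging of multi-token entities described in the post-processing stage. Many NER models emit one label per token, often in a "BIO" scheme (B = beginning of an entity, I = inside, O = outside). The article does not name this scheme, so treat it as an assumption; the sketch below, with the hypothetical helper `bio_to_spans`, shows how such per-token labels could be merged into whole entities.

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A B- tag, or a stray I- tag with a different label, opens a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


if __name__ == "__main__":
    tokens = ["approved", "in", "New", "York", "by", "the", "FDA"]
    tags = ["O", "O", "B-LOCATION", "I-LOCATION", "O", "O", "B-ORGANIZATION"]
    print(bio_to_spans(tokens, tags))
```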

Beyond NER: advanced NLP techniques

Although NER is foundational, life sciences often require a more sophisticated understanding of language. Advanced NLP techniques, many of them powered by deep learning, enable complex tasks that complement NER.

Information extraction: NER is a key element, but information extraction also includes extracting structured information (such as relationships between entities) from unstructured text to populate databases or build knowledge graphs.
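
As a rough illustration of going from entities to relationships, the sketch below links adjacent entities in a sentence when a known trigger word appears between them and emits (subject, relation, object) triples. The trigger list and the helper name `extract_relations` are assumptions for this example; real relation extraction systems are usually learned models rather than keyword rules.

```python
# Hypothetical trigger words mapped to relation names; illustrative only.
TRIGGERS = {"inhibits": "INHIBITS", "treats": "TREATS", "causes": "CAUSES"}


def extract_relations(sentence, entities):
    """Build triples from adjacent entity pairs separated by a trigger word.

    `entities` is a list of (surface, label) pairs, e.g. the output of an NER step.
    """
    # Locate each entity mention (first occurrence only, for simplicity).
    located = sorted(
        (sentence.find(surface), surface)
        for surface, _label in entities
        if surface in sentence
    )
    triples = []
    for (start1, surf1), (start2, surf2) in zip(located, located[1:]):
        between = sentence[start1 + len(surf1):start2].lower().split()
        relation = next((TRIGGERS[w] for w in between if w in TRIGGERS), None)
        if relation:
            triples.append((surf1, relation, surf2))
    return triples


if __name__ == "__main__":
    sentence = "Imatinib inhibits BCR-ABL."
    entities = [("Imatinib", "DRUG"), ("BCR-ABL", "GENE")]
    print(extract_relations(sentence, entities))
```

Triples in this form are the kind of rows that could populate a database table or become edges in a knowledge graph.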

Question answering (QA): These systems can identify entities in user queries (using NER) and find appropriate answers in documents. QA systems can be multiple-choice or open-ended and provide natural language responses.

Summarization: This task shortens the text while retaining key information. Extractive summarization pulls out key sentences, while abstractive summarization paraphrases, potentially using words that are not in the original text. This is useful for condensing research articles or clinical notes.
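
The extractive variant is simple enough to sketch. The example below scores each sentence by the average corpus frequency of its non-stopword terms and keeps the top-scoring ones in their original order. The scoring rule, the tiny stopword list, and the function name `summarize` are assumptions chosen for brevity, not a method the article prescribes.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was",
             "for", "with", "on", "that", "this"}


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9-]+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average term frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


if __name__ == "__main__":
    text = ("The trial enrolled 200 patients. "
            "The vaccine reduced infection in patients, and the vaccine was "
            "well tolerated by patients. "
            "Results were published later.")
    print(summarize(text, n_sentences=1))
```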

Topic modeling: An unsupervised technique for discovering abstract topics in a collection of documents. It treats documents as mixtures of topics and topics as distributions over words (for example, Latent Dirichlet Allocation, or LDA). This allows you to identify dominant research topics.

Sentiment analysis: Classifies the emotional tone of the text (positive, negative, neutral). Understanding sentiment around entities identified by NER can provide deeper insight. This can be applied to patient reviews or social media discussions about treatment.
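
A minimal lexicon-based sketch shows the idea of three-way classification. The word lists, the one-word negation rule, and the function name `sentiment` are assumptions for illustration; modern systems generally rely on trained models rather than hand-written lexicons.

```python
import re

# Hypothetical mini-lexicons; illustrative only.
POSITIVE = {"effective", "improved", "relief", "helpful", "better"}
NEGATIVE = {"nausea", "worse", "painful", "ineffective", "headache"}
NEGATIONS = {"not", "no", "never"}


def sentiment(text):
    """Classify text as 'positive', 'negative', or 'neutral' by lexicon counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Flip the polarity when the previous word is a negation ("not effective").
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentiment("The new inhaler was effective and I feel better."))
    print(sentiment("The drug was not effective and gave me nausea."))
```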

Text generation (NLG): Creates human-like text. Although less directly tied to analyzing existing life science text, advanced models can generate draft reports or summaries.

Information retrieval: Finds the documents most relevant to a query, which is essential for searching extensive literature databases.
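
One classic way to rank documents against a query is TF-IDF weighting, sketched below in plain Python. The article does not say which retrieval method it has in mind, so TF-IDF, the smoothed IDF formula used here, and the function name `rank` are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return indices of matching documents, best match first."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    scored = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = sum(
            (tf[t] / len(tokens)) * math.log((1 + n_docs) / (1 + df[t]))
            for t in tokenize(query)
            if t in tf
        )
        scored.append((score, idx))
    # Highest score first; ties broken by original position; drop non-matches.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


if __name__ == "__main__":
    docs = [
        "BRCA1 mutations and breast cancer risk",
        "Statin therapy lowers cholesterol",
        "Breast cancer screening guidelines",
    ]
    for i in rank("BRCA1 breast cancer", docs):
        print(docs[i])
```

Rare query terms (here "BRCA1") contribute more to the score than terms that appear in many documents, which is why the first document outranks the third.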

Why life sciences need NLP and NER

The life sciences are drowning in data, most of which is locked in unstructured text documents. NLP and NER are crucial because they provide the means to:

Transform unstructured data: They serve as a bridge, transforming vast amounts of raw text information into structured, categorized forms that machines can easily process and analyze.

Accelerate research and discovery: Scientists can quickly sift through vast amounts of literature, identifying mentions of specific entities (genes, proteins, diseases) relevant to their research, accelerating data analysis.

Improve clinical care: Interpreting or summarizing complex electronic health records (EHRs) is becoming feasible. Isolating key information such as a patient's history, symptoms, treatments, and outcomes can improve decision-making. NER could potentially identify medical codes or other critical entities in these documents.

Improve knowledge management: NER and information extraction make it easier to build knowledge graphs that capture entities and their associations from scientific literature or clinical data.

Compliance and analysis support: It becomes possible to automate the tedious process of sifting through legal or regulatory documents to find relevant information.

Analyze biological/chemical sequences: Because sequences such as DNA and proteins have a language-like structure, some NLP techniques can potentially be applied to their analysis.
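
One common way to treat a sequence like text is to split it into overlapping k-mers, which then play the role of "words". The article does not specify a technique, so the k-mer approach and the helper name `kmers` below are assumptions for illustration.

```python
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping substrings of length k."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    dna = "ATGATGCGA"
    words = kmers(dna, k=3)
    print(words)
    # Counting k-mers gives a bag-of-words style representation of the sequence.
    print(Counter(words).most_common(2))
```

Once a sequence is expressed as a list of such tokens, text-oriented tools such as frequency counts, embeddings, or topic models can in principle be applied to it.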

Using NER and advanced NLP: use cases in life sciences

Based on the capabilities described above, here are some potential applications in the life sciences domain:

Biomedical entity recognition: Identification and classification of life sciences specific entities such as genes, proteins, diseases, drugs, chemical compounds, and procedures from scientific articles, patents, or clinical texts. This leverages NER's core capabilities for domain-specific entities.

Relation extraction from literature: Automatic identification of relationships between biomedical entities mentioned in scientific articles, e.g. drug-gene interactions, disease-symptom associations, protein-protein interactions. This relies on information extraction techniques that build on NER.

Clinical text analysis: Extract structured information from clinical notes, discharge summaries, and other EHR elements, including patient demographics, symptoms, diagnoses, medications, lab results, and treatment plans. A key part of this may be NER identifying medical codes.
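
Medical codes often follow a fixed format, which makes them a natural fit for the regular-expression approach mentioned earlier. The sketch below looks for strings shaped like ICD-10 diagnosis codes. The choice of ICD-10, the simplified pattern, and the function name `find_codes` are assumptions for illustration; the pattern checks shape only, does not validate codes against the official code set, and will also match look-alikes such as "B12".

```python
import re

# Simplified ICD-10-style shape: a letter, a digit, an alphanumeric character,
# then an optional decimal point followed by up to four alphanumerics.
ICD10_LIKE = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def find_codes(note):
    """Return (code, start, end) for every ICD-10-shaped string in a clinical note."""
    return [(m.group(), m.start(), m.end()) for m in ICD10_LIKE.finditer(note)]


if __name__ == "__main__":
    note = "Dx: E11.9 and I10; follow-up for J45.909."
    for code, start, end in find_codes(note):
        print(code, start, end)
```

In practice, matches like these would be checked against a reference code list before being stored as structured data.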

Summarization of scientific literature and clinical studies: Automatically generate summaries of complex research articles or study results using summarization techniques.

Identification of research trends: Using topic modeling to discover emerging and dominant topics in large corpora of scientific publications.

Powering biomedical question answering systems: Creating systems that can answer specific questions asked by researchers or clinicians by searching large databases of scientific or clinical texts.

Analysis of patient opinions and social media: Using sentiment analysis to assess patient perceptions of treatments, medications, or healthcare services, potentially linked to specific entities.

Sequence analysis: Using techniques such as autoencoders to analyze patterns or detect anomalies in biological sequences.

Conclusion

Named entity recognition and advanced natural language processing techniques are not just technological trends; they are becoming essential capabilities for navigating the data-rich landscape of life sciences. By transforming unstructured text into meaningful, structured knowledge, NER and NLP accelerate research, improve patient care, and drive innovation.

While there are challenges related to domain specificity, ambiguity, and data sparsity, ongoing advances, particularly in deep learning and Transformer models, keep improving performance and expanding capabilities. Leveraging these powerful tools enables researchers, clinicians, and organizations to extract hidden gems from text, gain deeper knowledge, and ultimately contribute to scientific discoveries and better health outcomes. NLP continues to evolve, and in the life sciences, adopting these technologies is key to unlocking a deeper understanding of biology.

Originally published at https://capestart.com on February 10, 2026

Published via Towards AI