Information Extraction from Text

At a Glance




Save the company money by automatically storing information in a database, and provide a customizable platform for future text extraction.


Developed in Python, the natural language processing (NLP) models determine the layman prescription instructions and the dispensed days' supply, with every model scoring 92% or above.

Expertise & Technology

Business Challenge

Over the last several years, members of the healthcare industry have faced the challenge of digitizing their physical databases. The transition has become a necessity because consumers expect information to be consistent and available. Companies benefit greatly from access to databases of organized information such as drug dosage, frequency, and recurring demographics. Manually extracting information from text normally requires many costly human hours. Documents can be written in over ten languages, which adds a significant translation cost for the company. Real-world data is also messy and unstructured, and it often contains missing values or inconsistencies. All of these aspects pose problems for a human reader, but they are considerably more manageable for a machine.

SFL Scientific’s Approach

SFL Scientific’s goal is twofold: to save the company money by automatically storing information in a database, and to provide a customizable platform for future text extraction. The classes most commonly extracted are drug name, dosage frequency, drug type, and container size, in over ten European languages. The text is first tokenized, meaning it is broken up into words, symbols, phrases, or other meaningful groupings referred to as tokens. Those tokens are passed to a machine learning algorithm that generates features based on capitalization, parts of speech, vowel groupings, and so on. This feature-based model was augmented by ensembling it with several sequence labeling and classification methods, such as hidden Markov models, conditional random fields, n-gram models, and maximum entropy text classifiers. Together, these pattern recognition methods extract only the pertinent information from the documents despite their varying structures. Indeed, out of several pages of text, typically only a handful of tokens are important in these particular documents. The result is a model that can parse text and upload the extracted information into a database. SFL Scientific has chosen to develop a web API for easy access to this information.
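To make the tokenization and feature-generation steps more concrete, here is a minimal Python sketch. It is an illustration only, not SFL Scientific’s implementation: the token pattern, the function names `tokenize` and `token_features`, the feature names, and the sample sentence are all assumptions made for this example. Part-of-speech features are left out because they would require a trained tagger, and the vowel-grouping feature only counts unaccented vowels.

```python
import re

# Assumed token pattern: decimal numbers, words (with internal hyphens or
# apostrophes), or any single non-space symbol. Illustrative only.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|\w+(?:[-']\w+)*|[^\w\s]")

# Simplified vowel set; accented vowels in other languages are not covered.
VOWEL_RE = re.compile(r"[aeiouy]+")


def tokenize(text):
    """Break raw text into word, number, and symbol tokens."""
    return TOKEN_RE.findall(text)


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i.

    Feature dictionaries of this kind are a common input format for
    sequence labelers such as conditional random fields.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        # True for tokens such as "2" or "2.5".
        "is_numeric": tok.replace(".", "", 1).replace(",", "", 1).isdigit(),
        # Number of consecutive-vowel groups in the token.
        "vowel_groups": len(VOWEL_RE.findall(tok.lower())),
        # Neighboring tokens give the model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


if __name__ == "__main__":
    sample = "Take 2 tablets of Ibuprofen daily."
    tokens = tokenize(sample)
    for i, tok in enumerate(tokens):
        print(tok, token_features(tokens, i))
```

Each token becomes a dictionary of features, and the sequence of dictionaries for a document could then be handed to a sequence labeler that decides which tokens carry a drug name, a dosage frequency, and so on.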

Business Value

The benefits of digitizing pharmaceutical records extend to both the company and the consumer. With this product, pharmaceutical companies spend significantly less time finding and storing information. The software automates the work of reading through texts, saving thousands of human hours. The most common application is the extraction of dates, companies, people, and amounts from books, pamphlets, and more. Because the product is customizable, key information can be extracted from essentially any document.
