Case Study

Text Extraction from Legal Documents

At a Glance




Develop a solution to algorithmically extract key terms from contracts, financial documents, and legal documents for LinkSquares’ Smart Value contract analysis platform, and deliver new, reliable, enterprise-grade machine learning solutions that integrate into the existing software product.


A fully integrated end-to-end framework built to run in a flexible environment, such as a cloud instance. A custom natural language processing solution works in conjunction with supervised learning algorithms to extract the key terms. The resulting classification accuracy and expanded analytics functions allowed LinkSquares to roll out faster and secure new clients.

Expertise & Technology

For many organizations, realizing the full capability of artificial intelligence begins with exploring the options and creating an effective roadmap. LinkSquares seeks to disrupt the legal and finance industries by providing an automated, software-based solution that streamlines contract analysis. The company grew quickly and needed to optimize its existing solution in order to scale its services. Because AI can create new offerings and products and can increase the accuracy and efficiency of services, this case study considers deploying machine learning and natural language processing on text documents.

Enabling faster contract discovery

Companies experiencing rapid growth often lack the bandwidth to track each line of every contract, service agreement, or legal document before it is executed. Even in the most carefully reviewed agreements, some information is forgotten as soon as the contract is signed. Once the business has matured and due diligence projects arise (e.g., when a law changes, an agreement expires, or an acquisition takes place), companies must conduct detailed reviews of all signed contracts and identify specific terms within them.

LinkSquares’ founders experienced the painful reality of reviewing existing legal contracts firsthand when their previous employer underwent an acquisition. “We manually searched through all of our existing contracts to identify privacy language and other information crucial to legally moving clients to a new infrastructure provider for our software service,” said Vishal Sunak, Co-Founder & CEO at LinkSquares. “We found that the language deviated from contract to contract, making the review process of thousands of agreements both timely and frustrating.” The team found that existing software solutions already helped companies handle the pre-signature workflow efficiently: contract creation, terms negotiation, and internal workflow. However, the industry lacked a software solution to help companies mine existing contracts for information. LinkSquares saw this gap as an opportunity to develop software that helps customers with post-signature contract analysis.

Working to find a machine learning solution

Even as its SaaS offering on AWS grew, and even with the ability to quickly mine contract data in a cloud-based environment, the LinkSquares team still needed a scalable solution for identifying and classifying legal language. Initially, the team built a searchable contract database, but as the company expanded, it required a more sophisticated approach and an automated way to extract key contract metadata. LinkSquares approached SFL Scientific about devising a solution to algorithmically extract key terms from various text documents.

SFL Scientific builds deep relationships with each client to understand the client’s challenges and its short- and long-term vision for using and deploying artificial intelligence and machine learning. We worked closely with LinkSquares to understand and build a data strategy and solution architecture suitable for their use case and timeline. “LinkSquares built its initial prototype using SQL running on AWS, but it was not using any machine learning technology,” said Michael Luk, CTO, SFL Scientific. “We learned about the team’s vision to process terms and language automatically and understood the business pain points. We proposed building a custom machine learning solution to help the team scale and improve accuracy.” Approximately one month after our initial engagement, data scientists at SFL Scientific conducted a proof of concept showing the LinkSquares team how they could deploy a scalable, automated analysis solution using natural language processing and machine learning.

Solution development & testing

SFL Scientific’s task was to provide LinkSquares with an algorithm to perform two main operations:

  1. Extract key terms from a legal document.
  2. Classify tokenized text into pre-defined categories.

The algorithm that we delivered was then deployed through AWS, which ran the code on demand: whenever a document was uploaded, the code was automatically launched. With AWS, the whole deployment into production was streamlined and made easily accessible to all LinkSquares employees who needed access to this asset. The text extraction algorithm consisted of three main steps:

  • Feature engineering
  • Modeling, model stacking & ensemble methods
  • Post-processing


Feature Engineering

First, the algorithm tokenizes the raw text of the legal documents using a regular expression tokenizer, which simply means that each word of the text is parsed and stored as an independent observation. Following this, hundreds of features are created from the tokenized text in three ways: rule-based features, token-based features, and sequence-level classes as features.

Rule-based features are created by matching the text against a “fuzzy dictionary” of predefined, known smart term values. Each hard-coded rule should translate to a known class or entity. For example, North American states are hard-coded to the “Governing Law” class for documents classified as having that category. The classes determined from this hard-coded set of rules are saved as features for generating the model.

Token-based features are generated on a per-token basis, based on knowledge of the predefined, known smart terms. The features fall into three general categories: token-level [Is this word a noun?], sentence-level [Is this token the first token in a sentence?], and document-level [Is this token found in a particular section?]. Other examples of token-based features include questions such as “Is the token capitalized?”, “How many letters ‘A’ or ‘B’, etc., are there in the token?”, “How long is the token?”, and “Is the next word a known smart term?”.

Sequence-level features are generated by predicting the class of each token using several sequence-level machine learning models, such as Conditional Random Fields, Hidden Markov Models, N-gram models, and neural networks. Just as the rule-based classes are saved as features, the classes determined by these sequence-level models are also saved as features. This feature engineering forms the basis of the modeling approach and is applicable across many natural language processing applications that analyze unstructured text data.
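As an illustration only, the tokenization, rule-based, and token-based steps described above might be sketched as follows. This is a minimal sketch using the Python standard library, not the delivered implementation: the dictionary entries, the token pattern, the similarity cutoff, the "O" (no class) label, and all function names are assumptions made for the example, and part-of-speech and document-level features are omitted.

```python
import re
from difflib import get_close_matches

# Hypothetical "fuzzy dictionary": known values mapped to a smart-term class.
FUZZY_DICTIONARY = {
    "delaware": "Governing Law",
    "massachusetts": "Governing Law",
    "california": "Governing Law",
}

# Assumed token pattern: runs of word characters, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def rule_based_class(token, cutoff=0.85):
    """Return the class of the closest dictionary entry, or "O" if none is close."""
    match = get_close_matches(token.lower(), list(FUZZY_DICTIONARY), n=1, cutoff=cutoff)
    return FUZZY_DICTIONARY[match[0]] if match else "O"


def token_features(tokens, i):
    """Build a few token-level and sentence-level features for tokens[i]."""
    token = tokens[i]
    has_next = i + 1 < len(tokens)
    return {
        "is_capitalized": token[:1].isupper(),
        "length": len(token),
        "count_a": token.lower().count("a"),
        "is_sentence_start": i == 0 or tokens[i - 1] in {".", "!", "?"},
        "next_is_known_term": has_next and rule_based_class(tokens[i + 1]) != "O",
        "rule_class": rule_based_class(token),  # rule output saved as a feature
    }


# The misspelling "Delawere" is deliberate, to exercise the fuzzy match.
text = "This Agreement is governed by the laws of Delawere. Notices go to Boston."
tokens = tokenize(text)
features = [token_features(tokens, i) for i in range(len(tokens))]
```

In this sketch, the fuzzy match lets a misspelled state name still map to the “Governing Law” class, and that rule output is stored alongside the other per-token features so that a downstream model can weigh it rather than trust it outright.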


Modeling, Model Stacking & Ensemble Methods

Since no single model or rule can guarantee that a token belongs to a specific class, a model stacking ensemble technique was implemented to better predict the class of a token. XGBoost, a gradient boosted decision tree-based model, was implemented as a meta-classifier, which uses the class predictions from the hard-coded rules and sequence-level models as features. The XGBoost meta-classifier also uses the hundreds of token-based features as predictors and is trained against human-annotated ground truth, that is, manually tagged data. XGBoost assigns a probability score to each class prediction, so a probability threshold was tuned on a holdout set to maximize the F-measure, and that threshold determines the final class.
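The threshold selection described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it treats a single class in a one-vs-rest fashion, the holdout labels and probabilities are made-up values standing in for meta-classifier output, and the function names are hypothetical.

```python
def f_measure(y_true, y_pred, beta=1.0):
    """Compute the F-measure (F1 when beta is 1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, probabilities):
    """Return the (threshold, score) pair that maximizes the F-measure."""
    best = (1.0, 0.0)
    # Every distinct probability is a candidate cut-off.
    for threshold in sorted(set(probabilities)):
        y_pred = [p >= threshold for p in probabilities]
        score = f_measure(y_true, y_pred)
        if score > best[1]:
            best = (threshold, score)
    return best


# Made-up holdout data: 1 means the token truly belongs to the class.
holdout_labels = [1, 0, 1, 1, 0, 0, 1, 0]
holdout_probs = [0.92, 0.40, 0.75, 0.55, 0.60, 0.10, 0.85, 0.30]

threshold, score = best_threshold(holdout_labels, holdout_probs)
print(f"threshold={threshold:.2f}, F1={score:.3f}")
```

Tuning the cut-off on held-out data rather than on the training set keeps the chosen threshold from simply memorizing the examples the meta-classifier was fitted on.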

Post-Processing

Once the classes for each token were predicted via the XGBoost meta-classifier, the predictions were further cleaned. Contiguous tokens were concatenated, producing a more robust and homogeneous output. For example, dates were formatted as Month/Day/Year rather than left as separate entities for each value, making the data easier to digest.
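A minimal sketch of this clean-up step, assuming per-token class predictions are already available, is shown below. The “Effective Date” class name, the "O" label for unclassified tokens, the accepted date formats, and the function names are assumptions for the example.

```python
from datetime import datetime
from itertools import groupby


def merge_contiguous(tokens, classes, outside="O"):
    """Join runs of adjacent tokens that share the same predicted class."""
    spans = []
    for label, group in groupby(zip(tokens, classes), key=lambda pair: pair[1]):
        if label != outside:
            spans.append((label, " ".join(token for token, _ in group)))
    return spans


def normalize_date(text):
    """Format a date such as 'January 5 , 2021' as Month/Day/Year."""
    cleaned = " ".join(text.replace(",", " ").split())
    for fmt in ("%B %d %Y", "%b %d %Y"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return text  # leave unparseable values untouched


tokens = ["Effective", "as", "of", "January", "5", ",", "2021",
          "under", "New", "York", "law"]
classes = ["O", "O", "O", "Effective Date", "Effective Date", "Effective Date",
           "Effective Date", "O", "Governing Law", "Governing Law", "O"]

results = [
    (label, normalize_date(value) if label == "Effective Date" else value)
    for label, value in merge_contiguous(tokens, classes)
]
print(results)
```

Values that cannot be parsed are returned unchanged, so a formatting failure never discards an extracted term.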

A thorough exploratory data analysis is critical, and inspection of the available text and document data must precede solution design. Understanding the variability, structure, terms, meaning, and other parameters directly affects the final approach and solutions.

The NLP algorithm developed by SFL Scientific revolutionized the post-signature contract review process for LinkSquares. The final machine learning solution enabled the LinkSquares software platform to automatically run the code on thousands of documents in seconds. Whenever a document was uploaded, the machine learning code launched automatically. In validation, the results showed a dramatic reduction in the time spent reviewing each document and, eventually, better tagging and parsing accuracy than the human auditors.

Saving time, resources, and headaches

LinkSquares placed the developed solution into production. Working with SFL Scientific to take advantage of machine learning and NLP has further emboldened LinkSquares as the team develops cutting-edge solutions in the cloud. “It’s been fantastic engaging with SFL Scientific as its team are experts in the AI space,” said Alexander, CTO, LinkSquares. “They understand the business challenges we’re trying to solve and they’ve given us the guidance we need to use new technology to tackle these challenges. I think of them like they’re a part of our team. They’re a valued partner.” The team plans to explore additional technologies that it can use to drive further innovation in its industry. “We’re excited about the future of our offering and how we can help legal and finance teams eliminate manual reviews of files,” said Sunak. “We’re excited to build out more AI and take advantage of new services to continue exploring what’s possible. For over two years, we’ve developed a strong relationship with SFL Scientific and leveraged their skills to develop and deploy machine learning in our systems.”
