Spot Model for Extracting Merchant Names from Bank Statements
We trained a machine learning model to identify parties.
Transforming raw data into meaningful insights is a challenge.
We wanted to extract merchant identity from bank statements but we had two problems:
- A merchant can appear under many names (e.g., Swiggy appears as both Swiggy and Bundl Technologies), and
- A merchant can fall under many categories (vendor, customer, restaurant, etc.).
The current version of the Spot model addresses the first problem and the next version will address the second problem.
The Spot model transforms raw, unstructured, and messy data into meaningful insights. The images below show some examples of how merchant names can appear in bank statements. Not only are such narrations a challenge for automatic analysis tools, they are also hard for a human reader to make sense of. To avoid compromising anyone’s identity, the images show a table with the relevant data rather than the bank statements themselves.
Spot the party name in the Narration column.
The same merchant is expressed with different names or in different formats
Our goal is to analyze transactions automatically, so our customers don’t have to spend hours staring at bank statements. Grouping the transactions from or to a single merchant is an important step toward this goal. Once transactions are grouped, questions like how much money flows into a merchant’s account during a given month or year become easy to quantify and answer.
A solution like the Spot model can be integrated into many open banking use cases.
Our model doesn’t just label transactions so that they can later be grouped to answer different questions. It also gives our customers the power to edit and customize these labels. Each customer benefits from an ocean of data that is much larger than their own. Some example use cases follow.
- In banking, the model can identify transactions with recurring merchants and it can identify subscriptions to build better customer engagement tools.
- Businesses that offer buy now, pay later can use the model to improve customer risk verification and streamline their customer approval process.
- In lending, the model can accurately identify sources and sinks of money and highlight a borrower’s liabilities. As a result, this can reduce credit risk and help lenders confidently make lending decisions.
- A loan broker can use the model to improve borrower screening and match the right borrowers to the right lenders with more efficiency.
- In personal finance, the model can identify earning and spending patterns. And these patterns can inform the design process of sustainable habit-building tools.
- Accountants and auditors can use a tool built with the model to reconcile banking transactions with accounting entries. Consequently, they can save time and improve productivity.
The Spot model uses a combination of techniques to transform raw data into meaningful insights.
The model is based on pre-defined rules and machine learning. More specifically, the model (built in Python) uses regular expressions and the spaCy library to generate insights. The input data can be broken down into two parts: somewhat structured and very unstructured. Since patterns can be extracted from the former, regular expressions are applied to the input before it is fed into a natural language processing model. (This summary understates the effort needed to clean the data before applying the machine learning algorithm.)
For example, regular expressions can be used to extract UPI handles, card numbers, and IFS codes. But for some narrations, the regular expressions produce noisy output that doesn’t make sense at first glance. This is why combining another technique with regular expressions is critical. The machine learning approach targets the very unstructured part of the input data. In other words, the same narration can contain both somewhat structured and unstructured data, and the Spot model combines the two techniques to extract valuable information from it.
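As a rough illustration of the rule-based step, the sketch below pulls UPI handles, IFS codes, and card numbers out of a narration with regular expressions and keeps whatever is left over for the machine learning step. The patterns, the `extract_structured` function, and the sample narrations are assumptions made for this example; they are not the Spot model's actual rules.

```python
import re

# Illustrative patterns only (assumptions, not the Spot model's real rules).
PATTERNS = {
    # UPI handle: local part, "@", then a provider handle such as "icici".
    "upi_ids": re.compile(r"\b[\w.\-]{2,}@[A-Za-z]{2,}\b"),
    # IFS code: four letters, a zero, then six letters or digits.
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    # Card number: 16 characters, possibly masked with X, ending in 4 digits.
    "cards": re.compile(r"\b[0-9X]{12}[0-9]{4}\b"),
}


def extract_structured(narration):
    """Extract the somewhat structured parts of a narration.

    Returns a dict with one list per pattern plus "raw_text", the text
    that remains after the matched pieces have been removed.
    """
    result = {}
    remaining = narration
    for field, pattern in PATTERNS.items():
        result[field] = pattern.findall(remaining)
        # Remove the matches so later steps only see unexplained text.
        remaining = pattern.sub(" ", remaining)
    # Collapse common separators (whitespace, "/", "-", ":") into single spaces.
    result["raw_text"] = " ".join(re.split(r"[\s/\-:]+", remaining)).strip()
    return result
```

For the made-up narration `UPI/123456789012/Payment/swiggy@icici/ICIC0001234`, this returns `swiggy@icici` as the UPI handle, `ICIC0001234` as the IFS code, and `UPI 123456789012 Payment` as the remaining text. That remainder is the kind of unstructured leftover a natural language processing model would then have to interpret.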
Input: narration from a bank statement
Output: JSON response with bank name, card number, IFS code, pre-defined keywords, name of merchant/organization/location, phone number, UPI IDs, and the text remaining after basic information has been extracted from the narration
As of January 2021, the Spot model has 86% accuracy.
This description may make it seem like the Spot model was easy to build, but a lot of trial and error went into it. To make our process reliable, we split our data into training and testing sets. In other words, the data used to train the model is separate from the data used to test it. At Inkredo, we believe experimentation should never stop, so we will keep improving this model and coming up with new ways to better serve our customers.
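The idea of keeping training and testing data separate can be sketched in a few lines. The post does not state the split ratio or how accuracy is measured, so the 80/20 split, the exact-match accuracy definition, and the names `train_test_split` and `accuracy` below are assumptions for illustration only.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle labeled examples and split them into disjoint train/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of test narrations whose predicted label equals the expected one.

    `predict` is any callable mapping a narration to a label; each item in
    `test_set` is a (narration, expected_label) pair.
    """
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(1 for narration, expected in test_set
                  if predict(narration) == expected)
    return correct / len(test_set)
```

Because the test examples never reach the training step, the accuracy measured on them reflects how the model behaves on narrations it has not seen before.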
Example Use Case
If a borrower’s bank statement includes transactions like the ones shown in the image below, the lender can use the Spot model to understand how, and how often, Paytm is used.
Example input data
The Spot model can tag each of these transactions with the “Paytm” label, and the lender can aggregate all transactions with this tag in whichever way is useful to them. For example, they might want to track the money flowing into or out of the bank account.
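Once every transaction carries a label, this kind of aggregation takes only a few lines. The sketch below assumes each transaction is a dict with a `party` label and a signed `amount` (positive for money coming in, negative for money going out). That record layout and the `money_flow_by_party` helper are hypothetical and not part of the Spot model's output.

```python
from collections import defaultdict


def money_flow_by_party(transactions):
    """Sum inflow and outflow separately for each party label."""
    totals = defaultdict(lambda: {"inflow": 0.0, "outflow": 0.0})
    for txn in transactions:
        # Assumed convention: positive amounts are credits, negative are debits.
        direction = "inflow" if txn["amount"] >= 0 else "outflow"
        totals[txn["party"]][direction] += abs(txn["amount"])
    return dict(totals)
```

A lender could then read the `"Paytm"` entry of the returned dict to see how much money moved in each direction through that label.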
Results of Spot model
Input transaction 1
Results of UPI transaction
Input transaction 2
Result of ATM transaction
- Bank: Names of banks. Most Indian banks and some international banks are supported.
- cards: Credit/Debit card numbers
- ifsc: IFSC codes of banks
- keywords: Pre-defined keywords.
- loan_ids: Work in progress. Right now, numbers of the form XX1234 are extracted. Typically, transactions that share a number belong to the same category.
- party: Name of a party. It can be the name of a person, organization, or location, or the reason for performing the transaction.
- phones: Phone numbers.
- raw_text: The remaining text after extracting basic information from the narration.
- upi_ids: UPI IDs involved in the transaction.
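To make the shape of the response concrete, the fields listed above can be sketched as a plain Python dict that serializes to JSON. The key names follow the list, with `bank` lowercased to match the other keys. The casing, the choice of a list for every extracted field, and the helper names `empty_response` and `to_json` are all assumptions, since the actual response schema is not spelled out here.

```python
import json

# Extracted fields, named after the list above (value types are assumed).
FIELDS = ("bank", "cards", "ifsc", "keywords",
          "loan_ids", "party", "phones", "upi_ids")


def empty_response(narration):
    """Build a response skeleton in which nothing has been extracted yet."""
    response = {field: [] for field in FIELDS}
    # With nothing extracted, the whole narration is still "remaining text".
    response["raw_text"] = narration
    return response


def to_json(response):
    """Serialize a response dict to a readable JSON string."""
    return json.dumps(response, indent=2, sort_keys=True)
```

An extraction step would fill these lists and shrink `raw_text` as it recognizes pieces of the narration.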
The Spot model currently supports Indian bank statements only. However, it can be trained within a week on bank statements from anywhere in the world. For access and more information, contact [email protected]
Thanks to Drashti Shah and Samkit Jain for editing the drafts.
This post was originally published on Medium by Krupa Galiya.