In the second half of 2018, we developed a categorization model based on keywords and patterns found in transaction narrations. If Zomato or Swiggy appeared in the narration, our algorithm would comfortably categorize the transaction as food & beverage, and Ola or Uber as travel. It soon became challenging to categorize transactions with lesser-known merchants. For instance, how do we handle Swiggy appearing as Bundl Technologies, or any merchant that appears under its company name instead of its trade name?
The ultimate aim of identifying merchant names is to categorize income and expense behavior. That’s easier said than done, because income and expense behavior is subjective to underwriting: Zomato could be categorized as a vendor, a customer, food & beverage, or a restaurant.
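A minimal sketch of this kind of keyword-based categorization is shown below. The keyword table and the function name `categorize` are hypothetical and only illustrate the idea; they are not the production rules.

```python
import re
from typing import Optional

# Hypothetical keyword table; the real keyword list is not given in this post.
KEYWORD_CATEGORIES = {
    "zomato": "food & beverage",
    "swiggy": "food & beverage",
    "ola": "travel",
    "uber": "travel",
}

def categorize(narration: str) -> Optional[str]:
    """Return the category of the first known keyword in the narration, else None."""
    # Match whole alphanumeric tokens so that "ola" does not fire inside "cola".
    for token in re.findall(r"[a-z0-9]+", narration.lower()):
        if token in KEYWORD_CATEGORIES:
            return KEYWORD_CATEGORIES[token]
    return None

print(categorize("UPI/SWIGGY/123456/Payment"))        # food & beverage
print(categorize("NEFT BUNDL TECHNOLOGIES PVT LTD"))  # None
```

The second call shows the limitation: a narration carrying only the company name falls through the keyword table.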
How can we give users the freedom to name categories from merchant names and show them the output based on these categories?
We aim to inform the user about the behavior of transactions with merchants. For example, it could be income from customers, lending, or vendor payments. But names are not spelled consistently across a statement. They may be similar but not the same, e.g. Paytm may appear as Paytm, Payt, One97, or One97 comm (parent name), and Zomato may appear as Zom, Zomat, Zomato Media Pvt Ltd, Zomato Media, etc.
How can we intelligently merge such names? Unless the names are spelled similarly, merging them is challenging.
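To make the difficulty concrete, here is a naive sketch that merges name variants using string similarity from Python's standard `difflib` module plus a small alias table. The canonical list, the alias table, the 0.6 threshold, and the function names are all assumptions for illustration; this is not how the Spot model works.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data: canonical trade names and known parent-company aliases.
CANONICAL = ["Paytm", "Zomato", "Swiggy"]
ALIASES = {
    "one97": "Paytm",
    "one97 comm": "Paytm",
    "bundl technologies": "Swiggy",
}

def normalize(name: str) -> str:
    """Lowercase the name and keep only its alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))

def merge_name(raw: str, threshold: float = 0.6) -> str:
    """Map a raw name to a canonical one, or return it unchanged if nothing is close."""
    key = normalize(raw)
    if key in ALIASES:  # spelling alone cannot link One97 to Paytm
        return ALIASES[key]
    first_word = key.split(" ")[0]
    best_name, best_score = raw, 0.0
    for candidate in CANONICAL:
        score = SequenceMatcher(None, first_word, candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else raw

for name in ["Payt", "Zom", "Zomato Media Pvt Ltd", "One97 comm"]:
    print(name, "->", merge_name(name))
```

Similarity handles truncated spellings such as Payt or Zom, but parent-company names only merge because they were listed by hand, which is exactly the part that does not scale.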
The Business version of the blog is available here. This blog emphasizes the technical aspect of the Spot model.
Derive expense & income behavior from banking transactions
We’ve been dealing with bank transactions since 2017. Transaction narrations in bank statements are by default raw and unstructured. This makes it messy to comprehend and reconcile transactions. We’ve found that names in transactions were in a terrible state. This made it challenging for us to determine the purpose of a transaction based on its description. Merely reading the amount and date wasn’t enough because only the owners of the bank statement could understand the description.
Here are a few examples of how the party is hidden in narration.
Table 1.1: Spot the party name in the Narration column.
You can imagine the problems people face when they try to decipher information from thousands of lines of banking transactions every day. Even we sometimes wonder what a description actually means when the transaction value alone doesn’t help us recall it.
A multi-purpose transaction categorization solution supporting all open banking use-cases
As the infrastructure that allows third parties to access bank transactions is growing around the world, more and more businesses want to know the identities behind transactions in easier ways. We’re offering a categorization model with customized labels, in which our customers benefit from an ocean of data much larger than their own.
Turning messy bank transactions into beautiful names could have a vast impact on many use cases:
- Banking: Identify recurring transactions and subscriptions to build better customer engagement tools.
Table 1.2: UPI transaction from PhonePe
The above table shows a transaction with a merchant (Q94591823) via PhonePe. This is extremely tough to categorize because only the account holder might remember the merchant. Depending on the use case, this may be tagged as “Groceries” or “Supplies”, but it could also be “Personal expense” or “Shopping”.
- Lending: Identify sources of income and active liabilities with greater accuracy. Reduce credit risk, approve more loans with confidence.
- Buy now, pay later: Reduce customer risk verification and streamline the customer approval process.
- Personal Finance: Build user experiences that encourage better habits. Identify more earning and spending patterns with higher accuracy.
- Loan Broker: Improve customer screening. Match your customers with the right lenders with greater efficiency.
- Accounting: Reconcile banking transactions to accounting entries. Bookkeepers save hours and get their job done in minutes.
Identifying merchant names from transactions with regex
To resolve this problem, we tried several approaches: regex-based, narration-pattern-based, and ML-based.
A regex (rule-based) approach extracts information that follows a common pattern. For example, UPI narrations have a specific structure that regex rules can extract.
Table 1.3: Extraction of UPI ids from Narration with regex
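As a rough illustration of the regex approach, the sketch below pulls UPI ids of the form `handle@provider` out of a narration. The pattern and the sample narration are assumptions made for this example; the actual rules behind Table 1.3 are not reproduced here.

```python
import re

# Assumed shape of a UPI id: a handle, "@", then an alphabetic provider code.
# Note that this loose pattern would also match the start of an email address.
UPI_ID = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z]+")

def extract_upi_ids(narration: str) -> list:
    """Return every substring of the narration that looks like a UPI id."""
    return UPI_ID.findall(narration)

# Made-up narration for illustration only.
print(extract_upi_ids("UPI/123456789012/Payment/q94591823@ybl/Yes Bank"))
# ['q94591823@ybl']
```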
For unstructured data that the rule-based approach can’t extract, we use the machine learning approach.
How we started with a categorization model
Before building the Spot model, we had developed a categorization model trained on raw narrations. It took a deep learning approach: we fed in raw narrations, and the model predicted a list of words with probability scores. Words with a score greater than 0.49 were taken as the final result.
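The thresholding step can be sketched as below. The words and scores are made up for illustration, and `select_words` is a hypothetical helper; only the 0.49 cut-off comes from the description above.

```python
THRESHOLD = 0.49  # cut-off stated above

def select_words(scored_words):
    """Keep the words whose probability score is strictly greater than the threshold."""
    return [word for word, score in scored_words if score > THRESHOLD]

# Made-up model output for one narration: (word, probability score) pairs.
prediction = [("zomato", 0.91), ("upi", 0.12), ("media", 0.55), ("payment", 0.49)]
print(select_words(prediction))  # ['zomato', 'media']
```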
Results from the Previous Model
The previous model required sufficient training data for each category. We ran into the problem of an imbalanced dataset, where a lot more data was needed to train each transaction type. For example, the model gave good results on UPI transactions but poor results on ATM transactions. We could get more data from transactions, but categorized data was a challenge. To address this, we created an internal tool, Pandora, to manually verify and label new data every day. The labeled dataset was then used to train the model. However, the imbalanced dataset problem wasn’t solved completely.
The objective of an ML approach
From the regex(rule-based) approach, we could identify the merchant’s identity in the form of UPI handles, card numbers, and loan numbers. However, there were a significant number of transactions where the regex approach gave us a noisy output. We found that names in those narrations weren’t placed in any pattern. So we heavily cleaned such narrations before training any model.
Table 1.5: The experiments with pre-built models and our own custom models
As per the above table, we went with a customized Spacy model. We trained our model on 7,000+ narrations from various Indian banks.
How we arrived at the Spacy model
When training models with techniques such as Stanford NER, Spacy, and BERT, we faced two problems:
1. Each technique needs its own input format.
2. The data we had annotated had issues, such as a single word annotated multiple times and overlapping annotations.
To address both, we developed a tool that first validates the annotations and returns only the valid ones, then converts them into the format the given technique needs. For example, if the given technique is Spacy, the tool validates the annotations and converts them into the format Spacy needs. This conversion made it easy to try the same annotated data with any technique.
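The validate-then-convert idea can be sketched as follows. This is an assumption about how such a tool might look, not the tool itself: annotations are taken to be `(start, end, label)` character spans, and the output uses the `(text, {"entities": [...]})` tuple layout familiar from spaCy v2-style training examples. The labels and function names are hypothetical.

```python
def validate(text, spans):
    """Drop duplicate, out-of-range, and overlapping spans; keep the longest at each start."""
    valid = []
    # set() removes exact duplicates; sort by start, then longest span first.
    for start, end, label in sorted(set(spans), key=lambda s: (s[0], s[0] - s[1])):
        if not (0 <= start < end <= len(text)):
            continue  # span falls outside the text
        if valid and start < valid[-1][1]:
            continue  # overlaps the previously kept span
        valid.append((start, end, label))
    return valid

def to_spacy_format(text, spans):
    """Convert validated spans into a (text, {"entities": [...]}) training tuple."""
    return (text, {"entities": validate(text, spans)})

text = "UPI/ZOMATO MEDIA/Payment"
spans = [(4, 16, "PARTY"), (4, 16, "PARTY"), (4, 10, "PARTY"),
         (17, 24, "KEYWORD"), (20, 40, "KEYWORD")]
print(to_spacy_format(text, spans))
# ('UPI/ZOMATO MEDIA/Payment', {'entities': [(4, 16, 'PARTY'), (17, 24, 'KEYWORD')]})
```

Supporting another technique would then mean adding one more converter that consumes the output of `validate`.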
Once we had all valid annotations, we trained various models with Spacy, BERT, and Stanford NER, using different hyperparameters and flags. The deep analysis is here: Track models. To test how a model performs on unseen data, we developed a validation script with all our custom performance metrics.
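The custom metrics themselves are not listed in this post. As an assumption, a validation script of this kind would usually include at least entity-level precision, recall, and F1, which can be sketched like this (exact span-and-label matches only; `entity_prf` is a hypothetical name):

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # spans that match exactly
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = [(4, 16, "PARTY"), (17, 24, "KEYWORD")]
predicted = [(4, 16, "PARTY"), (0, 3, "KEYWORD")]
print(entity_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```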
We removed the extra tokenization because it was breaking the actual results.
For example, with tokenization, “Returned” was tokenized into [“Return”, “ed”].
We tried the base BERT model’s tokenization and found that it needed more annotated data. Hence we dropped the idea of using any tokenizer that needs annotated data and training. Instead, we built our own list of keys that act as separators to split the words. We then learned that too many keys gave unexpected results. So we resolved those issues, removed the extra tokenization, and trained the Spacy model again.
Table 1.6: Comparing results with and without tokenization
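A separator-based split of this kind might look like the sketch below. The separator characters are an assumed, deliberately short list; the actual keys are not given in this post.

```python
import re

# Hypothetical separator keys; whitespace always splits as well.
SEPARATORS = "/-:*|_"
_SPLIT = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")

def tokenize(narration: str) -> list:
    """Split a narration on separator keys only, leaving each word intact."""
    return [token for token in _SPLIT.split(narration) if token]

print(tokenize("UPI/Returned-ZOMATO MEDIA/123"))
# ['UPI', 'Returned', 'ZOMATO', 'MEDIA', '123']
```

Because the split never looks inside a word, “Returned” stays whole instead of becoming “Return” and “ed”.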
Building pipeline & deployment
We developed a pipeline that fetches data from bank statements and returns the desired result.
We have deployed the model on AWS SageMaker.
Diagram explaining a process and final results for the Spot model.
Results of Spot model
Input transaction 1
Table 2.1: UPI Transaction
Table 2.2: Results of UPI Transaction
Input transaction 2
Table 2.3: ATM Transaction
Table 2.4: Results of ATM Transaction
The results contain the following fields:
- bank: The names of banks; most Indian banks and some international banks are supported.
- cards: Credit/Debit card numbers.
- ifsc: IFSC codes of banks.
- keywords: Pre-defined keywords.
- loan_ids: Work in progress. Right now, numbers of the form XX1234 are extracted. Typically, a common number means the transactions belong to the same category.
- party: Name of a party; it can be the name of a person, organization, or location, or the reason for performing the transaction.
- phones: Phone numbers.
- raw_text: The remaining text after extracting basic information from the narration.
- upi_ids: UPI ids involved in the transaction.
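To show how structured fields and `raw_text` relate, here is a small sketch that extracts two of the fields with regex and keeps whatever is left over. The patterns (an 11-character IFSC code whose fifth character is 0, and a 10-digit Indian mobile number), the sample narration, and the function name are assumptions; the Spot pipeline's own rules are not shown in this post.

```python
import re

# Assumed patterns for two of the fields listed above.
PATTERNS = {
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "phones": re.compile(r"\b[6-9]\d{9}\b"),
}

def extract_basic(narration: str) -> dict:
    """Extract structured fields, then return what remains as raw_text."""
    result = {}
    remaining = narration
    for field, pattern in PATTERNS.items():
        result[field] = pattern.findall(remaining)
        remaining = pattern.sub(" ", remaining)  # remove what was extracted
    # Collapse leftover slashes and whitespace into single spaces.
    result["raw_text"] = " ".join(part for part in re.split(r"[\s/]+", remaining) if part)
    return result

# Made-up narration for illustration only.
print(extract_basic("NEFT/HDFC0001234/9876543210/RAHUL TRADERS"))
# {'ifsc': ['HDFC0001234'], 'phones': ['9876543210'], 'raw_text': 'NEFT RAHUL TRADERS'}
```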
Current status of Spot model
As of Jan 2021, the accuracy of the Spot model is 86%. We’re aiming to take it to a minimum of 95% by applying a feedback mechanism, which gives us a good amount of annotated data. After this, we will easily be able to calculate the net of transactions between merchants. These values, mapped to merchant names, will give users a true sense of income and expense behavior from banking transactions.
The Spot model currently supports Indian bank statements only. However, it can easily be trained within a week on bank statements from anywhere in the world. For access and more information contact [email protected]
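Once each transaction carries a party name, the net per party is a simple aggregation. The sketch below assumes signed amounts (credits positive, debits negative) and uses made-up data; `net_by_party` is a hypothetical helper.

```python
from collections import defaultdict

def net_by_party(transactions):
    """Sum signed amounts per party: credits are positive, debits are negative."""
    totals = defaultdict(float)
    for party, amount in transactions:
        totals[party] += amount
    return dict(totals)

# Made-up (party, signed amount) pairs for illustration only.
transactions = [("Zomato", -250.0), ("Acme Traders", 12000.0), ("Zomato", -150.0)]
print(net_by_party(transactions))
# {'Zomato': -400.0, 'Acme Traders': 12000.0}
```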
This post was originally published on Medium by Krupa Galiya.