Solving the conundrum of identifying lenders in a bank statement
If you were a wealth manager or a lender analysing hundreds and thousands of narrations in bank statements every month, which data points would interest you most?
- bank transfers
- ordinary financing
- tax payments
- income and spending patterns
- recurring transactions, and
- days of non-sufficient funds (NSF)
All play a vital role in assessing the financial health of your customers.
Identifying lenders in a bank statement is just one small element of the large tapestry that is bank statement analysis. Before this, the bank statements must be machine-readable. Although a PDF or a scanned statement is an electronic file, such files are not machine-readable. Once such files are parsed, they must be reconciled and checked for fraud. In this blog post, we’ll focus on identifying the lenders in a bank statement and on what makes it difficult for machines.
Identifying lenders in a bank statement
One of the most sought-after data points from a bank statement analysis is the set of lenders and loans running in the account. This is where competing lender activity must be identified in order to predict net excess cash flow and ensure sufficient funds. It empowers wealth managers and lenders to answer the holy grail of all questions:
“Will my customer have enough money to support withdrawal of $$$?”
One of the quicker ways to identify lenders is through text analysis. However, this is easier said than done. Many algorithms also use time series analysis of cash flow, i.e. they ask whether a transaction is part of a pattern of transactions.
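As a rough illustration of the time series idea, here is a minimal sketch that flags debits repeating about once a month. It assumes the debits have already been parsed into (date, party, amount) tuples; the function name find_recurring_debits, the thresholds and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def find_recurring_debits(transactions, min_occurrences=3, tolerance_days=3):
    """Return (party, amount) pairs whose debits are spaced about a month apart.

    `transactions` is an iterable of (date, party, amount) tuples, debits only.
    """
    groups = defaultdict(list)
    for when, party, amount in transactions:
        groups[(party, amount)].append(when)

    recurring = []
    for key, dates in groups.items():
        if len(dates) < min_occurrences:
            continue
        dates.sort()
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        # A calendar month is 28 to 31 days; allow some slack for holidays.
        if all(28 - tolerance_days <= gap <= 31 + tolerance_days for gap in gaps):
            recurring.append(key)
    return recurring


# Made-up data: one monthly debit and one irregular debit.
sample = [
    (date(2023, 1, 5), "LENDER A", 2481),
    (date(2023, 2, 5), "LENDER A", 2481),
    (date(2023, 3, 5), "LENDER A", 2481),
    (date(2023, 1, 9), "SHOP B", 500),
    (date(2023, 1, 20), "SHOP B", 500),
    (date(2023, 3, 2), "SHOP B", 500),
]
print(find_recurring_debits(sample))  # [('LENDER A', 2481)]
```

A recurring debit only tells you that a pattern exists. Deciding whether the party behind it is a lender still needs the text analysis discussed next.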
One of the easier methods to identify lending activity in a bank statement is to identify auto-debits and then perform text analysis on the party involved to identify whether it is a lender. However, we cannot rely on it. There are other issues and challenges as described below:
- Truncation/Deletion: Names can be truncated or dropped altogether, e.g. the name of the party deducting the EMI is completely deleted from this narration. Who is the lender?
NOTE: The narrations and the corresponding amounts are shown only to support the statements. The narrations have been handpicked from different statements.
- Concatenation: Words can be concatenated arbitrarily. Algorithms need to be trained to identify lenders from umpteen combinations of concatenations.
- Ambiguous context: In some cases, lexical analysis is not enough. A semantic approach is required. What meaning do you derive from Bajaj Finance crediting an amount on multiple occasions?
We have another example below. It’s understandable that the first transaction corresponding to LIC Housing is a loan, but how do you make sense of the withdrawal of Rs. 2,481.00 on the 5th of every month: is it a premium or an LIC EMI?
- Name mangling: In some cases, names are arbitrarily concatenated and abbreviated to create text that is recognisable to humans but less so to machines, e.g. this is an ACH-mandated auto-debit for RBL Bank, but the narration of the Retail Asset Department of RBL Bank is concatenated and abbreviated.
- Computational explosion: Another prevalent case is that the lender’s name can begin at any character of the narration line. Because of concatenation, algorithms must be trained to find the word boundaries themselves. This implies a very expensive comparison of all permutations and combinations of each lender at each character position on the line.
- Disbursement of personal loan: Many organisations lend to their sister organisations and their employees during bad times, or offer emergency loans. Although they are not competing lenders, this lending does affect the repaying capacity of the merchant.
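To make the concatenation and word-boundary problems concrete, here is a minimal sketch of the simplest text approach: strip everything except letters and digits from both the narration and the lender names, then look for substrings. The alias table and the function names are hypothetical, and this is not a description of any production system.

```python
import re

# Hypothetical alias table: canonical lender name -> spellings seen in narrations.
LENDER_ALIASES = {
    "Bajaj Finance": ["BAJAJ FINANCE", "BAJAJ FIN"],
    "LIC Housing": ["LIC HOUSING"],
    "RBL Bank": ["RBL BANK"],
}


def squash(text: str) -> str:
    """Upper-case the text and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def find_lenders(narration: str) -> list[str]:
    """Return canonical lenders with at least one squashed alias in the narration."""
    haystack = squash(narration)
    return [
        lender
        for lender, aliases in LENDER_ALIASES.items()
        if any(squash(alias) in haystack for alias in aliases)
    ]


print(find_lenders("ACH D- BAJAJFINEMI-12345"))  # ['Bajaj Finance']
print(find_lenders("ATM WDL 1234"))              # []
```

Squashing makes the match robust to arbitrary concatenation, but it also removes the word boundaries that would rule out accidental matches, and it does nothing for truncated or deleted names. Each narration is also compared against every alias, which is where the cost starts to add up.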
How has a Dubai-based bank solved it?
The problem of identifying lending activity could be simplified if transactions were also expressed as codes. For example, 714 stands for online local fund transfer, while 985 stands for fund transfer charges in the narration below.
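When a bank publishes such codes, categorisation turns into a lookup. The sketch below uses the two codes mentioned above. The narration layout (codes appearing as standalone three-digit tokens) and the helper name decode_transaction_codes are assumptions, since the bank's actual format is not reproduced here.

```python
import re

# Codes 714 and 985 come from the example in the text. Treating them as
# standalone three-digit tokens in the narration is an assumption.
TRANSACTION_CODES = {
    "714": "Online local fund transfer",
    "985": "Fund transfer charges",
}


def decode_transaction_codes(narration: str) -> list[str]:
    """Return the description of every known three-digit code in the narration."""
    tokens = re.findall(r"\b\d{3}\b", narration)
    return [TRANSACTION_CODES[token] for token in tokens if token in TRANSACTION_CODES]


# Made-up narration, for illustration only.
print(decode_transaction_codes("714 TRANSFER REF 000123 / 985 CHARGES"))
# ['Online local fund transfer', 'Fund transfer charges']
```

No string similarity is involved, which is why coded narrations are so much easier to process than free text.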
This problem might be tractable for a single statement, because a bank statement contains only a limited amount of lending activity. But wealth managers and lenders analyse thousands of lines of transactions daily. It is impossible to analyse millions of lines of transactions every month by eyeballing them, or even by running a crawler against a table containing a list of all lenders in the country. Wouldn’t this approach require trillions of string comparisons and humongous compute time, even in an optimistic scenario?
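One standard way to avoid comparing every lender against every character position is to put all lender names into a trie and walk it once from each position. The sketch below is a generic illustration of that idea, not a description of Inkredo's system; build_trie, scan and the sample names are hypothetical.

```python
END = "$"  # marker key; never collides because patterns are alphanumeric only


def normalise(text):
    """Upper-case the text and keep only letters and digits."""
    return "".join(ch for ch in text.upper() if ch.isalnum())


def build_trie(names):
    """Build a character trie from (canonical_name, pattern) pairs."""
    root = {}
    for canonical, pattern in names:
        node = root
        for ch in normalise(pattern):
            node = node.setdefault(ch, {})
        node[END] = canonical
    return root


def scan(narration, trie):
    """Return every lender whose pattern starts at any position of the narration."""
    text = normalise(narration)
    found = set()
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no lender pattern continues with this character
            if END in node:
                found.add(node[END])
    return found


trie = build_trie([("Bajaj Finance", "BAJAJ FIN"), ("RBL Bank", "RBL BANK")])
print(scan("ACH-DR-BAJAJFINEMI", trie))  # {'Bajaj Finance'}
```

With this layout, the work per narration grows with the narration length times the longest pattern, rather than with the number of lenders in the table. An Aho-Corasick automaton would take the same idea further, and neither helps with truncated or mangled names.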
The name matching technology is one part of the problem. How to apply it is another part. This is why name identification in a narration is such a big problem. At Inkredo, we are combining computational elements that range from semantics and machine learning all the way to big data processing.
How is Inkredo approaching the truth?
- Keyword Mapping: Having a mapping of keywords and categories is a good starting point.
POS can be mapped to one category and ATW can be mapped to the ATM category. Using keyword matching, we can identify the lender. In this case, it is a housing loan sanctioned by LIC. But this approach also has its drawbacks. Narrations don’t always appear this neatly; they can be abruptly abbreviated and concatenated and can appear like BAJAJFINEMI, in which case no keyword would be found. And how do you know whether BAJAJFINEMI is the same as BAJAJ FINANCE EMI?
- Transaction Type: Categorising solely on the basis of the narration may not cover all cases. The type of transaction can also be introduced into the categorisation process. A keyword may have different meanings when it is part of a credit, debit or default transaction.
- Prioritisation: A narration can contain keywords that may belong to two different categories. In that case, how do you decide which category should be chosen? You can assign priorities.
- Pattern Recognition: A single instance of a transaction may not convey much, but a pattern can convey a lot. Let’s take the example of Salary transactions. If the narration contains a keyword like SLRY, then you can categorise it with your keyword matching algorithm, but this is usually not the case. Salary credits often appear like one of the following
Note that there is no instance of a valid keyword. In such cases, you can find a pattern. Multiple NEFT credits from the same involved party are a potential source of earnings and can be categorised as Salary transactions even when no such keyword is present in the narration.
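The four ideas above can be combined in a small rule-based categoriser. What follows is a minimal sketch under stated assumptions, not Inkredo's implementation: only the ATW to ATM and SLRY to Salary mappings come from this post, while the other rules, the priorities, the min_repeats threshold and the pre-extracted party field are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    keyword: str   # matched as a substring of the upper-cased narration
    txn_type: str  # "credit", "debit" or "any"
    category: str
    priority: int  # the highest priority wins when several rules match


# Hypothetical rules, apart from ATW -> ATM and SLRY -> Salary.
RULES = [
    Rule("ATW", "debit", "ATM", 10),
    Rule("SLRY", "credit", "Salary", 30),
    Rule("LIC", "any", "Insurance", 5),
    Rule("LIC HOUSING", "debit", "Loan repayment", 20),
    Rule("BAJAJ FIN", "debit", "Loan repayment", 20),
    Rule("BAJAJ FIN", "credit", "Loan disbursement", 20),
]


def categorise(narration, txn_type):
    """Keyword mapping, filtered by transaction type and resolved by priority."""
    text = narration.upper()
    matches = [
        rule for rule in RULES
        if rule.keyword in text and rule.txn_type in (txn_type, "any")
    ]
    if not matches:
        return None
    return max(matches, key=lambda rule: rule.priority).category


def categorise_statement(transactions, min_repeats=3):
    """Label (narration, txn_type, party) rows, with a pattern-based fallback."""
    def is_neft_credit(narration, txn_type):
        return txn_type == "credit" and "NEFT" in narration.upper()

    credits_per_party = Counter(
        party for narration, txn_type, party in transactions
        if is_neft_credit(narration, txn_type)
    )
    labels = []
    for narration, txn_type, party in transactions:
        label = categorise(narration, txn_type)
        # No keyword matched: repeated NEFT credits from one party look like salary.
        if (label is None and is_neft_credit(narration, txn_type)
                and credits_per_party[party] >= min_repeats):
            label = "Salary"
        labels.append(label or "Uncategorised")
    return labels
```

For instance, categorise("ACH D- LIC HOUSING FIN", "debit") matches both the LIC and the LIC HOUSING rules, and the priority makes it a loan repayment rather than insurance. Plain substring keywords will still misfire on narrations like BAJAJFINEMI, so a real system would need the fuzzier name matching discussed earlier.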
If you are interested in solving problems around the movement of money across the world, then we’re hiring!
Thanks to Samkit Jain for reading the initial draft and contributing to it.
This post was originally published on Medium by Kumar Tanmay.