Why OCR-ing a bank statement is a bad idea

A lending company needs the answer to just one question when giving loans, “Will the borrower be able to repay the loan?”.

Coming to a binary (yes or no) answer to that question involves a lot of work, both manual and automated. Lending companies go through various financial documents of the borrower and perform a thorough analysis to reach a conclusion. With the advancements in artificial intelligence and machine learning, some companies have been able to reduce the time to approve a loan to a day (but they don’t share the percentage of bad loans, probably because OCR systems are still not as good as they want them to be). To understand the past spending behaviour of a borrower and predict their future ability to repay, one of the financial documents that every lending company asks for is a bank account statement. A bank account statement can be tens or hundreds of pages long. An individual’s statement may contain just a few hundred transactions, while a company’s can run into the thousands.

The bank statements borrowers share come in every format imaginable. You’ll find PDFs (both bank-generated and scanned copies), CSVs, images and, in rare cases, even HTML files. In this post, we’ll talk about the most common type: images. Even among images, you’ll see a lot of variety:

  • screenshots
  • blurry photos
  • high-resolution photos
  • low-resolution photos
  • photos with folded pages
  • photos in bad lighting

Dealing with images has always been problematic. You cannot simply copy and paste the text, and you cannot extract it accurately either, because the shapes of some letters and numbers are ambiguous. Images are human-readable but not machine-readable.

OCR to the rescue!

Optical Character Recognition or OCR is a technology that recognizes text within an image. Humans have the ability to easily understand the text in an image, however complex (after all, we are the masters of the sacred texts!).

Source: https://www.reddit.com/r/comics/comments/7xi2bh/the_sacred_texts_oc/

Over the past few years, OCR solutions have improved considerably. They can recognise handwritten text with a good degree of accuracy. Giants like Google and Microsoft have also invested in the field and released their own text recognition products.

OCR is known to work well when the characters are printed, the image quality is high and the lighting is ideal. Bank statement images shared by borrowers have one thing going for them: they contain printed text. But this is not enough. Even under ideal conditions, it won’t be enough.

The majority of errors in OCR systems come from incorrect classification. An OCR system usually misclassifies a character when a letter and a number share the same visual features. Some of the ambiguous cases are:

  • O (letter) and 0 (number)
  • I (uppercase i) and 1 (number)
  • l (lowercase L) and 1 (number)
  • S (uppercase s) and 5 (number)
  • Z (uppercase z) and 2 (number)

When OCR is used for general-purpose text detection, this ambiguity might not matter much, but on a bank statement, especially one used for lending decisions, it is a serious problem. Misreading 50,000 as SO,OOO means the amount is no longer a number at all, so an important transaction entry can be missed entirely.
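To make the 50,000 versus SO,OOO problem concrete, here is a minimal Python sketch of how a token from an amount column could be checked after OCR. It is only an illustration, not something we ran in production: the function name `check_amount`, the confusion table and the assumed amount format (comma-separated thousands, optional two decimal places) are all assumptions made for this example.

```python
import re

# Letter -> digit confusions taken from the list above.
CONFUSABLE = {"O": "0", "I": "1", "l": "1", "S": "5", "Z": "2"}

# Assumed amount format: plain digits or comma-grouped thousands,
# optionally followed by exactly two decimal places.
AMOUNT_RE = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")


def check_amount(token: str) -> dict:
    """Report whether an OCR token looks like an amount and suggest a repair.

    The 'candidate' is only a guess obtained by swapping confusable letters
    for digits; it should be reviewed by a human, not trusted blindly.
    """
    suspicious = [ch for ch in token if ch in CONFUSABLE]
    repaired = "".join(CONFUSABLE.get(ch, ch) for ch in token)
    return {
        "raw": token,
        "valid": AMOUNT_RE.fullmatch(token) is not None,
        "suspicious_chars": suspicious,
        "candidate": repaired if AMOUNT_RE.fullmatch(repaired) else None,
    }


if __name__ == "__main__":
    print(check_amount("SO,OOO"))
    print(check_amount("50,000"))
```

A check like this only makes sense for columns that are expected to hold amounts. Applied to a description column it would corrupt ordinary words, and even for amounts it can only flag a token for review; it cannot prove what was actually printed on the page.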

Images shared by borrowers are usually far from ideal. They are taken in bad lighting, are blurry or low-resolution, have pen or pencil markings, show folded pages, and so on. Each of these factors makes things worse and leads to more incorrectly classified characters.

Some examples where OCR didn’t work for us

Could not detect all the text. Misread . as : (Microsoft Azure’s Computer Vision API)
Could not detect all the text. Misread 0 as O and , as . (Google’s Cloud Vision API)
Could not detect any text (Microsoft Azure’s Computer Vision API)
Misread , as . and C as G and . as : (Google’s Cloud Vision API)
Misread Z as 2 (Google’s Cloud Vision API)
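The comma-for-dot misreads in the examples above are especially dangerous because the result can still look like a valid number. The short Python sketch below illustrates this under one assumption: amounts use a comma as the thousands separator and a dot as the decimal point. The helper `parse_amount` is a hypothetical name introduced for this example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(text: str) -> Optional[Decimal]:
    """Parse an amount, assuming ',' groups thousands and '.' marks decimals.

    Returns None when the text is not a valid number under that assumption.
    """
    try:
        return Decimal(text.replace(",", ""))
    except InvalidOperation:
        return None


if __name__ == "__main__":
    printed = parse_amount("12,500")    # what the statement says
    misread = parse_amount("12.500")    # same digits, comma read as a dot
    print(printed, misread)             # both parse, but the values differ
    print(parse_amount("1.250.50"))     # two dots: rejected, returns None
    print(parse_amount("1250:50"))      # dot read as a colon: rejected
```

The second case is the worrying one: a misread such as `1.250.50` at least fails to parse and can be caught, while `12.500` parses cleanly and silently shrinks the amount by a factor of a thousand.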

Current state of OCR systems

When Inkredo was in P2P lending, we too accepted images of bank account statements and manually typed every entry into an Excel sheet. As expected, this was a time-consuming process, so we tried multiple OCR solutions (in-house, open-source and paid), but none of them gave us the desired result. Some of them worked really well with bad-quality photos, but all of them struggled with ambiguous characters.

To conclude, OCR is not reliable for text detection in financial documents, where reading a comma as a dot (or vice versa) can make a significant difference. PDFs that contain actual text, not scanned images, should be the preferred format: they are not as easy to manipulate as a CSV, and it is easier to extract text from them than from images. Plus, you won’t have to buy expen$ive OCR solutions.

Do you think OCR is reliable when it comes to credit risk assessment? Share your thoughts in the comments.

This post was originally published on Medium by Samkit Jain.