Published on 00/00/0000
Last updated on 00/00/0000
INSIGHTS
10 min read
It is no wonder companies are taking stringent measures to make sure they fully comply with the EU's General Data Protection Regulation (GDPR), which protects the privacy of the Personally Identifiable Information (PII) of EU residents. Much stricter regulations are coming sooner rather than later, such as the California Privacy Rights Act in 2023 and the SAFE DATA Act, expected by the end of 2021. Protecting PII is therefore becoming a matter of paramount importance to businesses. On the other hand, it is quite burdensome for humans to check every public medium at their disposal for PII.
In this article we will see how to retrieve textual information from an image with Optical Character Recognition (OCR) and how to use a Natural Language Processing library called LexNLP to check whether the extracted text contains any PII. This is the output of a recent hackathon I participated in. As a high-level overview, we will first walk through OCR and its application, and then see how LexNLP helps detect the presence of PII. To tie everything together, we will write a simple Python script that extracts text and checks it for PII. All of this is exposed through a Flask web application so that users can interact with it conveniently.
Optical Character Recognition, or OCR, is a technology that detects textual information in an image and extracts it as machine-encoded text that the computer can process. Detecting text through automated processes is not as trivial as it appears to humans. Behind the scenes is a series of complicated steps involving image processing and other complex algorithms, which finally extract the text. To the computer, the processed image contains only a matrix of white and black dots. Extraction involves multiple phases such as despeckling, binarisation, line removal, layout analysis, script recognition, segmentation, normalization, matrix matching, and post-processing. We will keep this jargon out of the scope of this article for simplicity. In this article, we will stick to the PyTesseract library to retrieve text from an image. It is a wrapper for Google's Tesseract-OCR engine. Other libraries, such as PyOCR, can also be used.
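To make the "matrix of white and black dots" idea concrete, here is a minimal sketch of the binarisation phase alone. It is not how Tesseract or PyTesseract implement it: the function name, the list-of-rows image representation, and the fixed threshold of 128 are all assumptions chosen for illustration, and real OCR engines typically pick the threshold adaptively.

```python
# Illustrative sketch only: fixed-threshold binarisation of a grayscale image
# stored as a list of rows of 0-255 intensities. The name binarise() and the
# default threshold are assumptions, not part of any OCR library.

def binarise(gray_rows, threshold=128):
    """Map each pixel to 0 (black) or 255 (white) around a fixed threshold."""
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in gray_rows
    ]

# A tiny 2x3 "image" with a mix of dark and light pixels.
sample = [
    [12, 200, 130],
    [127, 128, 90],
]
binary = binarise(sample)
print(binary)
```

After this step every pixel is either black or white, which is the form the later phases (segmentation, matching) work on.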
Natural Language Processing (NLP) is a vast field in itself, primarily concerned with the interaction between computers and human language and with processing and analyzing large amounts of natural language data. Here we will use this technology through an open-source library called LexNLP to extract PII from textual content.
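As a rough intuition for what pattern-based PII detection involves, the sketch below finds US Social Security numbers and phone numbers with two regular expressions. This is not LexNLP's implementation: the patterns, labels, and the find_pii() name are hypothetical simplifications, and the sample values are made up.

```python
import re

# Hypothetical, simplified patterns for illustration; LexNLP's own rules differ.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")

def find_pii(text):
    """Return (label, matched_text) pairs in order of appearance in text."""
    hits = []
    for label, pattern in (("ssn", SSN_PATTERN), ("us_phone", PHONE_PATTERN)):
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    hits.sort()  # order by position in the text
    return [(label, value) for _, label, value in hits]

sample = "Call (212) 555-0100 about SSN 123-45-6789."
print(find_pii(sample))
```

A library such as LexNLP saves us from maintaining patterns like these by hand and covers more categories.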
For this project, we are going to use the following: Python 3, pipenv, PyTesseract, Pillow, LexNLP, and Flask.
Enough explanation; now let's build the real thing. In layman's terms, it is a web page that lets a user upload an image, snapshot, or scanned photo to the server and get back the text from the image along with any PII it may contain. First we will install all the prerequisites for this project. Then we will build the Flask application that handles user interaction. We will use pip to install software packages. To install pip, download get-pip.py manually or with the curl command:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Then navigate to the folder where get-pip.py was downloaded and run the following command to install pip:
python get-pip.py
Now we are ready to install pipenv using the following command:
pip install pipenv
pipenv creates and manages a virtual environment for our project. Now that we have pipenv, let's create a project directory and initialise the environment with the following command. The --three flag asks pipenv to use the system's Python 3.
mkdir lexnlp-extraction && cd lexnlp-extraction && pipenv install --three
Activate the virtual environment with the command:
pipenv shell
Now we can install all packages and dependencies with the pipenv install command. We depend on pytesseract for OCR functions and Pillow for image manipulation, so let's install both of them now.
pipenv install pytesseract Pillow
The most important dependency is LexNLP, which provides the core functionality for finding PII in the supplied text. To install it, clone the LexNLP Git repository to a local folder and install it with the pipenv install command:
git clone https://github.com/LexPredict/lexpredict-lexnlp.git
cd lexpredict-lexnlp
pipenv install
The above steps install the LexNLP library, which provides an array of features, but we will focus on its PII extraction feature.
Last but not least, install the Flask framework by running the following command:
pipenv install flask
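After the installs above, a quick sanity check can save some debugging later. The sketch below reports which Python modules cannot be found and whether a tesseract executable is on the PATH; pytesseract is only a wrapper, so the Tesseract engine itself has to be installed separately. The helper names and the module list are assumptions made for this illustration.

```python
import importlib.util
import shutil

# Import names assumed from the packages installed above; note that Pillow is
# imported as "PIL".
REQUIRED_MODULES = ["pytesseract", "PIL", "lexnlp", "flask"]

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def tesseract_on_path():
    """Report whether a 'tesseract' executable is discoverable on PATH."""
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(REQUIRED_MODULES) or "none")
    print("Tesseract binary found:", tesseract_on_path())
```

Run it inside the pipenv shell so that it inspects the project's virtual environment rather than the system Python.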
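Now that all the prerequisites are installed, it is time to create the necessary scripts. Should you need to learn Flask before moving ahead, feel free to visit the Flask quickstart guide. Before writing the scripts, let us see what the project layout looks like: app.py, ocr_extraction.py, and lexnlp_extraction.py sit in the project folder, the HTML files go in templates/, and uploaded images are saved under static/uploads/.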
Let’s define the content of ocr_extraction.py:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def ocr_extraction(filename):
    """
    This function will handle the core OCR processing of images.
    """
    text = pytesseract.image_to_string(Image.open(filename))
    return text
The above script opens the image with the Image module of the Pillow library and then extracts the text with the image_to_string() function of pytesseract.
lexnlp_extraction.py is another file, which defines a function that extracts the list of PII from the supplied text.
import lexnlp.extract.en.pii

def extract_pii(input_string):
    return list(lexnlp.extract.en.pii.get_pii(input_string))
app.py is the file that starts the Flask application. Here is the code.
import os
from flask import Flask, render_template, request
# only ocr_extraction is defined in ocr_extraction.py, so import just that
from ocr_extraction import ocr_extraction
from lexnlp_extraction import extract_pii

# define folder to save the uploaded image
UPLOAD_FOLDER = 'static/uploads/'
# allowed image file extensions
ALLOWED_EXTENSIONS = {'png', 'jpg', 'jpeg'}

app = Flask(__name__)

# validates file extension
def allowed_file(filename):
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

# route and function to handle the home page
@app.route('/')
def home_page():
    return render_template('index.html')

# route and function to handle the upload page
@app.route('/upload', methods=['GET', 'POST'])
def upload_page():
    if request.method == 'POST':
        # check if there is a file in the request
        if 'file' not in request.files:
            return render_template('upload.html', msg='No file selected')
        file = request.files.get('file')
        # if no file is selected
        if file.filename == '':
            return render_template('upload.html', msg='No file selected')
        if file and allowed_file(file.filename):
            # save the upload, then run OCR on the saved copy
            saved_path = os.path.join(UPLOAD_FOLDER, file.filename)
            file.save(saved_path)
            # OCR function extracts text from the saved image path
            extracted_text = ocr_extraction(saved_path)
            # LexNLP extracts a list of PII items of possibly different categories
            pii = ", ".join(map(str, extract_pii(extracted_text)))
            # send the OCR-extracted text and the LexNLP-extracted PII to the page
            return render_template('upload.html',
                                   msg='Successfully processed',
                                   extracted_text=extracted_text, pii_text=pii,
                                   img_src=UPLOAD_FOLDER + file.filename)
        else:
            return render_template('upload.html', msg='Please upload a png, jpg or jpeg file')
    elif request.method == 'GET':
        return render_template('upload.html')

if __name__ == '__main__':
    app.run()
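The extension check in app.py can be tried on its own, outside Flask. The sketch below repeats allowed_file() and adds a hypothetical upload_path() helper that drops any directory part of a client-supplied filename before joining it to the upload folder. The helper is an assumption for illustration and not part of the project; for real deployments, secure_filename() from werkzeug.utils is the more thorough option.

```python
import os

ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg"}

def allowed_file(filename):
    """Same rule as app.py: the text after the last dot must be allowed."""
    return "." in filename and \
        filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS

def upload_path(upload_folder, filename):
    """Hypothetical helper: drop any directory part before joining paths."""
    # Normalise Windows separators so basename also strips "..\" prefixes.
    base_name = os.path.basename(filename.replace("\\", "/"))
    return os.path.join(upload_folder, base_name)

print(allowed_file("scan.PNG"))    # extension matched case-insensitively
print(allowed_file("notes.pdf"))   # not an allowed image type
print(allowed_file("archive"))     # no extension at all
print(upload_path("static/uploads/", "../../etc/passwd.png"))
```

Keeping only the base name means a crafted filename cannot point the save location outside the upload folder.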
The upload_page() function is called when an image is uploaded from the HTML page, and the uploaded image is stored in the static/uploads/ folder. Similarly, the HTML files are stored in the templates/ folder. We have to create both of these folders manually and keep the HTML files (shown below) in the templates/ folder. The index.html file is served by the default (basic) route as the home page, and it has a hyperlink to upload.html. Below is the code.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Index</title>
<style type="text/css">
span {font-size: 1.6em;}
</style>
</head>
<body>
<h1><center>Welcome to LexNLP Demo</center></h1>
<h3><center>This demo shows how to extract Personally Identifiable Information from an image file</center></h3>
<h4><center>Please click the link below to upload the image file (png/jpg/jpeg)</center></h4>
<p><center><a href="http://127.0.0.1:5000/upload"><span>upload</span></a></center></p>
</body>
</html>
Finally, upload.html is responsible for submitting the image via the POST method and rendering the response from app.py. Below is the code:
<html>
<head><title>Upload Image</title></head>
<body>
<center>
{% if msg %}
<p class = "p3">{{ msg }}</p>
{% endif %}
<h1>Upload new File</h1>
<form method=post enctype=multipart/form-data>
<p><input type=file name=file> <input type=submit value=Upload>
</form>
<h1>Result:</h1>
{% if img_src %}
<img height = 400 width = 300 border=1 src="{{ img_src }}">
{% endif %}
{% if extracted_text %}
<p class = "p2"> The extracted text from the image above is: <br>
<b><i> {{ extracted_text }} </i></b>
</p>
{% else %}
The extracted text will be displayed here
{% endif %}
{% if pii_text %}
<p class = "p1"> The Personally Identifiable Information extracted from the uploaded image is: <br>
<b><i><span style="color: red"> {{ pii_text }} </span></i></b>
</p>
{% endif %}
</center>
</body>
</html>
Now we are ready to run the app. All we have to do is activate the virtual environment in the project directory and start the server by running the commands:
pipenv shell
flask run
The server should start with the following message:
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Open the URL displayed above (127.0.0.1:5000) in the browser, and we should see the home page. Clicking the upload link should show the upload page.
On the upload page, click the Choose File button, select from a local folder the image from which we want to fetch the text and PII, and then click the Upload button. The page then shows the extracted text and PII. Below are the results for different kinds of images: a color scan of a handwritten document, a black and white scan of a handwritten document, and a black and white screenshot of a digitally written document.
Result for screenshot image of digitally written document
The above result is for a screenshot of a digitally written document. Both the text and the PII (such as SSN and phone number) are fetched accurately.
Result for black and white scanned image of handwritten document
The above result is for a black and white scan of a handwritten document. The SSN was fetched poorly: a 1 was misinterpreted as '(', and hence no PII was detected.
Result for color scanned image of handwritten document
The above result is for a color scan of a handwritten document. It is quite similar to the previous one, the difference being "Live en" vs "Live tn".
Moreover, it would be worth the effort to contribute to LexNLP so that its PII scope includes information such as medical records and tax records. The accuracy of the fetched text depends heavily on how good the contrast of the image is. This area definitely needs further analysis and testing. The primary purpose of this article is to share what I learned from a hackathon, which may help beginners trying to get into this field.
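One possible mitigation for misreads like the '(' in place of 1 is a small clean-up pass between OCR and PII detection. The sketch below is an assumption-laden illustration, not something the project or LexNLP does: the look-alike table, the SSN-shaped pattern, and the repair_ssn_like_tokens() name are all hypothetical, the sample text is made up, and the pattern can produce false positives on ordinary words that happen to fit the shape.

```python
import re

# Assumed look-alike substitutions; the "(" -> "1" entry mirrors the misread
# reported above, the others are common OCR confusions added for illustration.
LOOKALIKES = str.maketrans({"(": "1", "l": "1", "I": "1", "O": "0", "o": "0"})

# A token that is SSN-shaped once look-alike characters are counted as digits.
SSN_LIKE = re.compile(r"(?<![\w(])[\d(lIOo]{3}-[\d(lIOo]{2}-[\d(lIOo]{4}(?!\w)")

def repair_ssn_like_tokens(text):
    """Replace look-alike characters with digits inside SSN-shaped tokens only."""
    return SSN_LIKE.sub(lambda m: m.group().translate(LOOKALIKES), text)

ocr_text = "Name: Jane Doe SSN: (23-45-6789"
print(repair_ssn_like_tokens(ocr_text))
```

The repaired text could then be passed to extract_pii() as before. Text outside SSN-shaped tokens is left untouched, so ordinary parentheses in prose are not rewritten.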
Although we have achieved a lot in this project, we could also have included PDF files in the text extraction process, for which we could use the PyPDF2 library; the rest of the process would work the same way. Pytesseract and LexNLP are great open-source libraries for OCR and PII detection, and they can be very useful in many use cases for making sure PII privacy requirements are met. The source code of the project can be accessed at GitHub