Use Flow to extract text from scanned content using OCR

A common use for OCR (Optical Character Recognition) is to recognise all text and make a PDF – that previously just contained an image of text – fully searchable. This is excellent for making documents more discoverable.

There is another use for OCR, which is to extract text from image-based content. For example, if you deal with scans and faxes and need the actual text to be extracted and processed (e.g. placed in a SharePoint field, or used to make a decision based on an extracted value), then you may want to use the facility described below.

In this blog post, we will cover the following topics:

  1. Use the Muhimbi ‘Extract Text using OCR’ action to extract text from an image or image-based PDF (list attachment) and write the extracted text to a SharePoint List column.
  2. Explain how to prevent recursive (endlessly repeating) Flows.

Please note that extracting text will only work with image-based content (mainly scans and faxes). It is not possible to extract text from PDFs that contain ‘real text’, such as PDFs generated from MS-Word files.

Please make sure the following prerequisites are in place:

  • An Office 365 subscription with a SharePoint Online license.
  • Muhimbi PDF Converter Services Online full or trial subscription (Sign up). Note that the Free subscription does not support OCR.
  • Appropriate privileges to create Flows.
  • Working knowledge of SharePoint Online and Microsoft Flow.

Step 1: Setting up your SharePoint Online Environment.

Create a SharePoint Online List and add the following columns.

Extracted Text: Column type – Multiple lines of text – We will use this to store the text extracted from the attachments.

To Process: Column type – Yes/No (Default value ‘Yes’) – This will be used to prevent recursive Flows.

From a high level, our Flow looks as follows.

Step 2: For this demo, we will use the “When an item is created or modified” SharePoint Flow trigger.

In the trigger, specify the path to the SharePoint Online List to monitor for new items.

Step 3: Initialize the variables used later in the Flow (“OCR-Text” and “Temporary”) with reference to the screenshot below:

Step 4: In this step (Condition) we will manage the recursive event (continuous loop).

I am using the AND operator to prevent an endless loop from happening.

  • “Has attachment” (Output of “When an item is created or modified” trigger) is equal to “True”.
  • “To Process” (Output of “When an item is created or modified” trigger) is equal to “True”.

  • “To Process” is a column of type “Yes/No” whose default value is “Yes (true)”. The Flow OCRs the document only if both conditions evaluate to true; otherwise it terminates. We will set the “To Process” field to “False (No)” in the “Update item” action.
  • Now, as the Flow updates a column in the same item, the Trigger (‘When an item is created or modified’) will always be invoked by the ‘Update Item’. However, now that we have set the “To Process” field to ‘False’, the Flow will be terminated when it is triggered a second time.
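The condition can be sketched in a few lines of Python. This only illustrates the logic; it is not how Flow evaluates the condition internally, and the dictionary keys `HasAttachments` and `ToProcess` are hypothetical stand-ins for the trigger outputs.

```python
def should_process(item: dict) -> bool:
    """Mirror the Flow condition: both checks must be true (AND)."""
    has_attachments = bool(item.get("HasAttachments", False))
    to_process = bool(item.get("ToProcess", False))
    return has_attachments and to_process


# First run: the item is new, so "To Process" still has its default of Yes.
print(should_process({"HasAttachments": True, "ToProcess": True}))   # True
# Second run, caused by "Update item": "To Process" is now No.
print(should_process({"HasAttachments": True, "ToProcess": False}))  # False
```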

Confused? Just follow the instructions below and everything should be clear by the end.

Step 5: If both conditions evaluate to “true (Yes)”, the Flow OCRs the document; otherwise (“false (No)”) it terminates.

Step 6: Add the “Get attachments” SharePoint action and specify the path to the SharePoint Online List.

Id: Be very careful when selecting this value. It must be the upper-case “ID” output of the “When an item is created or modified” trigger.

This is the most important part of our Flow.

Step 7: As a List item can have multiple attachments, add the ‘Apply to each’ loop and set it to the “Body” field, output of the SharePoint Online ‘Get attachments’ action.

Step 8: Add the SharePoint Online “Get attachment Content” action and specify the path to the SharePoint Online List.

  • Id: This must be the upper-case “ID” output of the “When an item is created or modified” trigger.
  • File Identifier: “Id” is the output of the “Get attachments” action.

Step 9: This is where the real magic happens, extracting the text from the image. In this example we keep it easy and we extract all text from the page. Note that it is also possible to specify a range of coordinates to extract the text from.

  • Source file name: Use the “Display Name”, i.e. the output of the “Get attachments” action.
  • Source File Content: The content of the file to process. Use the “Attachment Content”, the output of the “Get attachment content” action.
  • Language: The language the source document is written in. It defaults to English, but other languages are supported as well: Arabic, Danish, German, English, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish and Swedish.
  • Region: Specify the x, y, width and height of the region to retrieve text from. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, then please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single – unified – UOM is used across all file formats. If you are not sure how internal conversion affects the dimensions of your image or scan then convert the file to PDF and open it in a PDF reader.
  • Page number: By default text is extracted from all pages and concatenated. To extract the text from a specific page specify the page number in this field.
  • Performance: Specify the performance / accuracy of the OCR engine. It is recommended to leave this on the default ‘Slow but accurate’ setting.
  • Whitelist / Blacklist: Control which characters are recognised. For example, limit recognition to numbers by white-listing 1234567890. This prevents, for example, a 0 (zero) from being recognised as the letter o or O.
  • Use Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
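Because the Region values are expressed in units of 1/72nd of an inch, coordinates measured in pixels on a scan need converting first. The Python helper below is a minimal sketch of that arithmetic. The function names are made up for illustration and are not part of any Muhimbi API, and the calculation assumes you know the scan's DPI. It does not account for any margins added when a non-PDF image is converted internally.

```python
POINTS_PER_INCH = 72  # the Region unit of measure is 1/72nd of an inch


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel measurement to 1/72-inch units for a given scan DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixels * POINTS_PER_INCH / dpi


def region_in_points(x_px: float, y_px: float, width_px: float,
                     height_px: float, dpi: float) -> dict:
    """Convert a pixel rectangle to x, y, width and height in 1/72-inch units."""
    return {
        "x": pixels_to_points(x_px, dpi),
        "y": pixels_to_points(y_px, dpi),
        "width": pixels_to_points(width_px, dpi),
        "height": pixels_to_points(height_px, dpi),
    }


# A 1200 x 150 pixel box at (300, 600) on a 300 DPI scan.
print(region_in_points(300, 600, 1200, 150, dpi=300))
# {'x': 72.0, 'y': 144.0, 'width': 288.0, 'height': 36.0}
```

If the converted values do not line up with the text you expect, convert the file to PDF and check the page dimensions in a PDF reader, as suggested above.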

Step 10: Set variable (OCR-Text). Set the value of OCR-Text to the “Temporary” variable followed by “Out text”, the output of the “Extract text using OCR” action. Using this trick we concatenate the text of all the list item’s attachments into a single variable.

Step 11: Set variable (Temporary). Set the value of Temporary to “OCR-Text”.
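Steps 10 and 11 together behave like the loop below. This is a Python sketch of the idea only. The variable names mirror the Flow variables, and the list of strings stands in for the “Out text” produced for each attachment.

```python
def concatenate_ocr_text(attachment_texts: list) -> str:
    """Accumulate the OCR output of every attachment using two variables."""
    ocr_text = ""   # the "OCR-Text" variable
    temporary = ""  # the "Temporary" variable
    for out_text in attachment_texts:   # the "Apply to each" loop
        ocr_text = temporary + out_text  # Step 10: Temporary followed by Out text
        temporary = ocr_text             # Step 11: remember the running total
    return ocr_text


print(concatenate_ocr_text(["Invoice 42 ", "Total: 10"]))  # Invoice 42 Total: 10
```

Note that the text is joined exactly as it comes out of the OCR action. If you want a separator between attachments, it has to be added to the expression in Step 10.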

Step 12: Add the SharePoint Online “Update item” action, and specify the path to the SharePoint Online List for the item to be updated. Note: This Action is outside the “Apply to Each” Loop.

  • Id: This must be the upper-case “ID” output of the “When an item is created or modified” trigger.
  • Title: “Title”, the output of the “When an item is created or modified” trigger.
  • Extracted Text: “OCR-Text”, the output of Set variable (OCR-Text).
  • To Process: Set it to “No”. Note: This is very important to prevent a continuous loop.

Let’s Test our Flow:

  • Create a SharePoint List item and add two attachments.

Important: The “To Process” field is set to “Yes”.

  • The Flow completed successfully.

The SharePoint Online List item is updated.

Important: The “To Process” field is set to “No”.

As the SharePoint List item is updated, the Flow will be triggered again. However, this time it will not OCR the document; it will just terminate the Flow.

This is how we avoid continuous loops.
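To see why the Flow stops after two runs, here is a small Python simulation of the trigger and update cycle. It is a sketch under stated assumptions: the dictionary keys are hypothetical, the OCR step is replaced by a placeholder that simply joins the attachment names, and `simulate_trigger` only imitates the SharePoint trigger firing again after each update.

```python
def run_flow(item: dict) -> str:
    """One Flow run: process the item, or terminate if the guard fails."""
    if not (item["HasAttachments"] and item["ToProcess"]):
        return "terminated"
    # Placeholder for the OCR step: pretend each attachment yields its own name.
    item["ExtractedText"] = " ".join(item["Attachments"])
    item["ToProcess"] = False  # the "Update item" action clears the flag
    return "updated"


def simulate_trigger(item: dict, max_runs: int = 10) -> list:
    """Re-run the Flow after every update, as the trigger would."""
    outcomes = []
    for _ in range(max_runs):
        outcome = run_flow(item)
        outcomes.append(outcome)
        if outcome == "terminated":
            break  # nothing was updated, so the trigger does not fire again
    return outcomes


item = {
    "HasAttachments": True,
    "ToProcess": True,
    "Attachments": ["scan1.pdf", "fax2.tif"],
}
print(simulate_trigger(item))  # ['updated', 'terminated']
```

Without the line that clears the flag, every run would update the item and the simulation would only stop at `max_runs`, which is the endless loop the “To Process” column is there to prevent.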

I hope this article was informative and thanks for reading.
