Most organisations have already transitioned to keeping records and information in electronic format. Now, the push is to rationalize all those different file types into one common and easily indexed format. Companies don’t want to keep old documents as unsearchable scans, or monthly ledger balances as files that require knowledge of (and access to) Excel to navigate. They want everything stored as searchable PDFs that can be indexed by search engines and kept at their employees’ fingertips for years to come.
This requires not only the ability to automatically convert various document formats to PDF, but also the ability to perform Optical Character Recognition (OCR) on scanned documents in order to convert image-based content into searchable and indexable text.
In this post we will use the Muhimbi PDF Converter connector, which comes with a number of OCR-related facilities. These include the ability to make image-based files (scans, faxes) fully searchable, as well as the ability to extract image-based text so that information such as invoice numbers, purchase order numbers and other identifying details can be organised, searched, and used as part of a larger software or workflow process.
In this example we use SharePoint Online as the source and the destination of the content, but this works equally well with other storage providers including OneDrive, Google Drive, Dropbox, etc.
The purpose of this post is to design a Power Automate (Flow) solution that will show how to implement OCR.
This post is part 1 of a 2-part series about OCR in Power Automate.
- Convert image-based SharePoint files to OCRed PDFs (this post).
- Extract text from image-based SharePoint files.
Before you begin, please make sure the following prerequisites are in place:
- An Office 365 subscription with access to Power Automate (Flow).
- Muhimbi PDF Converter Services Online full subscription with OCR capability or trial subscription (Start trial). Do not use the Free subscription as it doesn’t support OCR; the free Trial subscription works fine.
- Appropriate privileges to create Flows.
- Working knowledge of Power Automate (Flow).
Create a Power Automate solution to convert an image-based file uploaded to a document library to an OCRed PDF
Let’s first see how the basic structure of our Power Automate (Flow) looks:
Step 1 – Trigger
- The trigger to be used is ‘When a file is created in a folder’ in SharePoint.
- Whenever a file gets uploaded to the selected folder, the Power Automate (Flow) will get triggered automatically.
- For the ‘Site Address’ in the image below, choose the correct site address from the drop down menu.
- For the ‘Folder Id’ in the image below, select the source folder.
Step 2 – Get file content
- For the ‘Site Address’ in the image below, specify the same address as used in the Trigger in step 1.
- In the ‘File identifier’ field, navigate to the ‘Add Dynamic content’ line and choose the ‘File identifier’ option inside the ‘When a file is created in a folder’ trigger.
Step 3 – Convert to OCRed PDF
- Next, we will use the action ‘Convert to OCRed PDF’ (a native action of Muhimbi’s Online PDF Converter shown in the image below) to generate an OCRed PDF version of the uploaded document.
- Populate the ‘Source file name’ with the ‘File name’ returned by the trigger. In the ‘Source file content’ field specify the ‘File Content’ returned by the ‘Get file content’ action.
- Language: This is the language used by the source document. It defaults to English, but there is also support for Arabic, Danish, German, Dutch, Finnish, French, Hebrew, Hungarian, Italian, Norwegian, Portuguese, Spanish, and Swedish languages.
- Performance: Specify the performance / accuracy of the OCR engine. I recommend leaving this on the default ‘Slow but accurate’ setting in order to get the best results.
- White list / Blacklist: Control which characters are recognized. For example, limit recognition to numbers by white listing 1234567890. This prevents, for example, a 0 (zero) from being recognized as the letter o or O.
- Use Pagination: In some specific cases a single image spans multiple pages. Enable pagination for those cases.
- Regions: Specify the x, y, width and height of the region to retrieve text from. More on this topic in part 2 of this series. The unit of measure (UOM) is 1/72nd of an inch. When extracting text from non-PDF files, e.g. a TIFF or PNG, please take into account that internally the image is first converted to PDF, which may add margins around the image but guarantees that a single, unified UOM is used across all file formats. If you are not sure how internal conversion affects the dimensions of your image or scan, use Muhimbi’s software to convert the file to PDF and open it in a PDF reader.
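To make the unit arithmetic for regions concrete, here is a minimal Python sketch that converts a box measured in pixels on a scan into 1/72-inch units. The `Region` type, the `pixels_to_points` helper and the 300 DPI figure are assumptions made for illustration; they are not part of the Muhimbi connector. The sketch also ignores any margins the internal conversion to PDF may add, so treat the result as a starting point and check it against the converted PDF.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A rectangular region: x and y offsets plus width and height."""
    x: float
    y: float
    width: float
    height: float


def pixels_to_points(region_px: Region, dpi: float) -> Region:
    """Convert a pixel-based region to 1/72-inch units for a given scan resolution."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    scale = 72.0 / dpi  # one pixel is 1/dpi inch, one unit is 1/72 inch
    return Region(*(round(value * scale, 2) for value in region_px))


# A box 600 px wide and 150 px high, offset by 300 px and 450 px, on a 300 DPI scan.
print(pixels_to_points(Region(300, 450, 600, 150), dpi=300))
# Region(x=72.0, y=108.0, width=144.0, height=36.0)
```

At 300 DPI every pixel is 0.24 units, so a 600 pixel wide box becomes 144 units, i.e. two inches.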
Step 4 – Create an OCRed PDF in SharePoint
- Enter the ‘Site Address’ of the SharePoint Site collection to write the OCRed PDF to.
- Similarly, select the ‘Folder Path’ where the OCRed PDF should be placed.
- Give a meaningful File Name to the created PDF. Remember to add the extension .pdf after the File Name, and make the file name unique, otherwise multiple runs of the flow will overwrite the same file. I recommend basing it on the source file name, but with some kind of suffix.
- To populate the ‘File Content’ field, select the ‘Processed file content’ option (shown in the image below), which is available in the ‘Dynamic content’ section under the ‘Convert to OCRed PDF’ action.
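The file naming advice above can be sketched in a few lines of Python. This only illustrates the naming scheme; in the flow itself the name would be assembled in the ‘File Name’ field from dynamic content. The `ocr_output_name` helper and the `_ocr_` plus timestamp suffix are assumptions chosen for this example, and any other suffix that keeps names unique would do.

```python
from datetime import datetime, timezone
from pathlib import PurePath
from typing import Optional


def ocr_output_name(source_name: str, now: Optional[datetime] = None) -> str:
    """Build '<source stem>_ocr_<timestamp>.pdf' from the source file name."""
    now = now or datetime.now(timezone.utc)  # default to the current UTC time
    stem = PurePath(source_name).stem  # drop the original extension, e.g. '.tiff'
    return f"{stem}_ocr_{now:%Y%m%dT%H%M%S}.pdf"


print(ocr_output_name("invoice-scan.tiff", datetime(2024, 1, 31, 9, 30, 0)))
# invoice-scan_ocr_20240131T093000.pdf
```

A timestamp with one-second resolution keeps repeated runs from overwriting each other in most cases, but two files with the same name processed within the same second would still collide.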
That is it, all done. Save the flow and upload an image-based file (e.g. a scanned PDF) to the source folder. Depending on the size of the document, after some time an OCRed version will be available in the destination folder. Open it in your favourite PDF reader and verify that you can now select or search the text.