Extract PDF Forms Data and Parse XML in Power Automate using XPath

In this post, we’ll extract PDF form data as XML, parse that XML data and store it in a SharePoint List using Power Automate.

We will also explore how we can parse the XML data to extract meaningful information using the XPath() function and the Parse JSON(my next post) action in Power Automate.

Scenario

Employees download a Form from the company portal, fill in the relevant details, and then upload the form to a SharePoint Document Library. This Power Automate solution(Microsoft Flow) picks up the uploaded form automatically, extracts the data contained within the PDF form, and then adds the data to a SharePoint list.

Before we begin, please make sure the following prerequisites are in place:

  • Power Automate subscription.
  • Muhimbi PDF Converter Services Online Free, full or trial subscription (Start trial). If you are only interested in parsing XML, you can skip this step and simply copy the XML which I have provided below.
  • Appropriate privileges to create Power Automate Flows.
  • Working knowledge  Power Automate.

First, we’ll extract PDF form data using the xpath() function.

Note – If you are trying to understand how XPath works you can skip Step 1 to Step 5, copy the XML in the compose action and then follow along.

Step 1 – Create a new Flow using the ‘Automated cloud flow‘ option.

Step 2 – Give your Flow a meaningful name, select the ‘When a file is created in a folder‘ SharePoint trigger, and click ‘Create‘.

  • In the trigger, specify the path to the SharePoint Online Library to monitor for new file.

Step 3 –  Add the SharePoint “Get file content” action to the Flow canvas and configure it with reference to the details below:

  • Site Address: Specify the path to the SharePoint Online site-collection which holds the file.
  • File Identifier : ‘x-ms-file-id‘ the output from the ‘When a file is created in a folder” action.

Step 4 – Add the Muhimbi “Convert document” action to the Flow canvas and configure it with reference to the details below.

  • Source File name:File name with extension” the output from the “When a file is created or modified (properties only)” action.
  • Source file content:File Content” the output of the “Get file content” action.
  • Output Format: XML

Step 5 –

  • Add the “Compose” action to the Flow canvas and add the expression below to it.
base64ToString(outputs('Convert_document')?['body/processed_file_content'])
  • The expression will convert “Base 64 to String”.
  • Click the “Test” button in the top right corner.
  • Select the “Manually” option.
  • Upload a supported file type to the folder that is monitored by the configured SharePoint trigger
  • After a few seconds, you should see the output below –

XML output below –


<?xml version="1.0" encoding="utf-8"?>
<fields xmlns:xfdf="http://ns.adobe.com/xfdf-transition/">
  <GivenNameTextBox xfdf:original="Given Name Text Box">Clavin</GivenNameTextBox>
  <FamilyNameTextBox xfdf:original="Family Name Text Box">Fernandes</FamilyNameTextBox>
  <Address1TextBox xfdf:original="Address 1 Text Box">Sai Jewel</Address1TextBox>
  <HousenrTextBox xfdf:original="House nr Text Box">Central Park</HousenrTextBox>
  <Address2TextBox xfdf:original="Address 2 Text Box">Kharghar</Address2TextBox>
  <PostcodeTextBox xfdf:original="Postcode Text Box">410210</PostcodeTextBox>
  <CityTextBox xfdf:original="City Text Box">Navi Mumbai</CityTextBox>
  <CountryComboBox xfdf:original="Country Combo Box">India</CountryComboBox>
  <GenderListBox xfdf:original="Gender List Box">Man</GenderListBox>
  <HeightFormattedField xfdf:original="Height Formatted Field">180</HeightFormattedField>
  <DrivingLicenseCheckBox xfdf:original="Driving License Check Box">Yes</DrivingLicenseCheckBox>
  <FavouriteColourListBox xfdf:original="Favourite Colour List Box">Yellow</FavouriteColourListBox>
</fields>

Before we extract meaningful data from the above XML, let’s first get the basics of XML defined.

What is XML?

XML stands for eXtensible Markup Language. If you have been working with HTML you notice that XML is a markup language much like it. However, they function very differently as XML was designed to carry data and HTML was designed to display data.

In the above XML, the data is we are interested in  – “Clavin”, “Fernandes” etc… In other words the text in between the two arrows>< brackets.

Tip – Before you start working with XML in Power Automate, please check the XML for Syntax Errors. You can use a Free Online Validator(link).

Now that we understand XML, What is xPath and how does it help to Extract Data from XML?

XPATH checks the XML for nodes or values that match an XPath (XML Path Language) expression and returns the matching nodes or values. An XPath expression, or just “XPath”, helps you navigate an XML document structure so that you can select nodes or compute values in the XML content.

Tip –  Use an XPath Online Tester before you add it to the Power Automate Expression.

For the above XML let’s look at a few test cases :

Test CaseXpathValue
Select the document node/Root is [Element :<fields/>]
Select all fields under the root node/fieldsIt returns all node under the <fields/>
Select a single node within the GivenNameTextBox/fields/GivenNameTextBoxIt returns the entire GivenNameTextBox node – <GivenNameTextBox xfdf:original=”Given Name Text Box”>Clavin</GivenNameTextBox>
Select the the value from the GivenNameTextBox Node/fields/GivenNameTextBox/text()It return the value “Clavin”

Step 6 –

Extract the text() value from XML using the above xPath expression.

Syntax –

xpath('<xml>', '<xpath>')
<xml>YesAnyThe XML string to search for nodes or values that match an XPath expression value
<xpath>YesAnyThe XPath expression is used to find matching XML nodes or values
  • It will return – An array with XML nodes or values that match the specified XPath expression.
  • In our case, there is only one element in the array, so we’ll use the first value using the Expression first()
first(xpath(xml(outputs('Base64_to_XML(String)')),'/fields/GivenNameTextBox/text()'))

The output of the above action :

Step 7 – Add SharePoint “Create Item” action, we will directly pass the expressions in the “Create item” action.

  • Site Address: Enter the location of the SharePoint list where list item should be added.
  • List Name: Select the SharePoint list where the list item should be added

Note: In this step we will directly add the Expressions and Maps the Expressions to the SharePoint list.

  • Title – first(xpath(xml(outputs(‘Base64_to_XML(String)’)),’/fields/GivenNameTextBox/text()’))
  • FamilyNameTextBox – first(xpath(xml(outputs(‘Base64_to_XML(String)’)),’/fields/FamilyNameTextBox/text()’))
  • Address1TextBox -first(xpath(xml(outputs(‘Base64_to_XML(String)’)),’/fields/Address1TextBox/text()’))
  • HousenrTextBox -first(xpath(xml(outputs(‘Base64_to_XML(String)’)),’/fields/HousenrTextBox/text()’))

  • That’s it- you’re done!  Publish your Flow and upload a supported file type to the folder that is monitored by the specified SharePoint trigger. After a short wait, a new list item within the target SharePoint list should have been created.

Subscribe to this blog for the latest updates about SharePoint Online, Microsoft Flow, Power Apps and document conversion and manipulation using The PDF Converter.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s