Bringing Various Subtypes of XML Documents to a Generic Form

The test project credit-forms-unification-sample shows how to parse XML documents that differ in structure but carry similar content.

Let's assume that we have loan applications from three different banks.

DeutscheBank:

                    <?xml version="1.0" encoding="UTF-8"?>
                    <LoanApplication>
                      <ApplicantInfo>
                        <FullName>
                          <FirstName>Sofia</FirstName>
                          <LastName>Lorenzo</LastName>
                        </FullName>
                        <DateOfBirth>1992-03-24</DateOfBirth>
                        <TaxIdentificationNumber>545-56-7190</TaxIdentificationNumber>
                        <ContactInfo>
                          <EmailAddress>sofia.l@example.com</EmailAddress>
                          <PhoneNumber>+39 02 98765432</PhoneNumber>
                          <MailingAddress>
                            <Street>28 Via Roma</Street>
                            <City>Milan</City>
                            <ZipCode>20121</ZipCode>
                            <Country>Italy</Country>
                          </MailingAddress>
                        </ContactInfo>
                        <MaritalStatus>Single</MaritalStatus>
                        <Dependents>0</Dependents>
                        <Nationality>Italy</Nationality>
                      </ApplicantInfo>
                      <EmploymentInfo>
                        <EmploymentStatus>Employed</EmploymentStatus>
                        <Employer>
                          <Name>Milan General Hospital</Name>
                          <Location>
                            <Street>10 Corso Vittorio Emanuele II</Street>
                            <City>Milan</City>
                            <ZipCode>20122</ZipCode>
                            <Country>Italy</Country>
                          </Location>
                        </Employer>
                        <JobDetails>
                          <JobTitle>Registered Nurse</JobTitle>
                          <JobDuration>7 years</JobDuration>
                        </JobDetails>
                      </EmploymentInfo>
                      <FinancialInfo>
                        <Salary>50000.75</Salary>
                        <CreditRating>710</CreditRating>
                        <MonthlyExpenses>2700.50</MonthlyExpenses>
                        <LoanAmount>42000.00</LoanAmount>
                        <LoanPurpose>Further Education</LoanPurpose>
                        <ApplicationDate>2024-03-25</ApplicationDate>
                        <LoanOfficerName>Luca Rossi</LoanOfficerName>
                      </FinancialInfo>
                    </LoanApplication>
                
                
Barclays:
                
                    <?xml version="1.0" encoding="UTF-8"?>
                    <LoanApplication>
                      <docDate>2024-03-25</docDate>
                      <Applicant>
                        <Name>
                          <First>Lisa</First>
                          <Last>Martin</Last>
                        </Name>
                        <DateOfBirth>1987-12-12</DateOfBirth>
                        <TaxID>456-78-9012</TaxID>
                        <Location>
                          <Street>12 Binnenhof</Street>
                          <City>Amsterdam</City>
                          <ZipCode>1011 AB</ZipCode>
                          <Country>Netherlands</Country>
                        </Location>
                        <ContactInfo>
                          <Email>lisa.m@example.com</Email>
                          <Phone>+31 20 9876543</Phone>
                        </ContactInfo>
                        <MaritalStatus>Single</MaritalStatus>
                        <Dependents>0</Dependents>
                        <Nationality>Netherlands</Nationality>
                        <PreviousLocation>
                          <Street>28 Piazza San Marco</Street>
                          <City>Venice</City>
                          <ZipCode>30124</ZipCode>
                          <Country>Italy</Country>
                        </PreviousLocation>
                        <ResidenceStatus>Own</ResidenceStatus>
                        <ResidenceDuration>6</ResidenceDuration>
                      </Applicant>
                      <Employment>
                        <Employer>EuroTech Solutions</Employer>
                        <JobTitle>Marketing Specialist</JobTitle>
                        <JobDuration>5 years</JobDuration>
                        <WorkLocation>
                          <Street>7 Leopoldstraße</Street>
                          <City>Munich</City>
                          <ZipCode>80331</ZipCode>
                          <Country>Germany</Country>
                        </WorkLocation>
                      </Employment>
                      <FinancialInfo>
                        <Salary>71000</Salary>
                        <CreditRating>750</CreditRating>
                        <MonthlyExpenses>2600.50</MonthlyExpenses>
                        <LoanAmount>48000.00</LoanAmount>
                        <LoanPurpose>Education Expenses</LoanPurpose>
                      </FinancialInfo>
                    </LoanApplication>
                
                
RaboBank:
                
                    <?xml version="1.0" encoding="UTF-8"?>
                    <LoanApplication>
                      <documentDate>2024-03-24</documentDate>
                      <Applicant>
                        <FullName>
                          <FirstName>John</FirstName>
                          <LastName>Smith</LastName>
                        </FullName>
                        <DateOfBirth>1980-03-25</DateOfBirth>
                        <SocialSecurityNumber>987-65-4321</SocialSecurityNumber>
                        <Address>
                          <Street>12 Park Lane</Street>
                          <City>London</City>
                          <PostalCode>W1A 1AA</PostalCode>
                          <Country>United Kingdom</Country>
                        </Address>
                        <ContactInformation>
                          <Email>john.smith@example.com</Email>
                          <Phone>+44 20 12345678</Phone>
                        </ContactInformation>
                        <MaritalStatus>TRUE</MaritalStatus>
                        <Dependents>2</Dependents>
                        <Citizenship>United Kingdom</Citizenship>
                        <PreviousAddress>
                          <Street>34 Via Roma</Street>
                          <City>Rome</City>
                          <PostalCode>00100</PostalCode>
                          <Country>Italy</Country>
                        </PreviousAddress>
                        <ResidentialStatus>Rent</ResidentialStatus>
                        <YearsAtCurrentAddress>3</YearsAtCurrentAddress>
                      </Applicant>
                      <EmploymentInformation>
                        <Employer>Medical Clinic</Employer>
                        <Occupation>Physician assistant</Occupation>
                        <EmploymentDuration>5 years</EmploymentDuration>
                        <WorkLocation>
                          <Street>7 Friedrichstraße</Street>
                          <City>Berlin</City>
                          <PostalCode>10117</PostalCode>
                          <Country>Germany</Country>
                        </WorkLocation>
                      </EmploymentInformation>
                      <FinancialInformation>
                        <Income>75000.50</Income>
                        <CreditScore>800</CreditScore>
                        <MonthlyExpenses>3000.75</MonthlyExpenses>
                        <RequestedLoanAmount>50000.00</RequestedLoanAmount>
                        <PurposeOfLoan>Home Purchase</PurposeOfLoan>
                      </FinancialInformation>
                    </LoanApplication>
                
                

As seen from the examples, the naming and nesting of tags vary significantly.

Create a project named credit-forms-unification-sample and open it. Opening the project activates the buttons for adding sections and subsections.

Create a section in the project. The section name groups data logically, and each section can have its own document parsing rules. When exporting to NoSQL, the section name is used as the collection name. This project brings several different documents to one form, so it needs only one section and one subsection.

Let's add a subsection. In the case of tabular data representation, the subsection name should match the name of the table where the zero-level nested data will be placed.

Switch to the subsection and describe the data structure in the intermediate representation. More details on how to describe the intermediate representation are covered in other sections of the documentation. Save the changes.

Open the first XML document from the folder data-samples/credit-forms-unification-sample and click the Morphology button. After that, in the bottom menu, we will see a list of tags for which mappings need to be set.
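To make the idea of a "list of tags" concrete, here is a minimal Python sketch that collects the tag chain of every leaf element in a document. It is only an illustration built on the standard library, not the tool's own implementation; the function name `leaf_tag_chains` and the shortened sample are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt of the Barclays sample.
SAMPLE = """<LoanApplication>
  <Applicant>
    <Name>
      <First>Lisa</First>
      <Last>Martin</Last>
    </Name>
    <TaxID>456-78-9012</TaxID>
  </Applicant>
  <FinancialInfo>
    <Salary>71000</Salary>
  </FinancialInfo>
</LoanApplication>"""


def leaf_tag_chains(xml_text):
    """Return the slash-joined tag chain of every leaf element, in document order."""
    root = ET.fromstring(xml_text)
    chains = []

    def walk(element, prefix):
        path = prefix + [element.tag]
        children = list(element)
        if not children:
            # A leaf element carries a value, so its chain needs a mapping.
            chains.append("/".join(path))
        for child in children:
            walk(child, path)

    walk(root, [])
    return chains


chains = leaf_tag_chains(SAMPLE)
for chain in chains:
    print(chain)
```

Each printed chain corresponds to one entry that would need either a mapping or an ignore decision.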

Establish mappings between each element from the list of tags and the intermediate representation by clicking Add to Rules. If a tag chain is not needed, send it to the ignored list by using the Add to Ignore button.
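Conceptually, the result of this step is a table of rules plus an ignore list. The sketch below models that in Python under stated assumptions: the field names (`first_name`, `last_name`, `salary`), the choice of which tag to ignore, and the `apply_rules` helper are all hypothetical and do not describe how the tool stores or applies its rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: tag chain -> field name in the intermediate representation.
RULES = {
    "LoanApplication/ApplicantInfo/FullName/FirstName": "first_name",
    "LoanApplication/ApplicantInfo/FullName/LastName": "last_name",
    "LoanApplication/FinancialInfo/Salary": "salary",
}
# Hypothetical ignore list: tag chains that are deliberately skipped.
IGNORED = {"LoanApplication/FinancialInfo/LoanOfficerName"}

# Shortened excerpt of the DeutscheBank sample.
SAMPLE = """<LoanApplication>
  <ApplicantInfo>
    <FullName>
      <FirstName>Sofia</FirstName>
      <LastName>Lorenzo</LastName>
    </FullName>
  </ApplicantInfo>
  <FinancialInfo>
    <Salary>50000.75</Salary>
    <CreditRating>710</CreditRating>
    <LoanOfficerName>Luca Rossi</LoanOfficerName>
  </FinancialInfo>
</LoanApplication>"""


def apply_rules(xml_text, rules, ignored):
    """Return (record, unmapped): mapped values and chains that still need a decision."""
    root = ET.fromstring(xml_text)
    record, unmapped = {}, []

    def walk(element, prefix):
        chain = f"{prefix}/{element.tag}" if prefix else element.tag
        children = list(element)
        if children:
            for child in children:
                walk(child, chain)
        elif chain in rules:
            record[rules[chain]] = (element.text or "").strip()
        elif chain not in ignored:
            unmapped.append(chain)

    walk(root, "")
    return record, unmapped


record, unmapped = apply_rules(SAMPLE, RULES, IGNORED)
print(record)
print(unmapped)
```

In this sketch, `CreditRating` ends up in `unmapped` because it has neither a rule nor an ignore entry, which mirrors a tag that is still waiting in the list.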

Repeat this step for every tag chain in the list.

Now, after clicking the Morphology button, we will see our intermediate representation filled with data.

Open the second XML file and click Morphology. Because its XML structure is different, only a small portion of the data is mapped.
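One way to see why so little is mapped is to compare tag chains between the two files. The sketch below assumes, purely for illustration, that a rule matches on the full tag chain; the chain sets are abbreviated from the DeutscheBank and Barclays samples, and the variable names are hypothetical.

```python
# Leaf tag chains, abbreviated from the DeutscheBank and Barclays samples.
DEUTSCHE_CHAINS = {
    "LoanApplication/ApplicantInfo/FullName/FirstName",
    "LoanApplication/ApplicantInfo/DateOfBirth",
    "LoanApplication/ApplicantInfo/TaxIdentificationNumber",
    "LoanApplication/FinancialInfo/Salary",
    "LoanApplication/FinancialInfo/LoanAmount",
}
BARCLAYS_CHAINS = {
    "LoanApplication/Applicant/Name/First",
    "LoanApplication/Applicant/DateOfBirth",
    "LoanApplication/Applicant/TaxID",
    "LoanApplication/FinancialInfo/Salary",
    "LoanApplication/FinancialInfo/LoanAmount",
}

# Chains that the rules written for the first file already cover.
already_mapped = sorted(DEUTSCHE_CHAINS & BARCLAYS_CHAINS)
# Chains that appear only in the second file and still need mappings.
still_unmapped = sorted(BARCLAYS_CHAINS - DEUTSCHE_CHAINS)

print(already_mapped)
print(still_unmapped)
```

Only the chains under `FinancialInfo` coincide here; everything under `Applicant` differs from `ApplicantInfo` and would need its own mapping.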

Perform mapping for all remaining tags.

Open the third file and repeat all actions for it.

After that, its structure will look like the following:

However, now if we try to perform parsing by clicking Parse, we will get an incorrect result where part of the data is missing:

As the example shows, the data for the applicant_finance section is missing. In the XML files, this section is spelled in two different ways: FinancialInfo (DeutscheBank, Barclays) and FinancialInformation (RaboBank).

DeutscheBank:
                
                      <FinancialInfo>
                        <Salary>50000.75</Salary>
                        <CreditRating>710</CreditRating>
                        <MonthlyExpenses>2700.50</MonthlyExpenses>
                        <LoanAmount>42000.00</LoanAmount>
                        <LoanPurpose>Further Education</LoanPurpose>
                        <ApplicationDate>2024-03-25</ApplicationDate>
                        <LoanOfficerName>Luca Rossi</LoanOfficerName>
                      </FinancialInfo>
                
                
Barclays:
                
                      <FinancialInfo>
                        <Salary>71000</Salary>
                        <CreditRating>750</CreditRating>
                        <MonthlyExpenses>2600.50</MonthlyExpenses>
                        <LoanAmount>48000.00</LoanAmount>
                        <LoanPurpose>Education Expenses</LoanPurpose>
                      </FinancialInfo>
                
                
RaboBank:
                
                      <FinancialInformation>
                        <Income>75000.50</Income>
                        <CreditScore>800</CreditScore>
                        <MonthlyExpenses>3000.75</MonthlyExpenses>
                        <RequestedLoanAmount>50000.00</RequestedLoanAmount>
                        <PurposeOfLoan>Home Purchase</PurposeOfLoan>
                      </FinancialInformation>
                
                

Switch to Grow Rules and specify every spelling of the section name that occurs in the actual files:
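The effect of listing alternative spellings can be pictured as an alias table that resolves every variant to one canonical name. The Python sketch below is an illustration only: `applicant_finance` comes from this project, while the alias dictionaries, the field names `salary` and `loan_amount`, and the `extract_sections` helper are assumptions, not the Grow Rules format.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias tables: every spelling seen in real files -> one canonical name.
SECTION_ALIASES = {
    "FinancialInfo": "applicant_finance",
    "FinancialInformation": "applicant_finance",
}
FIELD_ALIASES = {
    "Salary": "salary",
    "Income": "salary",
    "LoanAmount": "loan_amount",
    "RequestedLoanAmount": "loan_amount",
}

# Shortened excerpts of the Barclays and RaboBank samples.
BARCLAYS = """<LoanApplication>
  <FinancialInfo>
    <Salary>71000</Salary>
    <LoanAmount>48000.00</LoanAmount>
  </FinancialInfo>
</LoanApplication>"""
RABOBANK = """<LoanApplication>
  <FinancialInformation>
    <Income>75000.50</Income>
    <RequestedLoanAmount>50000.00</RequestedLoanAmount>
  </FinancialInformation>
</LoanApplication>"""


def extract_sections(xml_text):
    """Collect known sections and fields under their canonical names."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in root:
        canonical = SECTION_ALIASES.get(section.tag)
        if canonical is None:
            continue  # unknown section spelling: its data would be lost
        fields = result.setdefault(canonical, {})
        for leaf in section:
            name = FIELD_ALIASES.get(leaf.tag)
            if name is not None:
                fields[name] = (leaf.text or "").strip()
    return result


barclays_result = extract_sections(BARCLAYS)
rabobank_result = extract_sections(RABOBANK)
print(barclays_result)
print(rabobank_result)
```

Removing the `FinancialInformation` entry from `SECTION_ALIASES` would make the RaboBank result empty, which is the kind of missing data described above.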

Now, after clicking the Parse button, the intermediate representation will look like this:

JSON representation:

SQL representation:

Please note that the provided data is not suitable for database export yet, as there is no type conversion, and all tag values are represented as strings.
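As a rough idea of what such a type conversion involves, the following Python sketch turns string values into numbers and dates before export. The field names and the `CONVERTERS` table are hypothetical and are not part of the project; the values are taken from the DeutscheBank sample.

```python
from datetime import date
from decimal import Decimal

# Hypothetical converters per field; unlisted fields stay strings.
CONVERTERS = {
    "salary": Decimal,                    # exact decimal, safer than float for money
    "credit_rating": int,
    "dependents": int,
    "date_of_birth": date.fromisoformat,  # expects YYYY-MM-DD
}


def convert_types(record, converters):
    """Return a copy of record with each value converted by its field's converter."""
    return {
        field: converters.get(field, str)(value)
        for field, value in record.items()
    }


# Raw values as they come out of the XML: everything is a string.
raw = {
    "first_name": "Sofia",
    "date_of_birth": "1992-03-24",
    "dependents": "0",
    "salary": "50000.75",
    "credit_rating": "710",
}

typed = convert_types(raw, CONVERTERS)
print(typed)
```

A value that does not fit its converter, such as a non-ISO date, raises `ValueError` in this sketch, so malformed input is noticed before it reaches the database.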

Learn more about how to extract an arbitrary set of fields from XML and use it to update an existing database in this guide.