Type Casting in XML Documents

By default, all tag values are extracted as text. For example, a tag value of 123 is extracted as the string "123".

SmartXML provides several mechanisms for handling different types, described below.

Handling Simple Cases:

When data can be unambiguously interpreted as integer, float, or bool, you can use the Rules → Tag Casting Rules tab to specify which tags should be cast to which types.

Handling Complex Cases:

When data must be cleaned or modified before extraction, SmartXML offers an embedded TinyNLP engine based on Red/Parse. The engine handles preprocessing tasks that are hard or impossible to express with regular expressions, or in languages without built-in grammar support.

Note that Red/Parse provides only a small subset of text-processing functions and cannot replace full-fledged NLP processing. In certain situations, however, its capabilities are sufficient for a specific task.

Using TinyNLP requires writing code, but for most XML either no code is needed or the amount is minimal. Rules are expected to be tested and debugged directly in the Red language interpreter.

Let's assume we have the following improperly filled XML:


        <?xml version="1.0" encoding="UTF-8"?>
        <data>
          <name>John Doe</name>
          <age>32</age>
          <height>185.5</height>
          <salary>245000€</salary>
          <contactInfo>phone: +31 619653239 email: mail@example.org</contactInfo>
          <isCitizen>Yes</isCitizen>
          <MaritalStatus>Single</MaritalStatus>
        </data>
      

Project name: complex-types-sample

The project contains one section, Section-A, which contains one subsection named sample.

The test XML is located in the directory: data-samples/complex-types-sample/example.xml

This XML has the following issues:

  • salary contains a currency symbol where there should be a number
  • contactInfo contains both a phone number and an email address
  • isCitizen and MaritalStatus hold boolean values written as text
  • Additionally, let's assume we want to convert the salary from euros to dollars at the current exchange rate

We will describe the data in the sample subsection in the format of an intermediate representation. We will also specify the fields we want to extract from the original data and save the result.


        name: none
        age: none
        height: none
        contact_info: none
        contact_phone: none
        contact_email: none
        salary: none
        salary_currency: none
        salary_in_usd: none
        is_sitizen: none
        is_marital: none
      

Click Morphology and map the names in the intermediate representation to the tags in the XML. As a result, we get the following representation:


          name: "John Doe"
          age: "32"
          height: "185.5"
          contact_info:  "phone: +31 619653239 email: mail@example.org"
          contact_phone: none
          contact_email: none
          salary: "245000€"
          salary_currency: none
          salary_in_usd: none
          is_sitizen: "Yes"
          is_marital: "Single"
      

Now we need to create rules that fill in the fields whose value is still none.

Open the project file in a text editor: projects/complex-types-sample/rules/complex-extract-rules.red. In it, for each tag of each section, you can specify rules for its processing or splitting into other tags.

By default, it only contains the names of the created sections and looks like this:


            section-A: []
        

The format for writing rules is as follows:


          tag-name: [
              derivative-tagA: [
              ; processing of tag-name content. Its value is in the tag-value variable
              ]
              derivative-tagB: [
              ; processing of tag-name content. Its value is in the tag-value variable
              ]
              derivative-tagC: [
              ; processing of tag-name content. Its value is in the tag-value variable
              ]
          ]
          another-tag-name: [
              another-tag-name: [
              ; processing of another-tag-name content. Its value is in the tag-value variable
              ]
              another-tag-derivativeA: [
              ; processing of another-tag-name content. Its value is in the tag-value variable
              ]
          ]
        

Inside the tag to be processed, there must be at least one child. If the child's name matches the parent's name, it means processing the parent tag itself.

tag-value is a variable that contains the value of the tag being processed. It is passed implicitly, so you only need to write the code that processes it. For testing, you can replace tag-value with a literal test string, but restore the name tag-value when you insert the code into the rule.

Processing values occurs from top to bottom. If you change tag-value at step A, then at step B, you will get the modified value.
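The top-to-bottom behaviour described above can be modelled outside SmartXML. The following is a minimal Python sketch, not SmartXML's implementation: run_rules, the state dictionary, and the three rule functions are hypothetical names introduced only to show why a rule that changes the shared value affects every rule after it.

```python
# Illustrative model only: SmartXML evaluates Red blocks, not Python functions.

def run_rules(tag_value, rules):
    """Apply derivative-tag rules in order, sharing one mutable tag value."""
    state = {"tag_value": tag_value}   # plays the role of tag-value
    results = {}
    for name, rule in rules.items():   # dicts keep insertion order: top to bottom
        results[name] = rule(state)
    return results


def salary_on_copy(state):
    # str.replace returns a new string, so the shared value stays untouched.
    return state["tag_value"].replace("€", "")


def salary_in_place(state):
    # Overwrites the shared value, so every later rule sees the change.
    state["tag_value"] = state["tag_value"].replace("€", "")
    return state["tag_value"]


def salary_currency(state):
    return "EUR" if "€" in state["tag_value"] else None


safe = run_rules("245000€", {"salary": salary_on_copy, "salary_currency": salary_currency})
unsafe = run_rules("245000€", {"salary": salary_in_place, "salary_currency": salary_currency})
print(safe)    # {'salary': '245000', 'salary_currency': 'EUR'}
print(unsafe)  # {'salary': '245000', 'salary_currency': None}
```

In the second run the currency can no longer be detected, because the first rule already removed the symbol from the shared value. Working on a copy in early rules avoids this.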

Each block must return a value. The return can be explicit, or implicit: the value of the last expression in the block is used as the result, so a block that ends with a function returning the modified value (for example, replace) needs no explicit return.


        section-A: [
            salary: [
                salary: [
                    ; remove the currency symbols. copy leaves tag-value unchanged
                    ; for the rules below; the result becomes the value of salary itself
                    cleaned: copy tag-value
                    replace cleaned "$" ""
                    replace cleaned "€" ""
                ]
                salary_currency: [
                    ; set the value of the salary_currency tag based on the currency symbol
                    result: none
                    if find tag-value "$" [result: "USD"]
                    if find tag-value "€" [result: "EUR"]
                    return result
                ]
                salary_in_usd: [
                    ; remove currency symbols from tag-value, since only the number is needed here
                    replace tag-value "$" ""
                    replace tag-value "€" ""
                    ; get the current rates from an external service
                    ; see https://fixer.io/ for how to obtain an API key
                    data: load-json read http://data.fixer.io/api/latest?access_key=[your-api-key]
                    ; print ["currency:" data/rates/USD] ; print if necessary
                    ; multiply the salary by the USD rate
                    ; and return it as the value of salary_in_usd
                    return round/even (data/rates/USD * (to-integer tag-value))
                ]
            ]

            contact_info: [
                ; extract the phone number from contact_info
                contact_phone: [
                    parse tag-value [thru "phone:" copy phone to "email:"]
                    ; trim removes the spaces around the captured text
                    return trim phone
                ]
                ; extract the email address from contact_info
                contact_email: [
                    parse tag-value [thru "email:" copy email to end]
                    return trim email
                ]
            ]

            is_marital: [
                ; replace the text with a word that can be cast to a boolean
                is_marital: [
                    replace tag-value "Single" "false"
                    replace tag-value "Married" "true"
                ]
            ]
        ]
      

This example only illustrates the general principles of complex type processing and is not universal.

The same tasks can be solved with different code.

The example demonstrates calling an external service to get the exchange rate. The example is purely illustrative and is not recommended for streaming processing of a large number of XML files.
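One common way around the per-file service call is to obtain the rate once and pass it to the conversion. The sketch below shows the idea in Python under stated assumptions: USD_PER_EUR is a made-up fixed value standing in for a rate read once from the service, and parse_salary and to_usd are hypothetical helpers, not part of SmartXML. Unlike the rule above, the sketch leaves amounts that are already in dollars unchanged.

```python
SYMBOLS = {"€": "EUR", "$": "USD"}

# Hypothetical value; in practice it would be read from the rate service once per batch.
USD_PER_EUR = 1.10


def parse_salary(raw):
    """Return (amount, currency code) for a value such as '245000€'."""
    code = None
    digits = raw
    for symbol, name in SYMBOLS.items():
        if symbol in digits:
            code = name
            digits = digits.replace(symbol, "")
    return int(digits.strip()), code


def to_usd(amount, code, usd_per_eur):
    """Convert to dollars using a rate supplied by the caller."""
    if code == "USD":
        return float(amount)
    if code == "EUR":
        return float(round(amount * usd_per_eur))
    raise ValueError(f"unsupported currency: {code!r}")


# The same rate is reused for every value; no network call happens per record.
raw_salaries = ["245000€", "$1000", "300€"]
converted = [to_usd(*parse_salary(raw), USD_PER_EUR) for raw in raw_salaries]
print(converted)  # [269500.0, 1000.0, 330.0]
```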

We provide assistance in writing rules for complex processing as part of Premium Support.

With these rules in place, the intermediate representation becomes:

          name: "John Doe"
          age: "32"
          height: "185.5"
          contact_info: "phone: +31 619653239 email: mail@example.org"
          contact_phone: "+31 619653239"
          contact_email: "mail@example.org"
          salary: "245000"
          salary_currency: "EUR"
          salary_in_usd: 267435.0
          is_sitizen: "Yes"
          is_marital: "false"
      

Note that only salary_in_usd has been type cast, because its rule converts the value explicitly with to-integer tag-value.

Although the salary rule modifies the value, it contains no type casting operation. You can either add one to the rule or cast the tag on the Rules → Tag Casting Rules tab.

The contents of age and height can be cast without any parsing rules. In this example both are cast to integer, so height loses its fractional part (185.5 becomes 185 in the output below); cast height to float if the fraction must be kept.

SmartXML allows converting text strings like yes, no, on, off, true, false to a boolean type without additional complex rules and code. However, for the is_marital: "Single" tag, you need to write a rule that initially replaces the specified string with one of the specified values, and then the value itself can be converted to a boolean type on the Tag Casting Rules tab.

Please note that the letter case does not matter, and yes, Yes, and YES will be processed the same way.
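The casting behaviour described above can be modelled in a few lines. This is a hedged Python sketch, not SmartXML's code: cast_bool is a hypothetical helper, and it assumes that yes, on, and true map to true while no, off, and false map to false.

```python
TRUE_WORDS = {"yes", "on", "true"}
FALSE_WORDS = {"no", "off", "false"}


def cast_bool(text):
    """Map a boolean word to bool, ignoring letter case."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return True
    if word in FALSE_WORDS:
        return False
    raise ValueError(f"not a recognised boolean word: {text!r}")


print(cast_bool("yes"), cast_bool("Yes"), cast_bool("YES"))  # True True True
print(cast_bool("false"))                                    # False

# "Single" is not a boolean word, so it cannot be cast directly.
try:
    cast_bool("Single")
except ValueError as error:
    print(error)  # not a recognised boolean word: 'Single'

# Once a rule has replaced "Single" with "false", the cast succeeds.
print(cast_bool("Single".replace("Single", "false")))  # False
```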

Now, after clicking the Parse button, you will obtain an intermediate representation with correctly cast types:


        name: "John Doe"
        age: 32
        height: 185
        contact_info: "phone: +31 619653239 email: mail@example.org"
        contact_phone: "+31 619653239"
        contact_email: "mail@example.org"
        salary: 245000
        salary_currency: "EUR"
        salary_in_usd: 267435.0
        is_sitizen: true
        is_marital: false
    

Now, if necessary, the contact_info line can be removed, since its contents are available in contact_phone and contact_email.

JSON representation:


      {
         "name": "John Doe",
         "age": 32,
         "height": 185,
         "contact_phone": "+31 619653239",
         "contact_email": "mail@example.org",
         "salary": 245000,
         "salary_currency": "EUR",
         "salary_in_usd": 267435.0,
         "is_sitizen": true,
         "is_marital": false
      }
      

SQL representation:


      INSERT INTO sample ("name", "age", "height", "contact_phone", "contact_email", "salary", "salary_currency", "salary_in_usd", "is_sitizen", "is_marital")
           VALUES ('John Doe', 32, 185, '+31 619653239', 'mail@example.org', 245000, 'EUR', 267435.0, true, false);
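To see why the casting step matters for these outputs, the following Python sketch serializes a few of the fields with and without casting. It is only an illustration: SmartXML's own JSON and SQL generators are not shown in this document, the record is a subset of the fields above, and the in-memory SQLite table is a stand-in for a real target database.

```python
import json
import sqlite3

# A subset of the fields from the example, after casting.
record = {"name": "John Doe", "age": 32, "salary": 245000, "is_marital": False}
print(json.dumps(record))
# {"name": "John Doe", "age": 32, "salary": 245000, "is_marital": false}

# The same fields before casting: every value is serialized as a quoted string.
uncast = {"name": "John Doe", "age": "32", "salary": "245000", "is_marital": "false"}
print(json.dumps(uncast))
# {"name": "John Doe", "age": "32", "salary": "245000", "is_marital": "false"}

# Build a parameterized INSERT so values are never pasted into the SQL text.
columns = ", ".join(f'"{name}"' for name in record)
placeholders = ", ".join("?" for _ in record)
insert_sql = f"INSERT INTO sample ({columns}) VALUES ({placeholders})"
print(insert_sql)
# INSERT INTO sample ("name", "age", "salary", "is_marital") VALUES (?, ?, ?, ?)

connection = sqlite3.connect(":memory:")
connection.execute(f"CREATE TABLE sample ({columns})")
connection.execute(insert_sql, tuple(record.values()))
row = connection.execute('SELECT "age", "salary", "is_marital" FROM sample').fetchone()
print(row)  # (32, 245000, 0); SQLite stores the boolean false as 0
```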