Type Casting in XML Documents
By default, all tag values are extracted as text. For example, a tag value of 123 is extracted as the string "123".
SmartXML provides several mechanisms for handling different types, described below.
Handling Simple Cases:
In cases where data can be unambiguously interpreted as integer, float, or bool, you can use the Rules⮞Tag Casting Rules tab to specify which tags should be cast to which types.
Handling Complex Cases:
When data needs cleaning or modification before extraction, SmartXML offers an embedded TinyNLP engine based on Red/Parse. This engine handles preprocessing tasks that are hard to express with regular expressions or in languages without built-in grammar (parsing) support.
Note that Red/Parse provides only a small subset of text-processing functions and cannot replace full-fledged NLP processing. In certain situations, however, its capabilities are sufficient for solving specific tasks.
Using TinyNLP requires writing code, but for most XML this is either unnecessary or the amount of code is minimal. Rules are expected to be tested and debugged directly in the Red (red-lang) interpreter.
Let's assume we have the following improperly filled XML:
<data>
    <name>John Doe</name>
    <age>32</age>
    <height>185.5</height>
    <salary>245000€</salary>
    <contactInfo>phone: +31 619653239 email: mail@example.org</contactInfo>
    <isCitizen>Yes</isCitizen>
    <MaritalStatus>Single</MaritalStatus>
</data>
Project name: complex-types-sample
One section, Section-A, is created in the project, and it contains one subsection named sample.
The test XML is located in the directory: data-samples/complex-types-sample/example.xml
This XML has the following issues:
- salary contains a currency symbol where there should be a number
- contactInfo contains both a phone number and an email address
- isCitizen and MaritalStatus hold text values (Yes, Single) that are really booleans
- Additionally, let's assume we want to convert the salary from euros to dollars at the current exchange rate
We will describe the data in the sample subsection in the format of an intermediate representation. We will also specify the fields we want to extract from the original data and save the result.
name: none
age: none
height: none
contact_info: none
contact_phone: none
contact_email: none
salary: none
salary_currency: none
salary_in_usd: none
is_sitizen: none
is_marital: none
Let's press Morphology and establish correspondences between the names in the intermediate representation and the tags in the XML. As a result, we get the following representation:
name: "John Doe"
age: "32"
height: "185.5"
contact_info: "phone: +31 619653239 email: mail@example.org"
contact_phone: none
contact_email: none
salary: "245000€"
salary_currency: none
salary_in_usd: none
is_sitizen: "Yes"
is_marital: "Single"
Now we need to create rules that fill in the fields whose value is still none.
Open the project file in a text editor: projects/complex-types-sample/rules/complex-extract-rules.red. In it, for each tag of each section, you can specify rules for its processing or splitting into other tags.
By default, it only contains the names of the created sections and looks like this:
section-A: []
The format for writing rules is as follows:
tag-name: [
    derivative-tagA: [
        ; processing of tag-name content; its value is in the tag-value variable
    ]
    derivative-tagB: [
        ; processing of tag-name content; its value is in the tag-value variable
    ]
    derivative-tagC: [
        ; processing of tag-name content; its value is in the tag-value variable
    ]
]
another-tag-name: [
    another-tag-name: [
        ; processing of another-tag-name content; its value is in the tag-value variable
    ]
    another-tag-derivativeA: [
        ; processing of another-tag-name content; its value is in the tag-value variable
    ]
]
Each tag to be processed must contain at least one child. If a child's name matches the parent's name, the rule processes the parent tag itself.
tag-value is a variable that holds the value of the tag being processed. It is passed implicitly, so a rule only needs to operate on tag-value. For testing, you can replace it with a literal string of sample data, but restore the name tag-value when you insert the code into the rule.
Values are processed from top to bottom: if you change tag-value at step A, step B receives the modified value.
Each block must return a value. The return can be implicit when the last function called returns the modified value (for example, replace), because the last value evaluated in a block is its return value.
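The points above can be tried in the Red console before a rule is pasted into the project file. The following is a minimal sketch, not SmartXML code: tag-value is set by hand here (SmartXML passes it implicitly), and the words cleaned and currency are illustrative names only.

```red
Red [title: "Trying rule bodies in the console"]

; Assumption: SmartXML normally supplies tag-value; here we set it by hand.
tag-value: "245000€"

; Step A: replace works on a copy, so tag-value keeps its currency symbol.
cleaned: replace copy tag-value "€" ""
print cleaned        ; 245000
print tag-value      ; 245000€

; Step B still sees the original value and can detect the currency.
currency: none
if find tag-value "€" [currency: "EUR"]
print currency       ; EUR

; Step C: replace without copy changes tag-value in place,
; so any later step would see "245000".
replace tag-value "€" ""
print tag-value      ; 245000
```

The order matters: if step C ran first, step B would no longer find the currency symbol.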
section-A: [
    salary: [
        salary: [
            ; remove the currency symbol; copy keeps tag-value itself unchanged,
            ; so the rules below can still see the symbol
            ; the result becomes the value of salary itself
            replace replace copy tag-value "$" "" "€" ""
        ]
        salary_currency: [
            ; set the value of salary_currency based on the currency symbol
            result: none
            if find tag-value "$" [result: "USD"]
            if find tag-value "€" [result: "EUR"]
            return result
        ]
        salary_in_usd: [
            ; remove currency symbols from tag-value, since only a number is needed here
            replace tag-value "$" ""
            replace tag-value "€" ""
            ; get the current exchange rates from an external service
            ; see the docs at https://fixer.io/ to get an API key
            data: load-json read http://data.fixer.io/api/latest?access_key=[your-api-key]
            ; print ["currency:" data/rates/USD] ; print if necessary
            ; multiply the salary by the USD rate
            ; and return it as the value of salary_in_usd
            return round/even (data/rates/USD * (to-integer tag-value))
        ]
    ]
    contact_info: [
        ; extract the phone number from contact_info
        contact_phone: [
            parse tag-value [thru "phone:" copy phone to "email:"]
            return phone
        ]
        ; extract the email address from contact_info
        contact_email: [
            parse tag-value [thru "email:" copy email to end]
            return email
        ]
    ]
    is_marital: [
        ; replace the text with a string that can be cast to a boolean
        is_marital: [
            replace tag-value "Single" "false"
            replace tag-value "Married" "true"
        ]
    ]
]
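The contact_info rules can be checked in the Red console in the same way. In this sketch tag-value is set by hand, and the trim calls are an assumption about how to obtain clean values when testing outside SmartXML: parse copies the surrounding spaces, and SmartXML may or may not strip them itself.

```red
Red [title: "Checking the contact_info rules"]

; Assumption: tag-value is set by hand for testing.
tag-value: "phone: +31 619653239 email: mail@example.org"

; The phone rule stops before "email:", so parse itself returns false
; (the input is not fully consumed), but phone is still captured.
parse tag-value [thru "phone:" copy phone to "email:"]
print mold phone          ; " +31 619653239 "
print mold trim phone     ; "+31 619653239"

; The email rule consumes the rest of the string.
parse tag-value [thru "email:" copy email to end]
print mold trim email     ; "mail@example.org"
```

mold is used so that the quotes make leading and trailing spaces visible.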
This example only illustrates the general principles of complex type processing and is not universal.
The same tasks can be solved with different code.
The example demonstrates calling an external service to get the exchange rate. The example is purely illustrative and is not recommended for streaming processing of a large number of XML files.
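One way to soften the cost of the external call is to fetch the rate once and reuse it. The sketch below only illustrates the idea: fetch-usd-rate is a hypothetical stand-in that returns a fixed placeholder instead of calling the service, and whether words set in one rule evaluation persist into the next inside SmartXML is an assumption that should be verified.

```red
Red [title: "Caching an exchange rate (illustration)"]

rate-cache: none     ; holds the rate after the first fetch
fetch-count: 0       ; counts how often the "service" is called

; Hypothetical stand-in for: data: load-json read <service url>
; followed by data/rates/USD. The number is a placeholder, not a real quote.
fetch-usd-rate: does [
    fetch-count: fetch-count + 1
    1.09
]

; Return the cached rate, fetching it only when the cache is empty.
cached-usd-rate: does [
    if none? rate-cache [rate-cache: fetch-usd-rate]
    rate-cache
]

foreach salary ["245000" "100000" "52000"] [
    print round/even (cached-usd-rate * to-integer salary)
]
print fetch-count    ; 1
```

Three salaries are converted, but the stand-in service is called only once.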
We provide assistance in writing rules for complex processing as part of Premium Support.
After these rules are applied, the intermediate representation looks like this:
name: "John Doe"
age: "32"
height: "185.5"
contact_info: "phone: +31 619653239 email: mail@example.org"
contact_phone: "+31 619653239"
contact_email: "mail@example.org"
salary: "245000"
salary_currency: "EUR"
salary_in_usd: 267435.0
is_sitizen: "Yes"
is_marital: "false"
Note that type casting occurred only for salary_in_usd, because its rule explicitly calls to-integer on tag-value.
Although the rule also operates on salary itself, none of those operations casts the type. You can either modify the rule or cast the type on the Rules⮞Tag Casting Rules tab.
The contents of age and height can be cast to integer without any additional parsing rules. Casting height to integer drops the fractional part (185.5 becomes 185 in the result below); cast it to float instead if the fraction must be kept.
SmartXML converts text strings such as yes, no, on, off, true, and false to a boolean type without additional rules or code. For is_marital: "Single", however, you need a rule that first replaces the string with one of these values; the result can then be converted to a boolean on the Tag Casting Rules tab.
Letter case does not matter: yes, Yes, and YES are processed the same way.
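To make the case-insensitive matching concrete, here is a small Red sketch of such a conversion. It is an illustration only and not SmartXML's implementation; to-bool is a hypothetical helper name.

```red
Red [title: "Case-insensitive boolean strings (illustration)"]

; Hypothetical helper, not part of SmartXML.
; Returns true or false for a recognized string, none otherwise.
to-bool: func [text [string!] /local key][
    key: lowercase trim copy text    ; normalize without changing the input
    case [
        find ["yes" "on" "true"] key [true]
        find ["no" "off" "false"] key [false]
        true [none]                  ; anything else is left unconverted
    ]
]

foreach sample ["yes" "Yes" "YES" "off" "Single"] [
    print [mold sample "->" mold to-bool sample]
]
```

The last sample shows why is_marital needs a rule first: "Single" is not one of the recognized strings, so it cannot be converted directly.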
Now, after clicking the Parse button, you will obtain an intermediate representation with correctly cast types:
name: "John Doe"
age: 32
height: 185
contact_info: "phone: +31 619653239 email: mail@example.org"
contact_phone: "+31 619653239"
contact_email: "mail@example.org"
salary: 245000
salary_currency: "EUR"
salary_in_usd: 267435.0
is_sitizen: true
is_marital: false
Now, if necessary, the line contact_info: "phone: +31 619653239 email: mail@example.org" can be removed.
JSON representation:
{
    "name": "John Doe",
    "age": 32,
    "height": 185,
    "contact_phone": "+31 619653239",
    "contact_email": "mail@example.org",
    "salary": 245000,
    "salary_currency": "EUR",
    "salary_in_usd": 267435.0,
    "is_sitizen": true,
    "is_marital": false
}
SQL representation:
INSERT INTO sample ("name", "age", "height", "contact_phone", "contact_email", "salary", "salary_currency", "salary_in_usd", "is_sitizen", "is_marital")
VALUES ('John Doe', 32, 185, '+31 619653239', 'mail@example.org', 245000, 'EUR', 267435.0, true, false);