All projects must be located in the projects folder.
For installation version projects is located in user's HOME directory inside SmartXML directory.
If the Portable version is used, the projects folder must be located next to the program and have write permissions.
Comments begin with the symbol ;.
Directory structure:
projects/
└── sample-project/
├── templates/
│ └── data-templates.red
├── ignores/
│ └── section_name.txt
├── rules/
│ ├── tags-matching-rules.red
│ ├── grow-rules.red
│ ├── injection-rules.red
│ ├── db-constraints-rules.red
│ ├── tags-casting-rules.red
│ ├── complex-extract-rules.red
├── config.txt
└── job.txt
Sample XML for parsing:
<data>
<account>
<miscInfo>Some miscellaneous information</miscInfo>
<ownerName>John Doe</ownerName>
<accountId>94967295</accountId>
<phoneNumber>+1234567890</phoneNumber>
<email>johndoe@example.com</email>
<preferredLanguage>English</preferredLanguage>
<themePreference>Dark</themePreference>
<bankInfo>
<bankName>Bank of America</bankName>
<branchCode>001</branchCode>
<accountType>Personal</accountType>
</bankInfo>
</account>
<transactions>
<transaction>
<amount>1200.00</amount>
<currency>USD</currency>
<transactionID>TX1003047</transactionID>
<networkProvider>XYZ Telecom</networkProvider>
</transaction>
<transaction>
<amount>2700.00</amount>
<currency>USD</currency>
<transactionID>TX1003048</transactionID>
<networkProvider>XYZ Telecom</networkProvider>
</transaction>
</transactions>
</data>
In order to parse XML, it is necessary to describe it in an intermediate representation and set the rules for transforming this representation.
The intermediate representation is described in the file: data-templates.red.
In the minimal variant it is obligatory to have at least one section and one sub-section.
Sections are separate types of XML documents. Subsections are subtypes.
For each section it is possible to set unique rules of XML document transformation.
For all subsections, the parsing and transformation rules from the parent section are used.
The name of the subsection is used as a table name when converting to SQL for data located in the root of XML.
Example data-templates.red containing one section and one subsection. The intermediate data representation is not described:
#[
section_name: #[
subsection_name: #[
]
]
]
An example of two sections (XML types). The second section has two XML subtypes.
#[
section_name: #[
subsection_name: #[]
]
section2_name: #[
subsection1_name: #[]
subsection2_name: #[]
]
]
A section must be created for each rule in the rules folder, even if the rule is not used.
Each subtype must contain a description of the XML structure in an intermediate format. The structure of the intermediate format is described below.
All nodes to be represented as tables (or as JSON arrays) must have a parent that will be used as the table name (or array name in the case of JSON). Example:
parent_name: [ ; container that will be used as name of table or name of json array
some_node: [ ; node corresponding to a real node in XML
]
]
All values to be extracted must be described in the format: value_name: none
.
The value names will be used as database column names (in case of conversion to SQL) or as JSON field names (in case of conversion to JSON).
Below is an example of a filled data-templates.red from which three tables (bank_transaction, banks, transactions) and two JSON arrays (banks and transactions) can be generated.
#[
section_name: #[
bank_transaction: #[ ; subsection name is used as a table name
owner_name: none
account_id: none
phone_number: none
email: none
banks: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
bank: [
bank_name: none
branch_code: none
account_type: none
]
]
transactions: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
transaction: [
amount: none
currency: none
tid: none
]
]
]
]
]
The tags-matching-rules.red describes a name relationship in an intermediate representation with a sequence of nodes (tag chains) to the target data.
If there are different spellings of the same nodes in XML, they can all be specified (as in the tid example).
All node names must be unique.
Example of filling: tags-matching-rules.red.
section_name: #[
owner_name: [
"data account ownerName"
]
account_id: [
"data account accountId"
]
phone_number: [
"data account phoneNumber"
]
email: [
"data account email"
]
bank_name: [
"data account bankInfo bankName"
]
branch_code: [
"data account bankInfo branchCode"
]
account_type: [
"data account bankInfo accountType"
]
amount: [
"data transactions transaction amount"
]
currency: [
"data transactions transaction currency"
]
tid: [
"data transactions transaction transactionID"
"data transactions transaction alternativeTransactionIdSpelling"
]
]
If the XML contains nodes whose values we do not need, a file is created in the ignores directory for each section that describes the sequence of nodes to be ignored.
An example of the contents of section_name.txt:
["data" "account" "miscInfo"]
["data" "account" "preferredLanguage"]
["data" "account" "themePreference"]
["data" "transactions" "transaction" "networkProvider"]
The grow-rules.red specifies the names of nodes in the intermediate representation and their XML spelling variants. Without this rule, when nodes are encountered, the application will skip them.
Correct node completion is due to the grow-rules.red rule:
section_name: [
transaction: ["transaction" "transactionSpellingA" "transactionSpellingB"]
bank: ["bankInfo"]
]
An example of JSON generated without grow-rules.red rule:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com"
}
An example of SQL generated without grow-rules.red rule:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
JSON generated with grow-rules.red:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"banks": [
{
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal"
}
],
"transactions": [
{
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
Generated SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("bank_name", "branch_code", "account_type")
VALUES ('Bank of America', '001', 'Personal');
INSERT INTO transactions ("amount", "currency", "tid")
VALUES ('1200.00', 'USD', 'TX1003047'),
('2700.00', 'USD', 'TX1003048');
In case the nesting of nodes is not specified in this way, but described based on the XML 1to1 structure, i.e. omit banks:
#[
section_name: #[
bank_transaction: #[
owner_name: none
account_id: none
phone_number: none
email: none
bank: [
bank_name: none
branch_code: none
account_type: none
]
transactions: [
transaction: [
amount: none
currency: none
tid: none
]
]
]
]
]
Then when generating SQL and JSON, the contents of banks will be ignored:
JSON example:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"transactions": [
{
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
Example SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com'),
('Bank of America', '001', 'Personal');
INSERT INTO transactions ("amount", "currency", "tid")
VALUES ('1200.00', 'USD', 'TX1003047'),
('2700.00', 'USD', 'TX1003048');
If the contents of banks do not need to be stored in a separate table (or JSON array), the banks node should be omitted:
#[
section_name: #[
bank_transaction: #[
owner_name: none
account_id: none
phone_number: none
email: none
bank_name: none
branch_code: none
account_type: none
transactions: [
transaction: [
amount: none
currency: none
tid: none
]
]
]
]
]
JSON:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal",
"transactions": [
{
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email", "bank_name", "branch_code", "account_type")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com', 'Bank of America', '001', 'Personal');
INSERT INTO transactions ("amount", "currency", "tid")
VALUES ('1200.00', 'USD', 'TX1003047'),
('2700.00', 'USD', 'TX1003048');
Let's restore the original data-templates.red structure:
#[
section_name: #[
bank_transaction: #[ ; subsection name is used as a table name
owner_name: none
account_id: none
phone_number: none
email: none
banks: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
bank: [
bank_name: none
branch_code: none
account_type: none
]
]
transactions: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
transaction: [
amount: none
currency: none
tid: none
]
]
]
]
]
injection-rules.red allows you to pass values from parents to descendants:
section_name: [
enumerate-nodes: []
inject-tag-to-every-children: [account_id]
injection-tag-and-recipients: []
]
When this rule is used, account_id will be passed to all descendants a level below.
JSON:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal",
"transactions": [
{
"account_id": "94967295",
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"account_id": "94967295",
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("account_id", "bank_name", "branch_code", "account_type")
VALUES ('94967295', 'Bank of America', '001', 'Personal');
INSERT INTO transactions ("account_id", "amount", "currency", "tid")
VALUES ('94967295', '1200.00', 'USD', 'TX1003047'),
('94967295', '2700.00', 'USD', 'TX1003048');
It is possible to pass tags only to specific descendants.
injection-rules.red:
section_name: [
enumerate-nodes: []
inject-tag-to-every-children: []
injection-tag-and-recipients: [
account_id: [transaction]
owner_name: [bank transaction]
]
]
In this case, account_id will only be passed to transactions and owner_name to the banks and transactions node.
JSON:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"banks": [
{
"owner_name": "John Doe",
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal"
}
],
"transactions": [
{
"owner_name": "John Doe",
"account_id": "94967295",
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"owner_name": "John Doe",
"account_id": "94967295",
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("owner_name", "bank_name", "branch_code", "account_type")
VALUES ('John Doe', 'Bank of America', '001', 'Personal');
INSERT INTO transactions ("owner_name", "account_id", "amount", "currency", "tid")
VALUES ('John Doe', '94967295', '1200.00', 'USD', 'TX1003047'),
('John Doe', '94967295', '2700.00', 'USD', 'TX1003048');
Note that it is the name of the end nodes in the intermediate representation that must be specified. So not transactions but transaction, not banks but banks.
We also have the possibility to number multiple nodes by inserting the value: item_number.
The rule will look like this:
section_name: [
enumerate-nodes: [transaction]
inject-tag-to-every-children: []
injection-tag-and-recipients: []
]
Result:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"banks": [
{
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal"
}
],
"transactions": [
{
"item_number": 1,
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"item_number": 2,
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("bank_name", "branch_code", "account_type")
VALUES ('Bank of America', '001', 'Personal');
INSERT INTO transactions ("item_number", "amount", "currency", "tid")
VALUES (1, '1200.00', 'USD', 'TX1003047'),
(2, '2700.00', 'USD', 'TX1003048');
If it is required to consider uniqueness restrictions on database columns, the db-constraints-rules.red rule is used:
section_name: [
transactions: [account_id]
banks: [bank branch_code]
]
JSON is not affected by it:
{
"owner_name": "John Doe",
"account_id": "94967295",
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"banks": [
{
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal"
}
],
"transactions": [
{
"amount": "1200.00",
"currency": "USD",
"tid": "TX1003047"
},
{
"amount": "2700.00",
"currency": "USD",
"tid": "TX1003048"
}
]
}
However, SQL generation generates a query subject to uniqueness constraints:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("bank_name", "branch_code", "account_type")
VALUES ('Bank of America', '001', 'Personal')
ON CONFLICT ("bank", "branch_code")
DO UPDATE SET "bank_name" = EXCLUDED."bank_name", "branch_code" = EXCLUDED."branch_code", "account_type" = EXCLUDED."account_type";
INSERT INTO transactions ("amount", "currency", "tid")
VALUES ('1200.00', 'USD', 'TX1003047'),('2700.00', 'USD', 'TX1003048')
ON CONFLICT ("account_id")
DO UPDATE SET "amount" = EXCLUDED."amount", "currency" = EXCLUDED."currency", "tid" = EXCLUDED."tid";
By default, all values from XML are extracted as text values.
Type conversion rules are defined in tags-casting-rules.red.
section_name: [
integer-list: [[to-integer to-float] [account_id amount]]
float-list: [[to-float] []]
bool-list: [[reduce to-word] []]
]
The rules of type conversion are discussed in more detail in the article type-conversion-in-xml-documents.
JSON:
{
"owner_name": "John Doe",
"account_id": 94967295,
"phone_number": "+1234567890",
"email": "johndoe@example.com",
"banks": [
{
"bank_name": "Bank of America",
"branch_code": "001",
"account_type": "Personal"
}
],
"transactions": [
{
"amount": 1200,
"currency": "USD",
"tid": "TX1003047"
},
{
"amount": 2700,
"currency": "USD",
"tid": "TX1003048"
}
]
}
SQL:
INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
VALUES ('John Doe', 94967295, '+1234567890', 'johndoe@example.com');
INSERT INTO banks ("bank_name", "branch_code", "account_type")
VALUES ('Bank of America', '001', 'Personal');
INSERT INTO transactions ("amount", "currency", "tid")
VALUES (1200, 'USD', 'TX1003047'),
(2700, 'USD', 'TX1003048');
The rules of complex-extract-rules.red are described in article type-conversion-in-xml-documents
config.txt structure:
; allowed combinations: true false "..." none
root-xml-folder: "" ; if xml_path in DB is relative
xml-filling-stat: true ; table: filling_percent_stat should exists
ignore-namespaces: true
ignore-tag-attributes: true
use-same-morphology-for-same-file-name-pattern: false
skip-schema-version-tag: true
use-same-morphology-for-all-files-in-folder: false
delete-data-before-insert: none
connect-to-db-at-project-opening: false
source-database: "PostgreSQL" ; available values: PostgreSQL/SQLite
target-database: "PostgreSQL" ; available values: PostgreSQL/SQLite/NoSQL
bot-chatID: ""
bot-token: ""
telegram-notifications: none
db-driver: ""
db-server: "127.0.0.1"
db-port: "5432"
db-name: "test"
db-user: "postgres"
db-pass: ""
sqlite-driver-name: none
sqlite-db-path: none
nosql-url: ""
append-subsection_name-to-nosql-url: true
no-sql-login: "root"
no-sql-pass: ""
job.txt is necessary if references to XML are located in the database
job-number: 1 ; default where condition block number
xml-table-name: "xml_files" ; name of table where full path to XML is store
limit: 1000
order_by: "id DESC"
select-fields: [
id xml_path ; columns id and xml_path is required in xml_files table!
]
where-condition: [
[
job_numnber: [1 2 3]
doc_date: [2024-01-01 2030-01-01] ; date range
region: ["USA"]
doc_type: "type_a" ; "type_a" "type_b" "type_c"
parsing_status: "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
action_type: "data_insert" ; or data_update if doc only patch table
]
[
job_numnber: [5 6 7]
doc_date: [2024-01-01 2030-01-01] ; date range
region: ["Europe" "Australia"]
doc_type: "type_a" ; "type_a" "type_b" "type_c"
parsing_status: "NULL" ; "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
some_other_column: "NULL" ; "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
action_type: "data_insert" ; or data_update if doc only patch table
]
]
The structure of the assignments is described in more detail in the article jobs-settings