Project struct

All projects must be located in the projects folder.

For installation version projects is located in user's HOME directory inside SmartXML directory.

If the Portable version is used, the projects folder must be located next to the program and have write permissions.

Comments begin with the symbol ;.

Directory structure:


   projects/
   └── sample-project/
      ├── templates/
      │   └── data-templates.red    
      ├── ignores/
      │   └── section_name.txt
      ├── rules/
      │   ├── tags-matching-rules.red
      │   ├── grow-rules.red
      │   ├── injection-rules.red
      │   ├── db-constraints-rules.red
      │   ├── tags-casting-rules.red
      │   ├── complex-extract-rules.red
      ├── config.txt
      └── job.txt

Sample XML for parsing:


<data>
   <account>
     <miscInfo>Some miscellaneous information</miscInfo>
     <ownerName>John Doe</ownerName>
     <accountId>94967295</accountId>
     <phoneNumber>+1234567890</phoneNumber>
     <email>johndoe@example.com</email>
     <preferredLanguage>English</preferredLanguage>
     <themePreference>Dark</themePreference>
     <bankInfo>
       <bankName>Bank of America</bankName>
       <branchCode>001</branchCode>
       <accountType>Personal</accountType>
     </bankInfo>
   </account>
   <transactions>
     <transaction>
         <amount>1200.00</amount>
         <currency>USD</currency>
         <transactionID>TX1003047</transactionID>
         <networkProvider>XYZ Telecom</networkProvider>
     </transaction>
     <transaction>
         <amount>2700.00</amount>
         <currency>USD</currency>
         <transactionID>TX1003048</transactionID>
         <networkProvider>XYZ Telecom</networkProvider>
     </transaction>        
   </transactions>
</data>

In order to parse XML, it is necessary to describe it in an intermediate representation and set the rules for transforming this representation.

The intermediate representation is described in the file: data-templates.red.

In the minimal variant it is obligatory to have at least one section and one sub-section.

Sections are separate types of XML documents. Subsections are subtypes.

For each section it is possible to set unique rules of XML document transformation.

For all subsections, the parsing and transformation rules from the parent section are used.

The name of the subsection is used as a table name when converting to SQL for data located in the root of XML.

Example data-templates.red containing one section and one subsection. The intermediate data representation is not described:


#[
    section_name: #[
        subsection_name: #[

        ]
    ]
]

An example of two sections (XML types). The second section has two XML subtypes.


#[
    section_name: #[
        subsection_name: #[]
    ]
    section2_name: #[
        subsection1_name: #[]
        subsection2_name: #[]
    ]
]

A section must be created for each rule in the rules folder, even if the rule is not used.

Each subtype must contain a description of the XML structure in an intermediate format. The structure of the intermediate format is described below.

All nodes to be represented as tables (or as JSON arrays) must have a parent that will be used as the table name (or array name in the case of JSON). Example:


parent_name: [ ; container that will be used as  name of table or name of json array
   some_node: [ ; node corresponding to a real node in XML

   ]
]

All values to be extracted must be described in the format: value_name: none.

The value names will be used as database column names (in case of conversion to SQL) or as JSON field names (in case of conversion to JSON).

Below is an example of a filled data-templates.red from which three tables (bank_transaction, banks, transactions) and two JSON arrays (banks and transactions) can be generated.


#[
   section_name: #[
        bank_transaction: #[ ; subsection name is used as a table name
            owner_name: none
            account_id: none
            phone_number: none
            email: none
            banks: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
              bank: [
                bank_name: none
                branch_code: none
                account_type: none
              ]
            ]
            transactions: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
                transaction: [
                    amount: none 
                    currency: none 
                    tid: none
                ]
            ]
      ]

  ]        
]

The tags-matching-rules.red describes a name relationship in an intermediate representation with a sequence of nodes (tag chains) to the target data.

If there are different spellings of the same nodes in XML, they can all be specified (as in the tid example).

All node names must be unique.

Example of filling: tags-matching-rules.red.


section_name:  #[
        owner_name: [
          "data account ownerName" 
        ]
        account_id: [
          "data account accountId"         
        ]
        phone_number: [
          "data account phoneNumber" 
        ]
        email: [
          "data account email" 
        ]
        bank_name: [
          "data account bankInfo bankName" 
        ]
        branch_code: [
          "data account bankInfo branchCode" 
        ]
        account_type: [
          "data account bankInfo accountType" 
        ]
        amount: [
          "data transactions transaction amount" 
        ] 
        currency: [
           "data transactions transaction currency" 
        ] 
        tid: [
          "data transactions transaction transactionID" 
          "data transactions transaction alternativeTransactionIdSpelling" 
        ]
]

If the XML contains nodes whose values we do not need, a file is created in the ignores directory for each section that describes the sequence of nodes to be ignored.

An example of the contents of section_name.txt:


["data" "account" "miscInfo"]
["data" "account" "preferredLanguage"]
["data" "account" "themePreference"]
["data" "transactions" "transaction" "networkProvider"]

The grow-rules.red specifies the names of nodes in the intermediate representation and their XML spelling variants. Without this rule, when nodes are encountered, the application will skip them.

Correct node completion is due to the grow-rules.red rule:


section_name: [
   transaction: ["transaction" "transactionSpellingA" "transactionSpellingB"] 
   bank: ["bankInfo"]
]

An example of JSON generated without grow-rules.red rule:


{
  "owner_name": "John Doe",
  "account_id": "94967295",
  "phone_number": "+1234567890",
  "email": "johndoe@example.com"
}

An example of SQL generated without grow-rules.red rule:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

JSON generated with grow-rules.red:


{
   "owner_name": "John Doe",
   "account_id": "94967295",
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "banks": [
      {
         "bank_name": "Bank of America",
         "branch_code": "001",
         "account_type": "Personal"
      }
   ],
   "transactions": [
      {
         "amount": "1200.00",
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "amount": "2700.00",
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

Generated SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("bank_name", "branch_code", "account_type")
  VALUES ('Bank of America', '001', 'Personal');

INSERT INTO transactions ("amount", "currency", "tid")
  VALUES ('1200.00', 'USD', 'TX1003047'),
         ('2700.00', 'USD', 'TX1003048');

In case the nesting of nodes is not specified in this way, but described based on the XML 1to1 structure, i.e. omit banks:


#[
   section_name: #[
        bank_transaction: #[
            owner_name: none
            account_id: none
            phone_number: none
            email: none
            bank: [
                bank_name: none
                branch_code: none
                account_type: none
              ]
              transactions: [
                transaction: [
                    amount: none 
                    currency: none 
                    tid: none
                ]
              ]
      ]

  ]        
]

Then when generating SQL and JSON, the contents of banks will be ignored:

JSON example:


{
   "owner_name": "John Doe",
   "account_id": "94967295",
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "transactions": [
      {
         "amount": "1200.00",
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "amount": "2700.00",
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

Example SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com'),
          ('Bank of America', '001', 'Personal');
  
INSERT INTO transactions ("amount", "currency", "tid")
  VALUES ('1200.00', 'USD', 'TX1003047'),
          ('2700.00', 'USD', 'TX1003048');

If the contents of banks do not need to be stored in a separate table (or JSON array), the banks node should be omitted:


#[
   section_name: #[
        bank_transaction: #[
            owner_name: none
            account_id: none
            phone_number: none
            email: none

            bank_name: none
            branch_code: none
            account_type: none

            transactions: [
                transaction: [
                    amount: none 
                    currency: none 
                    tid: none
                ]
              ]
      ]

  ]        
]

JSON:


   {
      "owner_name": "John Doe",
      "account_id": "94967295",
      "phone_number": "+1234567890",
      "email": "johndoe@example.com",
      "bank_name": "Bank of America",
      "branch_code": "001",
      "account_type": "Personal",
      "transactions": [
         {
            "amount": "1200.00",
            "currency": "USD",
            "tid": "TX1003047"
         },
         {
            "amount": "2700.00",
            "currency": "USD",
            "tid": "TX1003048"
         }
      ]
   }

SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email", "bank_name", "branch_code", "account_type")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com', 'Bank of America', '001', 'Personal');
  
INSERT INTO transactions ("amount", "currency", "tid")
  VALUES ('1200.00', 'USD', 'TX1003047'),
         ('2700.00', 'USD', 'TX1003048');

Let's restore the original data-templates.red structure:


   #[
      section_name: #[
           bank_transaction: #[ ; subsection name is used as a table name
               owner_name: none
               account_id: none
               phone_number: none
               email: none
               banks: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
                 bank: [
                   bank_name: none
                   branch_code: none
                   account_type: none
                 ]
               ]
               transactions: [ ; parent of a node that has a direct descendant of another node is used as the name of a table (or JSON array).
                   transaction: [
                       amount: none 
                       currency: none 
                       tid: none
                   ]
               ]
         ]
   
     ]        
   ]

injection-rules.red allows you to pass values from parents to descendants:


section_name: [
  enumerate-nodes: [] 
  inject-tag-to-every-children: [account_id] 
  injection-tag-and-recipients: []
]

When this rule is used, account_id will be passed to all descendants a level below.

JSON:


   {
      "owner_name": "John Doe",
      "account_id": "94967295",
      "phone_number": "+1234567890",
      "email": "johndoe@example.com",
      "bank_name": "Bank of America",
      "branch_code": "001",
      "account_type": "Personal",
      "transactions": [
         {
            "account_id": "94967295",
            "amount": "1200.00",
            "currency": "USD",
            "tid": "TX1003047"
         },
         {
            "account_id": "94967295",
            "amount": "2700.00",
            "currency": "USD",
            "tid": "TX1003048"
         }
      ]
   }

SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("account_id", "bank_name", "branch_code", "account_type")
  VALUES ('94967295', 'Bank of America', '001', 'Personal');

INSERT INTO transactions ("account_id", "amount", "currency", "tid")
  VALUES ('94967295', '1200.00', 'USD', 'TX1003047'),
         ('94967295', '2700.00', 'USD', 'TX1003048');

It is possible to pass tags only to specific descendants.

injection-rules.red:


section_name: [
  enumerate-nodes: [] 
  inject-tag-to-every-children: [] 
  injection-tag-and-recipients: [
     account_id: [transaction] 
     owner_name: [bank transaction]
  ]
]

In this case, account_id will only be passed to transactions and owner_name to the banks and transactions node.

JSON:


{
   "owner_name": "John Doe",
   "account_id": "94967295",
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "banks": [
      {
         "owner_name": "John Doe",
         "bank_name": "Bank of America",
         "branch_code": "001",
         "account_type": "Personal"
      }
   ],
   "transactions": [
      {
         "owner_name": "John Doe",
         "account_id": "94967295",
         "amount": "1200.00",
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "owner_name": "John Doe",
         "account_id": "94967295",
         "amount": "2700.00",
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("owner_name", "bank_name", "branch_code", "account_type")
  VALUES ('John Doe', 'Bank of America', '001', 'Personal');

INSERT INTO transactions ("owner_name", "account_id", "amount", "currency", "tid")
  VALUES ('John Doe', '94967295', '1200.00', 'USD', 'TX1003047'),
         ('John Doe', '94967295', '2700.00', 'USD', 'TX1003048');

Note that it is the name of the end nodes in the intermediate representation that must be specified. So not transactions but transaction, not banks but banks.

We also have the possibility to number multiple nodes by inserting the value: item_number.

The rule will look like this:


section_name: [
  enumerate-nodes: [transaction] 
  inject-tag-to-every-children: [] 
  injection-tag-and-recipients: []
]

Result:


{
   "owner_name": "John Doe",
   "account_id": "94967295",
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "banks": [
      {
         "bank_name": "Bank of America",
         "branch_code": "001",
         "account_type": "Personal"
      }
   ],
   "transactions": [
      {
         "item_number": 1,
         "amount": "1200.00",
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "item_number": 2,
         "amount": "2700.00",
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("bank_name", "branch_code", "account_type")
  VALUES ('Bank of America', '001', 'Personal');

INSERT INTO transactions ("item_number", "amount", "currency", "tid")
  VALUES (1, '1200.00', 'USD', 'TX1003047'),
         (2, '2700.00', 'USD', 'TX1003048');

If it is required to consider uniqueness restrictions on database columns, the db-constraints-rules.red rule is used:


section_name: [
    transactions: [account_id] 
    banks: [bank branch_code]
]

JSON is not affected by it:


{
   "owner_name": "John Doe",
   "account_id": "94967295",
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "banks": [
      {
         "bank_name": "Bank of America",
         "branch_code": "001",
         "account_type": "Personal"
      }
   ],
   "transactions": [
      {
         "amount": "1200.00",
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "amount": "2700.00",
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

However, SQL generation generates a query subject to uniqueness constraints:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
 VALUES ('John Doe', '94967295', '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("bank_name", "branch_code", "account_type")
 VALUES ('Bank of America', '001', 'Personal')
 ON CONFLICT ("bank", "branch_code")
 DO UPDATE SET "bank_name" = EXCLUDED."bank_name", "branch_code" = EXCLUDED."branch_code", "account_type" = EXCLUDED."account_type";

INSERT INTO transactions ("amount", "currency", "tid")
 VALUES ('1200.00', 'USD', 'TX1003047'),('2700.00', 'USD', 'TX1003048')
 ON CONFLICT ("account_id")
 DO UPDATE SET "amount" = EXCLUDED."amount", "currency" = EXCLUDED."currency", "tid" = EXCLUDED."tid";

By default, all values from XML are extracted as text values.

Type conversion rules are defined in tags-casting-rules.red.


section_name: [
    integer-list: [[to-integer to-float] [account_id amount]] 
    float-list: [[to-float] []] 
    bool-list: [[reduce to-word] []]
]

The rules of type conversion are discussed in more detail in the article type-conversion-in-xml-documents.

JSON:


{
   "owner_name": "John Doe",
   "account_id": 94967295,
   "phone_number": "+1234567890",
   "email": "johndoe@example.com",
   "banks": [
      {
         "bank_name": "Bank of America",
         "branch_code": "001",
         "account_type": "Personal"
      }
   ],
   "transactions": [
      {
         "amount": 1200,
         "currency": "USD",
         "tid": "TX1003047"
      },
      {
         "amount": 2700,
         "currency": "USD",
         "tid": "TX1003048"
      }
   ]
}

SQL:


INSERT INTO bank_transaction ("owner_name", "account_id", "phone_number", "email")
  VALUES ('John Doe', 94967295, '+1234567890', 'johndoe@example.com');

INSERT INTO banks ("bank_name", "branch_code", "account_type")
  VALUES ('Bank of America', '001', 'Personal');

INSERT INTO transactions ("amount", "currency", "tid")
  VALUES (1200, 'USD', 'TX1003047'),
         (2700, 'USD', 'TX1003048');

The rules of complex-extract-rules.red are described in article type-conversion-in-xml-documents

config.txt structure:


; allowed combinations: true false "..." none

root-xml-folder: "" ; if xml_path in DB is relative
xml-filling-stat: true ; table: filling_percent_stat should exists
ignore-namespaces: true
ignore-tag-attributes: true
use-same-morphology-for-same-file-name-pattern: false
skip-schema-version-tag: true
use-same-morphology-for-all-files-in-folder: false
delete-data-before-insert: none
connect-to-db-at-project-opening: false
source-database: "PostgreSQL" ; available values: PostgreSQL/SQLite
target-database: "PostgreSQL" ; available values: PostgreSQL/SQLite/NoSQL

bot-chatID: ""
bot-token: ""
telegram-notifications: none

db-driver: ""
db-server: "127.0.0.1"
db-port: "5432"
db-name: "test"
db-user: "postgres"
db-pass: ""

sqlite-driver-name: none
sqlite-db-path: none

nosql-url: ""
append-subsection_name-to-nosql-url: true
nosql-login: "root"
nosql-pass: ""

job.txt is necessary if references to XML are located in the database


job-number: 1 ; default where condition block number
xml-table-name: "xml_files" ; name of table where full path to XML is store
limit: 1000 
order_by: "id DESC" 
select-fields: [
    id xml_path ; columns id and xml_path is required in xml_files table!
] 
where-condition: [
   [
    job_numnber: [1 2 3]
    doc_date: [2024-01-01 2030-01-01] ; date range
    region: ["USA"]
    doc_type: "type_a" ; "type_a" "type_b" "type_c"
    parsing_status: "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
    action_type: "data_insert" ; or data_update if doc only patch table
   ]          

   [
    job_numnber: [5 6 7]
    doc_date: [2024-01-01 2030-01-01] ; date range
    region: ["Europe" "Australia"]
    doc_type: "type_a" ; "type_a" "type_b" "type_c"
    parsing_status: "NULL" ; "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
    some_other_column: "NULL" ; "NULL" ; "NULL" "NOT NULL" "someWord" or logical: true\false
    action_type: "data_insert" ; or data_update if doc only patch table
   ]  
]

The structure of the assignments is described in more detail in the article jobs-settings