LibXML

Introduction

The LibXML library is used to read the contents of XML files and strings extracted from websites via the LibSoup library.

I used the libxml examples that were supplied with the Seed package. For reference, the XML file used, which is called sample.xml, is as follows:

<?xml version="1.0"?>
<story attribute="test">
  <storyinfo>
    <author>John Fleck</author>
    <datewritten>June 2, 2002</datewritten>
    <keyword example="yes">example keyword</keyword>
  </storyinfo>
  <body>
    <headline>This is the headline</headline>
    <para>This is the body text.</para>
  </body>
</story>

Rather than jumping straight into a code example to explain how libxml works, I would first tinker with it in the Seed shell. In the Seed shell, enter:

xml = imports.libxml;

This imports the libxml library.

doc = xml.parseFile("sample.xml");

This loads the content of the sample.xml file into the XMLDocument object called doc. You can browse the functions you can use with this object using the following JavaScript command:

for (a in doc) print (a);

Upon doing this, you get the following results:

next
doc
root
prev
name
children
last
type
parent
xpathNewContext

The above are all members (properties and functions) of the doc object. They can be used as follows, for example:

doc.root
doc.name
doc.type

Now that the contents of the XML file are held in the object variable doc, you can access the information within it in a number of ways. The first way is to go through its tree structure.

TREE STRUCTURE

All XML files contain elements. These elements are used to store and identify data, and a simple element usually fits on one line. To use the above example, the following shows the headline element:

<headline>This is the headline</headline>

All elements are split into 3 parts: first the start-tag, second the content, and lastly the end-tag, which is identified by having a / prefix.

The start-tag of headline is as follows:

<headline>

The content of headline is as follows:

This is the headline

The end-tag of headline:

</headline>

When accessing XML through the tree structure, the parts of the document are referred to as nodes. The whole element is a node, and the text content inside it is a separate node that is a child of the element. The start and end tags themselves are not nodes; they only mark where the element begins and ends. It's a bit confusing at first but makes sense once you get used to it.

To access these nodes you need to get the root node of the XML tree, which is the first element. A valid XML file needs to have a root node (a start and end tag enclosing the rest of the content of the file). In sample.xml, the root node, with its start and end tags, is as follows:

<story attribute="test"> </story>

The story root element has an attribute called attribute which has a value of test. To access this value the following command can be used:

value = root.getAttribute("attribute");

Please note that attributes can only be accessed via the node they belong to. To get access to attributes in other elements, you will have to find the element first.

The root node is the first node of the tree, and the nodes directly below it are referred to as its children. The node currently being accessed (here the root) is called the parent node. Nodes further down have children of their own unless they are at the end of a hierarchical level; in that situation, the node's children value is null.

The first child of the root node is a text node holding the whitespace (the line break and indentation) that comes before <storyinfo>. The next node on that level is the storyinfo element. Note that storyinfo is one level below root; the indentation in the XML file shows these levels. The next element on the same level as storyinfo is <body>, again with a whitespace text node in between.

Depending on what is required, you can either move to the next element on the same level (as explained, <body>), or go down a level to the children of <storyinfo>, in which case <storyinfo> becomes the parent.

If the children of storyinfo are accessed, the first child is again a text node holding only whitespace, followed by the author element and so on, until there is no next node and null is returned. The same applies one level down: the only child of author is its text node, which has no children of its own. Once a level is exhausted, the next child of the parent is accessed, then its children, and so on.
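
The parent, children and next relationships described above can be modelled with plain JavaScript objects. This is only a sketch that runs without libxml: the node() and link() helpers and the hand-built tree are made up for illustration, and only the property names (type, name, content, parent, children, next, prev) mirror the ones listed for the Seed objects.

```javascript
// Hypothetical helper: build a plain object shaped like a node.
function node(type, name, content) {
  return { type: type, name: name, content: content,
           parent: null, children: null, next: null, prev: null };
}

// Hypothetical helper: attach kids to parent and chain them as siblings.
// As in the programs in this article, "children" points at the first child only.
function link(parent, kids) {
  for (var i = 0; i < kids.length; i++) {
    kids[i].parent = parent;
    kids[i].prev = i > 0 ? kids[i - 1] : null;
    kids[i].next = i + 1 < kids.length ? kids[i + 1] : null;
  }
  parent.children = kids.length > 0 ? kids[0] : null;
  return parent;
}

// Hand-built copy of the top level of sample.xml.
var story = link(node("element", "story", null), [
  node("text", "text", "\n  "),
  node("element", "storyinfo", null),
  node("text", "text", "\n  "),
  node("element", "body", null),
  node("text", "text", "\n")
]);

// Walk one level: start at the first child and follow next until null.
function namesOnLevel(parent) {
  var names = [];
  for (var child = parent.children; child; child = child.next)
    names.push(child.name);
  return names;
}

var level = namesOnLevel(story);
// level is ["text", "storyinfo", "text", "body", "text"]
```

The loop in namesOnLevel has the same shape as the while loop used in the Seed programs in this article: it stops as soon as next yields null.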

Further on, there is a Seed JavaScript program which shows the element hierarchy in a more visual, and hopefully more understandable, way, although the diagrams in other XML tutorials give a better visual understanding.

I’ve amended the JavaScript program from the examples package of Seed to show this.

#!/usr/bin/env seed
xml = imports.libxml;

var global_counter = 0;

// Prints every element below parent, indented by num_spaces.
function print_element (parent, num_spaces){
  // children holds the first child; the rest are reached through next
  var child = parent.children;

  while (child)
  {
    // Only display elements; text nodes are skipped
    if (child.type == "element")
    {
      var disp_text = "";
      var filler = "";

      global_counter ++;
      for (var x = 0; x <= num_spaces; x++)
        filler = filler + " ";
      disp_text = filler + "Element " + global_counter + ": name:" + child.name + " type:" + child.type + " content:(" + child.content + ")";
      print(disp_text);
      print(filler + "--------------------------------------------------");
      // Recurse into the children of this element, one level deeper
      print_element (child, num_spaces + 1);
    }
    // Move on to the next node on the same level
    child = child.next;
  }
}

doc = xml.parseFile("sample.xml");
root = doc.root;
print("root= type:" + root.type + ' name:' + root.name + ':' + root.content);
print("==================");
print_element (root,0);

The code is a bit hard to follow as it uses recursion: the function calls itself repeatedly before it finishes, so several calls of the function are in progress at the same time, one for each level of the tree. Each call can only finish when a condition is met, which in this case is when a null child is encountered. This is shown in the following code from the above JavaScript example:

while (child){
…
}

The code is a while loop that runs while child is not null.
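
Recursion of this kind can be tried out without an XML file. The sketch below uses a hand-written nested structure (the tree variable, the kids property and the outline function are made up for this illustration and are not part of libxml) and a function that calls itself once per child, stopping when a node has no children left.

```javascript
// Made-up nested data standing in for the elements of sample.xml.
var tree = {
  name: "story",
  kids: [
    { name: "storyinfo", kids: [
      { name: "author", kids: [] },
      { name: "datewritten", kids: [] },
      { name: "keyword", kids: [] }
    ] },
    { name: "body", kids: [
      { name: "headline", kids: [] },
      { name: "para", kids: [] }
    ] }
  ]
};

// Returns one line per node, indented by its depth in the tree.
function outline(node, depth, lines) {
  var filler = "";
  for (var i = 0; i < depth; i++)
    filler = filler + "  ";
  lines.push(filler + node.name);

  // The recursive step: the function calls itself for every child.
  // A node with an empty kids list makes no further calls, which ends that branch.
  for (var k = 0; k < node.kids.length; k++)
    outline(node.kids[k], depth + 1, lines);

  return lines;
}

var lines = outline(tree, 0, []);
// lines[0] is "story", lines[1] is "  storyinfo", lines[2] is "    author"
```

Each call receives its own depth value, which is why the indentation returns to the right level once a branch has been finished.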

Before the nodes can be accessed, the contents of the XML file need to be read in. Firstly, the libxml library needs to be imported. In the example it is imported into the xml object.

xml = imports.libxml;

The contents of the XML file sample.xml are loaded into an XMLDocument object called doc:

doc = xml.parseFile("sample.xml");

To get the root node, the command below is used, which loads it into an XMLNode object called root.

root = doc.root;

Earlier, the members of the XMLDocument object were listed. The same can be done with the XMLNode object used for the root node. If you ran the above commands in the Seed shell, you can see them by typing the following command:

for (a in root) print (a);

The following appears:

next
doc
prev
name
children
content
last
properties
type
parent
getElementsByTagName
getAttribute

The following JavaScript code shows how it works in practice:

print("root= type:" + root.type + ' name:' + root.name + ':' + root.content);

The properties used, with a short description of each, are as follows:

root.type = The type of node. When run, it is shown as element.

root.name = The name of the node as defined in the XML file. When run, it shows as story. For a node that is not an element, such as a text node, it is shown as text.

root.content = The contents of the element, between its start and end tags.

The next line is the first call to the print_element function and passes the root node to it.

Inside the print_element function, the first child of the passed parent node (which in this instance is root) is stored in the XMLNode object child by:

var child = parent.children;

The while loop, which is the 'main engine' of the recursion, is then entered. Inside it, there is an if statement:

if (child.type == "element")

This checks whether the current node (child) is an element. If it is an element, it is displayed and the function is called recursively on it to process its children, until a null child node is reached.

Whether or not the node is an element, the next node on the same level as child is then set by the following:

child = child.next;

Please note that the previous node can be accessed via child.prev;

The program also includes functionality to display white space in front of each line of output, to give an approximate hierarchical level of each node.

As mentioned earlier, the tree contains text nodes as well as elements, and these are traversed in order to get to the next element. In real life situations, you would often have no further use for them apart from traversing to the next element. For reference, I've amended the program to show their contents:

#!/usr/bin/env seed
xml = imports.libxml;

var global_counter = 0;

// Prints every node (elements and text) below element, indented by num_spaces.
function print_element (element, num_spaces){
  // children holds the first child; the rest are reached through next
  var child = element.children;

  while (child){
    var disp_text = "";
    var filler = "";

    global_counter ++;
    for (var x = 0; x <= num_spaces; x++)
      filler = filler + " ";
    disp_text = filler + "Node " + global_counter + ": name:" + child.name + " type:" + child.type + " content:(" + child.content + ")";
    print(disp_text);
    print(filler + "--------------------------------------------------");
    // Recurse into the children of this node, one level deeper
    print_element (child, num_spaces + 1);
    // Move on to the next node on the same level
    child = child.next;
  }
}

doc = xml.parseFile("sample.xml");
root = doc.root;
print("root= type:" + root.type + ' name:' + root.name + ':' + root.content);
print("==================");
print_element (root,0);

When run, you will see more nodes being displayed. Take the following lines of output as an example:

Node 4: name:author type:element content:(John Fleck)
--------------------------------------------------
Node 5: name:text type:text content:(John Fleck)
--------------------------------------------------
Node 6: name:text type:text content:(
)

Node 4 is the element <author>. Node 5 is a child node of the element <author>: it holds the contents of the element, is of type text, and its content is the same as that of its element. As text nodes have no name of their own, it has the name text. Node 6 is the text node that follows the end tag of <author>. Its content looks empty but is the line break and indentation that come before the next element.

The same pattern holds for all elements of the XML file. Please note that the end tags themselves never appear in the output as nodes, only the whitespace text that follows them. The program could be amended to show more detail but, as it is for illustrative purposes, there is little point.
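
If the whitespace-only text nodes get in the way, they can be filtered out while looping. The isBlankText function below is a hypothetical helper, not something provided by libxml or Seed, and it assumes nodes expose type and content as in the output above. The sample objects are written by hand so the sketch runs on its own.

```javascript
// Hypothetical helper: true for text nodes holding only spaces, tabs or line breaks.
function isBlankText(node) {
  return node.type == "text" && /^\s*$/.test(node.content);
}

// Hand-written stand-ins for Nodes 4 to 6 in the output above.
var nodes = [
  { name: "author", type: "element", content: "John Fleck" },
  { name: "text",   type: "text",    content: "John Fleck" },
  { name: "text",   type: "text",    content: "\n    " }
];

var kept = [];
for (var i = 0; i < nodes.length; i++) {
  if (!isBlankText(nodes[i]))
    kept.push(nodes[i].name + ":" + nodes[i].content);
}
// kept is ["author:John Fleck", "text:John Fleck"]
```

In the traversal program, the same test could be placed at the top of the while loop so that blank text nodes are stepped over rather than printed.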

DOM

Another way of accessing elements is through the DOM (Document Object Model). The function available in Seed to do this is getElementsByTagName.

To use the JavaScript program from the examples package of Seed:

#!/usr/bin/env seed
xml = imports.libxml;

doc = xml.parseFile("sample.xml");

root = doc.root;
storyinfos = root.getElementsByTagName("storyinfo");
element_list_length = storyinfos.length;
for (var i = 0; i < element_list_length; i++){
  var info = storyinfos[i];
  var keyword = info.getElementsByTagName("keyword")[0];

  Seed.printf ("Story info Keyword (example: %s): %s",
    keyword.getAttribute("example"),
    keyword.content);
}

The first part simply gets the root node of the XML file sample.xml:

xml = imports.libxml;
doc = xml.parseFile("sample.xml");
root = doc.root;

The following line gets a list of all elements that have the tag name specified. Please note that only the children of root are searched (the same applies when the function is called on any other parent). Using the sample.xml file, the only elements that can be found from root are storyinfo and body.

storyinfos = root.getElementsByTagName("storyinfo");

The results are put into a list of XMLNode objects called storyinfos. This contains all the elements with the name storyinfo. As there is only one such element, the size of the list is 1; if there were more, it would be larger. The next line of code gets the number of nodes in the list:

element_list_length = storyinfos.length;

Next, a for loop is started to go through the list of nodes. We know from our example that there is only one element, but in real life situations the for loop would be needed as we don't know how many elements there are:

for (var i = 0; i < element_list_length; i++){

Next, the current node in the storyinfos list is assigned to the XMLNode object info.

var info = storyinfos[i];

The next line searches the children of the current node in info for an element called keyword. In the code below, for illustrative purposes, we know there is only one such element, so we extract a single node (by putting [0] at the end) rather than a list of XMLNode objects as above. In real life this would not be used, as we wouldn't know whether there is more than one.

var keyword = info.getElementsByTagName("keyword")[0];
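
A more defensive version would check the length of the list before indexing it and loop over every match. The sketch below runs without libxml: fakeKeyword builds hand-written stand-ins for the nodes, and describeKeywords is a made-up helper. Only length, [i], getAttribute and content are taken from the example above.

```javascript
// Hand-written stand-in for a keyword node (not a real XMLNode).
function fakeKeyword(example, content) {
  return {
    content: content,
    getAttribute: function (name) {
      return name == "example" ? example : null;
    }
  };
}

// Hypothetical helper: describe every keyword in the list, however many there are.
function describeKeywords(keywords) {
  var lines = [];
  for (var i = 0; i < keywords.length; i++) {
    lines.push("Keyword (example: " + keywords[i].getAttribute("example") +
               "): " + keywords[i].content);
  }
  // An empty list means [0] would have been undefined.
  if (lines.length == 0)
    lines.push("No keyword found");
  return lines;
}

var none = describeKeywords([]);
var two = describeKeywords([
  fakeKeyword("yes", "example keyword"),
  fakeKeyword("no", "another keyword")
]);
// none is ["No keyword found"]
// two[0] is "Keyword (example: yes): example keyword"
```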

The last lines get the attribute and display it together with the element's content:

Seed.printf ("Story info Keyword (example: %s): %s",
  keyword.getAttribute("example"),
  keyword.content);

For those new to C-style formatting, the details of keyword are printed using printf and the format specifier %s. Each %s in the line of text to be printed is substituted before output with one of the two parameters, keyword.getAttribute("example") and keyword.content, in order.
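
To see what the %s specifiers do, the substitution can be imitated in a few lines of plain JavaScript. The formatS function is made up for this illustration and handles only %s; it is not how Seed.printf is implemented.

```javascript
// Illustrative only: replace each %s, left to right, with the next argument.
function formatS(template) {
  var args = Array.prototype.slice.call(arguments, 1);
  var index = 0;
  return template.replace(/%s/g, function (match) {
    // Leave the specifier in place if there are no arguments left.
    return index < args.length ? String(args[index++]) : match;
  });
}

var line = formatS("Story info Keyword (example: %s): %s", "yes", "example keyword");
// line is "Story info Keyword (example: yes): example keyword"
```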

While the above example is for illustrative purposes, it would only be useful for XML files with a fixed structure, where the elements are always the same, as in, for instance, config files. If the XML files differ in size, like for instance a word processing XML document where the number of elements is not known, then recursive code like the tree traversal example would have to be used.

Xpath Context

The third way of accessing the contents of the XML file is using an XPath context. This would be used if the structure of the elements is known and fixed, e.g. config files that your program uses. (I am new to XML myself, but I'd assume there are occasions where the XML file structure is not known, e.g. if it was generated by another program, so one of the other types of access would have to be used.)

Again, using JavaScript code from the examples part of the Seed package:

#!/usr/bin/env seed
xml = imports.libxml;
doc = xml.parseFile("./sample.xml");
ctx = doc.xpathNewContext();
results = ctx.xpathEval("//story/body/headline");

print("Headline: " + results.value[0].content);

As always, the first lines open the XML file and store it in the doc object variable:

xml = imports.libxml;
doc = xml.parseFile("./sample.xml");

The next line creates an XMLXPathContext object called ctx from the XML document stored in doc, and xpathEval is then used to get access to the element specified. The result is an object (XMLXPathObj) whose value property holds the list of matching nodes. Please note that the path given to xpathEval has to match the structure of the XML file exactly, from the root down to the element wanted, i.e. body is a child of root and headline is a child of body.

ctx = doc.xpathNewContext();
results = ctx.xpathEval("//story/body/headline");

The last line displays the content of the first item in the list.

print("Headline: " + results.value[0].content);

As there is only one element called headline, results.value[0] is used. If there were more than one, the following command would be used to get the length of the list:

len_list = results.value.length;
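
With the length known, every match can be visited in a loop. In the sketch below the results object is written by hand, with a second headline invented purely for illustration, so that it runs without libxml; in a Seed program it would come from ctx.xpathEval as shown earlier.

```javascript
// Hand-written stand-in for the object returned by ctx.xpathEval(...).
var results = {
  value: [
    { content: "This is the headline" },
    { content: "A second headline" }
  ]
};

var len_list = results.value.length;
var headlines = [];
for (var i = 0; i < len_list; i++)
  headlines.push("Headline: " + results.value[i].content);
// headlines is ["Headline: This is the headline", "Headline: A second headline"]
```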