XPath and XPointer/XPath in Action
Taken on its own terms, as a teaching tool, XPath might not seem to meet the test for a practical standard: it's useful only in the context of some other standard. How do you demonstrate something like XPath without requiring the novice to learn that other standard as well? Luckily, several tools have emerged to simplify this task. These tools allow you to enter and modify an XPath expression (typically, a full location path), returning to you in some highlighted form a selected portion of a target document. (The portion in question might or might not be contiguous, of course, depending on how exotic the location path is.) In this chapter, I'll demonstrate XPath using a tool called XPath Visualiser, developed by Dimitre Novatchev.
XPath Visualiser can be downloaded from the VBXML site, at http://www.vbxml.com.
XPath Visualiser: Some Background
XPath Visualiser runs under Microsoft Windows, from Windows 95 on up, and is built on top of the Microsoft MSXML XML/XSLT processor included with the Internet Explorer browser. This operating environment for the tool implies some advantages and disadvantages to its use.
An important practical advantage of this tool is that the results are visual. As we go through the examples in this chapter, you'll be able instantly to see the effects — subtle or grand — of changes in XPath expressions. (You don't even need to use Windows, let alone XPath Visualiser itself, because all these effects are captured in screen shots for you.) Trying to explain verbally what an XPath expression "does" is a convenient way to extend a book's length, but it's not simple, and it's prone to misinterpretation. (A picture of an XPath expression is worth a thousand words of description.)
Next, because XPath Visualiser uses a current version of the MSXML processor, its "understanding" of the XPath Recommendation is complete. If an expression is legal under the terms of that standard, you can illustrate it with XPath Visualiser.
Interestingly, though, a significant disadvantage of using XPath Visualiser is also that it's based on MSXML. That's because MSXML supports not only the current versions of XPath and XSLT, but also an early version of XSLT (called plain-old XSL). I described this early version in Chapter 1 and Chapter 2. Among the differences in this "backward-compatible" XSL processor is that it included numerous Microsoft-only capabilities; for example, you could use their version of what became XPath to select a valid document's document type declaration. (Note that this isn't a problem with XPath Visualiser itself, which deals only with true-blue XPath; it may be something to consider if you're planning to use MSXML for other purposes of your own.)
XPath Visualiser is not a "program" per se. It's a plain-old frames-based set of HTML documents and a customized version of Microsoft's default XSL(T) stylesheet, which work only when viewed through Internet Explorer Versions 5 and up. (More precisely, it works only with MSXML Versions 3 and up. Internet Explorer 5 and 5.5 do not come with MSXML 3, although you could download and install MSXML 3 to run under them. Internet Explorer 6 comes with the next version of MSXML, Version 4.) Figure 5-1 shows a portion of how the browser window appears when you first open this frameset.
I've suppressed all toolbars except the standard one, to give me as much screen real estate as possible for displaying actual documents. (I've also tweaked the XPath Visualiser default stylesheet; as distributed, the tool displays the document's contents against a pale-blue background, which reproduces poorly in grayscale screen shots.) As you can see, the upper frame includes a number of user-interface controls for specifying the document to be viewed and the location path to be tested or demonstrated. By default, the location path is:
//*
which selects all element nodes in the loaded document. When you've loaded a document using the controls at the top of this frame and clicked on the Select Nodes button, the nodes your location path has selected are highlighted in any of various ways. (The buttons labeled Variables and Keys have to do with XSLT processing and will not be covered here.) The document itself appears in the bottom frame; its display is an enhanced version of the default MSXML/Internet Explorer view of XML documents, showing the document as an expandable/collapsible "tree" of nodes. Because there's no document loaded when you first fire up XPath Visualiser, the main document window initially displays the simple text "XML source document." Once you load a document and specify a location path, the lower frame changes in a manner resembling Figure 5-2.
The lower frame of the window in Figure 5-2 contains an XML document used in Chapter 3. There are a few important things to observe about this changed display. First, XPath Visualiser's interface includes a series of "VCR buttons" (the series of arrowheads beneath the location path in the top frame), which you can use to step through a selected node-set. These VCR buttons are labeled to indicate which node in the node-set is currently selected and how many nodes are in the node-set altogether ("0 of 22/22 matches," in this case). Second, the node(s) selected by the location path in the top frame are highlighted in the lower frame. This highlighting appears in Figure 5-2, and throughout the rest of this chapter, as a bordered, pale gray background. (In the case of elements, as you can see, only the start tags are highlighted.) Finally, note the small vertical black bars to the left of certain elements' start tags. On screen, these are simply shaded + and - signs, placed there to expand and collapse the tree of nodes descending from elements that have descendants. (As you can see, the name and price elements' start tags don't have these black bars, since they don't have expandable/collapsible subtrees.)
For the remainder of the screen shots in this chapter, I'll simply show portions of the lower frame, preceded by the location path in regular code-style font such as:
This will enable me to show larger portions of the document.
Sample XML Document
To keep a consistent base for all the example location paths in this chapter, I'll refer to the same XML source document. This document is short but contains at least one of every XPath node type:
<!-- Basic astrological data for T's and J's signs -->
<?xml-stylesheet type="text/xsl" href="astro.xsl"?>
<astro xmlns:xlink="http://www.w3.org/1999/xlink">
  <sign start-date="12-22" end-date="01-20">
    <name type="main">Capricorn</name>
    <name type="alt">The Sea-Goat</name>
    <!-- capricorn.gif corresponds to Unicode 3.0 #x2651 -->
    <symbol xlink:type="simple" xlink:href="capricorn.gif"/>
    <ruling_planet>Saturn</ruling_planet>
    <ruling_planet>Earth</ruling_planet>
    <energy>Feminine</energy>
    <quality>Cardinal</quality>
    <anatomy>
      <part>Bones</part>
      <part>Knees</part>
    </anatomy>
  </sign>
  <sign start-date="05-21" end-date="06-22">
    <name type="main">Gemini</name>
    <name type="alt">The Twins</name>
    <!-- gemini.gif corresponds to Unicode 3.0 #x264A -->
    <symbol xlink:type="simple" xlink:href="gemini.gif"/>
    <ruling_planet>Mercury</ruling_planet>
    <element>Air</element>
    <energy>Feminine</energy>
    <quality>Mutable</quality>
    <anatomy>
      <part>Hands</part>
      <part>Arms</part>
      <part>Shoulders</part>
      <part>Lungs</part>
    </anatomy>
  </sign>
</astro>
The document in question describes elementary properties of two of the Western-style astrological signs, Capricorn and Gemini. When first loaded into XPath Visualiser with the default "all elements" location path selected, it appears as shown in Figure 5-3.
General to Specific, Common to Far-Out
I'll start out with some fundamental location paths, such as those selecting elements of a particular name, and move on to some special cases (such as examples using axes and predicates). The chapter will include a number of bizarre location paths probably unlike any you'd actually use, but at least theoretically (if not practically!) legitimate. Along the way, I'll poke into XPath functions, numeric operators, and so on. Each screen shot of XPath Visualiser's lower frame is accompanied by a brief English-language description of what's depicted.
(If you're feeling sufficiently adventurous, you might want to guess what the location paths select before looking at the corresponding screen shots.)
The Node Test
As a reminder, XPath is capable of locating the following seven types of nodes: root, element, attribute, comment, PI, namespace, and text. There's also a special node( ) "node test," which locates nodes of any type along the selected axis. I'll cover the attribute and namespace node types in a moment, but for now, here's how XPath (via XPath Visualiser) selects on the other types.
The simplest of these is, of course, the root node itself. The location path to the root node consists of a single slash:
/
XPath Visualiser depicts the result as shown in Figure 5-4.
Actually, the first thing you see when selecting on the simple / location path is an error message; only after clearing this error message are you greeted by the above. XPath Visualiser seems not to know how to visually represent the root node — not that I know how to, either!
Of course, as in Figure 5-2 and Figure 5-3, you've already seen the results of selecting all elements in the document. Figure 5-5 is based on a location path identifying specific elements: the part elements, in this case:
//part
Notice how the highlighting has shifted; only those element nodes whose names are "part" are now selected. The sample document contains three comments. To select them, use the following (the results are as shown in Figure 5-6):
//comment( )
There's only one PI in the sample document, which is the xml-stylesheet PI in the document's prolog. You can select it using either of the following two location paths. In either case, the result is the same, as shown in Figure 5-7.
//processing-instruction( )
//processing-instruction("xml-stylesheet")
To select all text nodes, use:
//text( )
This isolates all text nodes in the document; see Figure 5-8 for XPath Visualiser's depiction.
Finally, to select all element, comment, PI, and text nodes in a single step, use the node( ) special node type:
//node( )
See Figure 5-9 for the result.
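If you don't have XPath Visualiser at hand, the same four node types can be tallied with a short Python sketch. It relies on the standard library's xml.dom.minidom, which has no XPath engine, so the tree walk only imitates what //node( ) selects. The DOC string is an abbreviated, whitespace-free excerpt of the sample document (an assumption made to keep the counts easy to check), and the helper name descendants is hypothetical.

```python
from collections import Counter
from xml.dom import Node, minidom

# Abbreviated excerpt of the sample document (an assumption, for brevity).
# It is written without indentation so no whitespace-only text nodes appear.
DOC = (
    "<!-- Basic astrological data for T's and J's signs -->"
    '<?xml-stylesheet type="text/xsl" href="astro.xsl"?>'
    '<astro xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<sign start-date="12-22" end-date="01-20">'
    '<name type="main">Capricorn</name>'
    '<name type="alt">The Sea-Goat</name>'
    '<!-- capricorn.gif corresponds to Unicode 3.0 #x2651 -->'
    '<symbol xlink:type="simple" xlink:href="capricorn.gif"/>'
    '<anatomy><part>Bones</part><part>Knees</part></anatomy>'
    '</sign></astro>'
)


def descendants(node):
    """Yield every node below `node` in document order (roughly //node( ))."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)


doc = minidom.parseString(DOC)
counts = Counter(n.nodeType for n in descendants(doc))

print("elements:", counts[Node.ELEMENT_NODE])
print("comments:", counts[Node.COMMENT_NODE])
print("PIs:     ", counts[Node.PROCESSING_INSTRUCTION_NODE])
print("text:    ", counts[Node.TEXT_NODE])
```

The walk follows only parent-to-child links, so attributes never turn up in the tally.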
One interesting note about these results is that — as discussed in Chapter 3 — neither attribute nor namespace nodes are "visible" (or highlighted in Figure 5-9) along the default child:: axis. To access either, you must employ the attribute:: or namespace:: axis, respectively. For instance, either of the following works to select all attributes in the sample document:
//@*
//attribute::*
As you can see in Figure 5-10, XPath Visualiser selects the attributes as complete name-value pairs.
Namespace nodes are a special case, in XPath Visualiser as in most other contexts. As the README file accompanying the utility says:
This tool will not display selected nodes that were not explicitly specified in the text of the xml source document. Most notably this is true for (propagated) namespace nodes . . . .However, the containing nodes are still [highlighted].
That is, for namespace nodes, XPath Visualiser does not highlight all elements within scope of the element declaring a given namespace, but only the declaration within the declaring element itself. For instance, this location path:
results in a display like Figure 5-11.
In Figure 5-11, I've shown the "X of Y/Z matches" information in the upper frame. For other node types, the Y value in this phrase equals the Z. For namespace nodes, though, XPath Visualiser sets Y equal to the number of namespace-declaring elements matching the location path and Z equal to the number of elements within scope of the selected namespace declarations. If you refer back to the full code listing, you will see that (as Figure 5-11 shows) there are 26 elements, including the astro element itself, within scope of the astro element's declaration of the xlink namespace.
(Remember, by the way, the built-in namespace associated with all XML documents, the one bound to the xml: prefix. If you change the above location path to:
//namespace::*
XPath Visualiser changes the "Z" in "X of Y/Z" to 52 — that is, 26 namespace nodes for the xlink namespace and 26 for the xml namespace.)
Finally, to select a document's entire contents, you'd use a compound location path:
//node( ) | //@* | //namespace::*
Figure 5-12 depicts the result.
Previous examples have already demonstrated some of the simpler axes, that is, the child::, attribute::, and namespace:: axes. (Many of the previous examples also demonstrated, without explicit comment, the use of the descendant-or-self:: axis, as abbreviated //.) Let's take a look at some of the other axes now. Note that to use many of these other "family relationships," we'll typically use one or more location steps to navigate to some particular node-set in the document followed by a location step, which "turns the viewpoint" along the axis in question.
The parent:: axis, usually abbreviated .., looks "up" from the context node one level in the document's tree of nodes. A location path like:
//part/..
locates all parent elements of any elements named "part" — in the case of our sample document, as shown in Figure 5-13, the two anatomy elements.
The parent of an attribute, comment, text node, or PI is the element that contains it (or, for comments and PIs in the document prolog, the root node). So:
//comment( )/../@*
(as you can see in Figure 5-14) selects all attributes of all elements that are parents of (that is, that contain) any comment nodes.
An important concept this screen shot illustrates is that although a full location path may contain references to many nodes at many levels of the document tree, only the final location step in the path identifies the nodes that will actually be selected. Here, neither the comment nodes nor their parents are highlighted by XPath Visualiser. As the final location step in the path indicates, only the attributes of those parents are ultimately selected.
XPath does not define a simple sibling:: axis; to get all siblings of a given node, you must use the preceding-sibling:: and following-sibling:: axes together. Something like this (note that this is a single compound location path wrapped over two lines):
//processing-instruction("xml-stylesheet")/preceding-sibling::node( ) | //processing-instruction("xml-stylesheet")/following-sibling::node( )
This selects the siblings of the xml-stylesheet PI, as shown in Figure 5-15.
As the xml-stylesheet PI is located in this document's prolog, it has one preceding sibling (the opening comment) and one following (the document's root astro element).
As discussed in Chapter 3, the preceding:: and following:: axes locate nodes that terminate before or begin after (respectively) the full scope of a given node's markup. They differ from preceding-sibling:: and following-sibling:: in not requiring a "shared parent" condition. The following location path:
as you can see in Figure 5-16, selects not only that quality element's anatomy sibling, but also that anatomy element's children and all the other elements that follow the close of the quality element — even those otherwise unrelated (except distantly) to it.
The ancestor:: and descendant:: axes, of course, restrict the view from a given node to the same branch of the family tree in an up or down direction, respectively. Thus:
locates (as shown in Figure 5-17) the anatomy parent of that part element, the sign parent of that anatomy element, and the astro parent of that sign element. (It also locates the root node but, as explained earlier, XPath Visualiser has no way to "highlight" the root node.)
Adding the -or-self qualifier to the ancestor:: or descendant:: axis, on the other hand, selects not only that chain of parents but also the context node itself. The location path:
thus adds to the node-set selected by the preceding example the indicated part element itself. Figure 5-18 illustrates.
Chapter 3 noted that while the axis "turns the view" in a particular direction from the context node, to further refine the list of nodes to be selected from among all those visible in that direction you must use a predicate. For instance:
//name[@type="alt"]
selects only those name elements whose type attributes have the value alt. As you can see from Figure 5-19, this prunes the node-set of all name elements in the document down to just two — the ones whose string-values are The Sea-Goat and The Twins — and excludes those (string-values Capricorn and Gemini) whose type attributes have some value other than alt.
While on the subject of selecting via attribute values, by the way, this might be a good moment to illustrate the different effects produced by two similar but not identical predicates. First, consider this location path:
//*[@type!="alt"]
Figure 5-20 shows how this selects all elements in the source document whose type attribute does not equal alt.
Only four elements have a type attribute, and of those only two do not have the indicated value.
Chapter 4, under the discussion of the not( ) function, described how in some cases it might seem "obvious" that the function performs identically to the != "not equal to" operator, when in fact it does not. That is, the preceding location path behaves differently from the following:
//*[not(@type="alt")]
As you can see from Figure 5-21, this location path selects all element nodes which do not have a type attribute whose value is alt — including all element nodes with no type attribute at all. Quite a difference!
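The difference between the two predicates can be spelled out in a few lines of Python. This is only a sketch: it uses the standard library's xml.etree.ElementTree to parse an abbreviated excerpt of the sample document (an assumption, for brevity) and expresses each predicate as an ordinary Python condition rather than running real XPath.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the sample document (an assumption, for brevity).
DOC = (
    '<astro><sign start-date="12-22" end-date="01-20">'
    '<name type="main">Capricorn</name>'
    '<name type="alt">The Sea-Goat</name>'
    '<energy>Feminine</energy>'
    '</sign></astro>'
)
root = ET.fromstring(DOC)
elements = list(root.iter())  # every element in document order, like //*

# //*[@type!="alt"]: a type attribute must exist AND differ from "alt".
not_equal = [e.tag for e in elements
             if e.get("type") is not None and e.get("type") != "alt"]

# //*[not(@type="alt")]: true unless a type attribute equal to "alt" exists,
# so elements with no type attribute at all pass the test.
negated = [e.tag for e in elements if e.get("type") != "alt"]

print(not_equal)
print(negated)
```

The first list holds only the "main" name element; the second also picks up astro, sign, and energy, which have no type attribute.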
Arguably the most common predicate test is one that selects nodes from among a candidate node-set based on their positions within that node-set. You can use the position( ) function for this test; when simply testing for a single specific position, you can use the literal position number (or an expression that evaluates to a number) as the predicate. Thus, the following two location paths are identical:
//*[position( )=3]
//*[3]
Using the sample document as its source, XPath Visualiser displays the result shown in Figure 5-22.
As you can see, the result is a little surprising. The location path doesn't select simply the third element in the document; it selects the third child element of every element in the document. (As long as the parent element has at least three children, of course. Elements with fewer than three children have none of their children selected. The location path //*[3] might be read as, "Locate all elements in the document whose position along the (default) child:: axis equals 3.")
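A small Python sketch may make the "third child of every parent" reading concrete. It uses the standard library's xml.etree.ElementTree only to parse an abbreviated excerpt of the Gemini data (an assumption, for brevity); the position test itself is done with plain list indexing, not with an XPath engine.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the sample document (an assumption, for brevity).
DOC = (
    '<astro><sign>'
    '<name type="main">Gemini</name>'
    '<name type="alt">The Twins</name>'
    '<ruling_planet>Mercury</ruling_planet>'
    '<anatomy><part>Hands</part><part>Arms</part>'
    '<part>Shoulders</part><part>Lungs</part></anatomy>'
    '</sign></astro>'
)
root = ET.fromstring(DOC)

# Position is counted per parent: take child number 3 (index 2) of every
# element that has at least three child elements.
third_children = [list(parent)[2] for parent in root.iter() if len(parent) >= 3]

# By contrast, the third element of the whole document in document order.
third_in_document = list(root.iter())[2]

print([e.text for e in third_children])
print(third_in_document.text)
```

In XPath 1.0 itself, the "third element in the whole document" reading would be written with parentheses, as (//*)[3], so that the position applies to the complete node-set rather than to each parent's children.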
Also remember that a node's position along a given axis depends on the axis's direction, forward or reverse. In particular, the ancestor::, ancestor-or-self::, preceding::, and preceding-sibling:: axes are reverse axes. All others (except the special case self:: axis, for obvious reasons) are forward axes. The position is counted starting at the context node and proceeding in the direction of the axis towards the beginning of the document (reverse axes) or the end of the document (forward axes). Consider this location path:
XPath Visualiser selects the first preceding sibling of the Mutable quality element in reverse document order, as you can see in Figure 5-23.
If all axes were in the forward direction only, the preceding location path would have located the first Gemini name element — that is, the first preceding sibling in document order of the Mutable quality element. If you want to get the first node in document order when using a reverse axis, don't use the absolute position 1 in the predicate; use the last( ) function, as here:
Now XPath Visualiser (or any other XPath 1.0-compliant processor) will indeed select that first Gemini name element, as shown in Figure 5-24.
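The effect of a reverse axis on position numbering can be imitated in Python by reversing a list. The sketch below parses an abbreviated Gemini sign element (an assumption, for brevity) with the standard library's xml.etree.ElementTree and builds the preceding-sibling list by hand; it does not run real XPath.

```python
import xml.etree.ElementTree as ET

# Abbreviated Gemini sign from the sample document (an assumption, for brevity).
DOC = (
    '<sign>'
    '<name type="main">Gemini</name>'
    '<name type="alt">The Twins</name>'
    '<ruling_planet>Mercury</ruling_planet>'
    '<element>Air</element>'
    '<energy>Feminine</energy>'
    '<quality>Mutable</quality>'
    '<anatomy><part>Hands</part></anatomy>'
    '</sign>'
)
sign = ET.fromstring(DOC)
children = list(sign)
context = sign.find("quality")        # the Mutable quality element
idx = children.index(context)

# preceding-sibling:: is a reverse axis: position 1 is the sibling nearest
# to the context node, so the document-order slice is reversed.
preceding = children[:idx][::-1]

nearest = preceding[0]     # position 1 on the reverse axis
farthest = preceding[-1]   # position last( ) on the reverse axis

print(nearest.tag, nearest.text)
print(farthest.tag, farthest.text)
```

Position 1 lands on the energy element just before quality, while last( ) reaches back to the first name element.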
For the most part, as explained in Chapter 4, XPath functions are useful primarily in the predicates of location steps. They serve to narrow the focus to particular nodes in a candidate node-set in ways that can't be tested directly, for example, by checking the nodes' string-values.
Among the node-set functions, the most esoteric are probably those having to do with namespaces. Still, these can be useful in ways completely unapproachable by any other means. In our sample document, we've got both an href pseudoattribute (on the xml-stylesheet PI) and a couple of xlink:href attributes (on the symbol elements). Because the strings "href" and "xlink:href" are clearly not equal (and because a PI's pseudoattributes are invisible along the regular attribute:: axis), it might seem impossible to construct a location path that locates all hyperlink references in the document. It can be done, however (assuming all such references appear either in a PI or as values of xlink:href attributes), with a compound location path such as:
//@*[local-name()="href"]/.. | //processing-instruction( )[contains(., "href")]
This location path, applied to our sample document, locates the nodes shown in Figure 5-25.
Note in Figure 5-25 that the local-name( ) function serves to strip the namespace prefix from the attributes associated with the xlink: namespace. Testing for some value in the PI requires use of a string function, like contains( ) here, because everything except the PI's name itself is considered (in XPath terms) one big string-value.
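The same two tests can be imitated in Python with the standard library's xml.dom.minidom, which exposes an attribute's prefix-free name as localName and a PI's content as a single data string. This is a sketch over an abbreviated excerpt of the sample document (an assumption, for brevity), with a hypothetical descendants helper standing in for the // abbreviation.

```python
from xml.dom import Node, minidom

# Abbreviated excerpt of the sample document (an assumption, for brevity).
DOC = (
    '<?xml-stylesheet type="text/xsl" href="astro.xsl"?>'
    '<astro xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<sign start-date="12-22" end-date="01-20">'
    '<name type="main">Capricorn</name>'
    '<symbol xlink:type="simple" xlink:href="capricorn.gif"/>'
    '</sign></astro>'
)


def descendants(node):
    """Yield every node below `node` in document order."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)


hits = []
for node in descendants(minidom.parseString(DOC)):
    if node.nodeType == Node.ELEMENT_NODE:
        # Like //@*[local-name()="href"]/.. : keep elements that own an
        # attribute whose name, minus any namespace prefix, is "href".
        if any(attr.localName == "href" for attr in node.attributes.values()):
            hits.append(node.tagName)
    elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        # A PI's pseudoattributes are one big string, so search it as text.
        if "href" in node.data:
            hits.append(node.target)

print(hits)
```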
One of the Boolean XPath functions, boolean( ), can be used to test explicitly for the very existence of a node, especially relative to the context node. For instance:
//*[boolean(child::*)]
selects all elements in the document that have any child elements at all. This can also be abbreviated, taking advantage of various defaults and shortcuts, to the more enigmatic form:
//*[*]
In either case, the result of applying this location path to our sample document is as shown in Figure 5-26.
When you turn to XPath string functions, you really start to open up the doors to fine-tuned (sometimes almost bizarrely so) location paths. If for some reason you wanted to locate all nodes whose string-values began with a capital "M" or ended with a lowercase "e," you could use this location path:
//node()[starts-with(., "M") or substring(., string-length( ), 1)="e"]
(Note that because there's no ends-with( ) function available under XPath 1.0, we have to simulate its purpose using the substring( ) function, starting with position N in an N-length string for a length of one character.)
This location path, applied to our sample document, is processed by XPath Visualiser as shown in Figure 5-27.
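The ends-with simulation is easy to get wrong by one position, so here is a minimal Python sketch of the arithmetic. The function xpath_substring is a hypothetical, simplified stand-in for XPath 1.0's substring( ): it handles integer arguments only and ignores the rounding rules the Recommendation defines for non-integers.

```python
def xpath_substring(s, start, length):
    """Simplified XPath 1.0 substring( ): integer arguments only.

    Keeps the characters whose 1-based position p satisfies
    start <= p < start + length.
    """
    return "".join(ch for p, ch in enumerate(s, start=1)
                   if start <= p < start + length)


def ends_with_char(s, char):
    # Mirrors substring(., string-length(.), 1) = char
    return xpath_substring(s, len(s), 1) == char


# String-values taken from the sample document.
for value in ["Mercury", "Feminine", "Mutable", "Air"]:
    selected = value.startswith("M") or ends_with_char(value, "e")
    print(value, selected)
```

Because XPath positions start at 1, the last character of an N-character string sits at position N, which is why the location path passes string-length( ) as the starting position.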
As I've said, it's nearly impossible to theorize a portion of an XML document's content that cannot be located with XPath. That said, even some straightforward English-language questions can be answered only by very complex, even bizarre, XPath location steps. And even when the questions can be answered simply, it's possible — if your inclinations run to the perverse — to come up with incredible convolutions of syntax. Here are a couple of examples.
For starters, look back at the sample document of astrological data, particularly at the contents of the part elements (within the two anatomy elements). Note that for any given astrological sign, the text nodes contained by the part elements identify either singular or plural body parts. (Our sample document, as it happens, includes only plurals, such as Bones and Shoulders.) So let's start by asking this English-language question:
What are the main names of all astrological signs with at least one plural part element?
The easiest way to build up a long XPath location path is step by step, confirming that each step along the way does what it needs to do. In this case, the place to start might be at the end of the question: which part elements have plural text nodes? A location path to accomplish this might look something like this:
//part[substring(., string-length( ),1)="s"]
That is: locate all part descendants of the root node, substring the last character in each of their string-values, and select only those for which that substring equals "s." (Of course, this would fail to locate any part element whose string-value is "Teeth." This is not an issue given the two astrological signs in question but be aware of such little wrinkles in making assumptions about your own documents' contents.) Applied to our sample document, XPath Visualiser comes up with the selection shown in Figure 5-28.
Working backwards through our English-language question and comparing it to the sample document structure, the next thing we're evidently seeking is the sign element corresponding to any of the selected "plural body parts" located by the existing location path. As is usual with XPath, there are a number of ways to locate such a sign element. One way would be to use the ancestor:: axis, as here (additional location step boldfaced):
//part[substring(., string-length( ),1)="s"]/ancestor::sign
As Figure 5-29 illustrates, the location path now walks the selection back up the document tree to the corresponding sign elements.
The English-language question now says we need to locate the "main names" of all these signs. In terms of the document's structure, this can be interpreted as "all child name element(s) of the selected sign element(s) that have a type attribute whose value is main." Now the location path looks as follows:
//part[substring(., string-length( ),1)="s"]/ancestor::sign/name[@type="main"]
Figure 5-30 shows how this location path works in practice.
One further refinement: as you can see from Figure 5-30, the location path as it stands locates the desired name element(s) (with their start tags highlighted by XPath Visualiser). If we really want to locate the main names of the selected signs, we need to locate not the elements themselves, but rather the text nodes that make up their string-values. So our full location path would be:
//part[substring(., string-length(.),1)="s"]/ancestor::sign/name[@type="main"] /text( )
(Note that this location path breaks across two lines here, but actually is a single line for XPath Visualiser's purposes; in XPath's own terms, breaking this expression across two lines like this is quite acceptable.)
In Figure 5-31, as you can see, XPath Visualiser finally answers our original question. It locates that actual name for which we're looking.
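For comparison, the same question can be answered procedurally. The Python sketch below uses the standard library's xml.etree.ElementTree on an abbreviated excerpt of the sample document (an assumption, for brevity). ElementTree elements carry no parent pointers, so instead of climbing from each part to its ancestor sign, the loop starts at each sign and looks down; the resulting names should be the same.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the sample document (an assumption, for brevity).
DOC = (
    '<astro>'
    '<sign><name type="main">Capricorn</name>'
    '<name type="alt">The Sea-Goat</name>'
    '<anatomy><part>Bones</part><part>Knees</part></anatomy></sign>'
    '<sign><name type="main">Gemini</name>'
    '<name type="alt">The Twins</name>'
    '<anatomy><part>Hands</part><part>Arms</part></anatomy></sign>'
    '</astro>'
)
root = ET.fromstring(DOC)

main_names = []
for sign in root.iter("sign"):
    # Does this sign contain at least one part whose text ends in "s"?
    has_plural = any((part.text or "").endswith("s")
                     for part in sign.iter("part"))
    if has_plural:
        # Like name[@type="main"]/text( ) relative to the sign element.
        main_names += [n.text for n in sign.findall("name[@type='main']")]

print(main_names)
```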
One more example, this one based on (perhaps quite unreasonable!) assumptions about the way this document (and any other in the same vocabulary) is structured: each symbol element is immediately preceded by a comment identifying the Unicode 3.0 character corresponding to the image file for that sign's symbol. Also note that a sign may have one or more body parts (Bones and Knees for Capricorn; Hands, Arms, Shoulders, and Lungs for Gemini). Given these assumptions, we might frame a question such as the following:
What is the name of the image file and the Unicode character equivalent for the symbol of each sign with more than two body parts?
As with the previous example, let's begin at the end of the question by locating all the signs with more than two body parts:
//sign[count(descendant::part) > 2]
Figure 5-32 shows that this selects only one sign element (Gemini).
How to proceed next may seem a little complicated, thanks to the presence in our question of the word "and." All it really means, though, is that we'll be constructing a compound location path. We can work on either the "image file" or the "Unicode character" subordinate location path first; however, because we're going for baroque (sorry) here, let's assume that we want to get to the Unicode character by way of the corresponding image. The image for this sign element can be singled out thus:
//sign[count(descendant::part) > 2]/symbol/@xlink:href
That is, from the selected signs, walk down to their symbol children and then select each symbol's xlink:href attribute. Figure 5-33 illustrates the result.
Now we've got to add a second location path, joined to the first by the union operator (|, the vertical bar or pipe symbol). For this second location path, we're going to navigate down to the same point as the first, but then go back to the preceding comment node:
//sign[count(descendant::part) > 2]/symbol/@xlink:href | //sign[count(descendant::part) > 2]/symbol/@xlink:href/../preceding-sibling::comment( )
An important part of this second location path is the /.. buried within it, which shifts the context for succeeding location steps back up the document tree from the xlink:href attribute, to its parent symbol element. If you omit this location step, the location path attempts to select all preceding siblings of the attribute itself — which is almost never what you want (in answering this question or any other: it always returns an empty node-set).
As you can see in Figure 5-34, we've succeeded in locating all information in the document about the symbols of all signs with more than two body parts.
Figure 5-34. Locating all Unicode and image-file representations of the symbols for all signs with more than two body parts
By the way, although it doesn't matter for this particular sample document, note that the compound location path is susceptible to breaking — returning an incorrect result — in at least one case. If there's more than one comment that is a preceding sibling for a given symbol, the location path will select them all. Thus, to make the location path more robust, you might consider adding a predicate to the final location step, like this:
/comment( )[contains(.,"corresponds to Unicode 3.0")]
Again, adding this predicate has no effect in the case of this particular document. There are other built-in assumptions in the full location path that may or may not be true in other documents in the "astrology markup language." For example, the location path takes it for granted that each symbol element will have an xlink:href attribute; to be even more bullet-proof, the path might choose to ignore symbol elements without that attribute. This depends of course on your application's specific needs. Just remember that as a rule, if you don't cover the unexpected in your location paths, XPath won't cover it for you!
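To see those defensive checks written out, here is a Python sketch of the whole query using the standard library's xml.dom.minidom. It is an imitation, not real XPath: the document is an abbreviated excerpt of the sample (an assumption, for brevity), the helper preceding_comments is hypothetical, and both the "skip symbols without xlink:href" and the "comment must mention Unicode 3.0" guards are spelled out as ordinary conditions.

```python
from xml.dom import Node, minidom

XLINK = "http://www.w3.org/1999/xlink"

# Abbreviated excerpt of the sample document (an assumption, for brevity).
DOC = (
    '<astro xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<sign><name type="main">Capricorn</name>'
    '<!-- capricorn.gif corresponds to Unicode 3.0 #x2651 -->'
    '<symbol xlink:type="simple" xlink:href="capricorn.gif"/>'
    '<anatomy><part>Bones</part><part>Knees</part></anatomy></sign>'
    '<sign><name type="main">Gemini</name>'
    '<!-- gemini.gif corresponds to Unicode 3.0 #x264A -->'
    '<symbol xlink:type="simple" xlink:href="gemini.gif"/>'
    '<anatomy><part>Hands</part><part>Arms</part>'
    '<part>Shoulders</part><part>Lungs</part></anatomy></sign>'
    '</astro>'
)


def preceding_comments(node):
    """Yield the text of comments among the preceding siblings, nearest first."""
    sibling = node.previousSibling
    while sibling is not None:
        if sibling.nodeType == Node.COMMENT_NODE:
            yield sibling.data.strip()
        sibling = sibling.previousSibling


results = []
doc = minidom.parseString(DOC)
for sign in doc.getElementsByTagName("sign"):
    # Like count(descendant::part) > 2
    if len(sign.getElementsByTagName("part")) <= 2:
        continue
    for child in sign.childNodes:
        if child.nodeType != Node.ELEMENT_NODE or child.tagName != "symbol":
            continue
        # Guard: ignore symbol elements that lack an xlink:href attribute.
        if not child.hasAttributeNS(XLINK, "href"):
            continue
        href = child.getAttributeNS(XLINK, "href")
        # Guard: keep only comments that look like the Unicode note.
        notes = [c for c in preceding_comments(child)
                 if "corresponds to Unicode 3.0" in c]
        results.append((href, notes))

print(results)
```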