The .NET Platform
Development Tools
COM & COM+
Data Access
Web Development
XML Technologies
Windows Servers
Wireless & Mobile
Security issues
Design & Process
Career Development
Analysis & Comment
Disposable Objects
You are not logged in: login here to access all areas.
Whatever the reason for its popularity, we cannot afford to ignore XML, and neither can Microsoft. As Mark Whitehorn explains, the new version of SQL Server will have several ways of handling XML, some inherited from SQL Server 2000 and some entirely new.
Author: Mark Whitehorn
Last updated: Sep 2004
1. Change engine 2. Translate 3. Use a native XML data type Going native Each field contains a discrete piece of information which is ‘atomic’, by which we mean that it cannot be decomposed into smaller units that still have a meaning for the entity referred to by the row. You might argue that FName could be decomposed into its individual characters, and you’d be right, but the individual characters in isolation play no part in defining a particular customer.To put it another way, a field in a relational database should only contain data that has no internal structure. SELECT fieldnames This is no longer true if a column has internal structure. The XML in that field might have information about, say, the time the customer called us to make a complaint. However, that information is buried within the XML and cannot be sorted using a standard SQL query. Using the XML data type Semi-structured data Sparse data Ordered data Markup data Data that changes type between databases Data that changes meaning within a single system Data that needs to be published in several media Other uses to which you can put the XML data type include: Using the XML data type <order orderyear="2004" orderno="1234" loccode="1234567"> You might create an XML index called ‘ixDetails’ on the XML column: The following could be used to query the data. Note the use of an XPath query embedded within the WHERE clause of the SQL query - note also that this syntax could change prior to the release of SQL Server 2005! SELECT ShippingDate, Details Of course XML data has an internal structure and, as discussed above, this opens up a whole new world of querying and indexing possibilities. For example, you may well want to create what are called secondary XML indexes to speed up particular queries. In this case we are performing a path-based query, so a secondary XML index of type PATH is going to speed things up just a tad. Translation SELECT fieldnames The result is an XML document where column values map to elements or attributes. RAW mode AUTO mode EXPLICIT mode Your options
Whatever the reason for its popularity, we cannot possibly afford to ignore XML, and neither can Microsoft. The new version of SQL Server, previously codenamed ‘Yukon’ and now named SQL Server 2005, will have several ways of handling XML, some inherited from SQL Server 2000 and some entirely new.
If we want to store and manage XML in a database, we have essentially three choices:
We can move to an XML database engine such as dbXML (www.dbxml.com) or Tamino (www.softwareag.com). Here the unit of storage is an XML document; the database engine supports XML querying; and so on. This is a great solution as long as you only want to store XML data. But there is a huge investment in relational databases out there in the world. What if you want to mix XML and relational data?
Alternatively you can use a relational database engine such as SQL Server or DB2, and you can elect to store all of the data therein as relational, no matter what its source. When you need to export data as XML, you use some form of translator which will read relational and write XML. When you need to import, you use a translator that ‘shreds’ the incoming XML data into relational format.
There are extensions to SQL Server 2000 which enable you to do this, or you can download SQLXML 3.0 from the Microsoft site. This is an addition to SQL Server 2000 that adds a great deal of client side support for XML. You can use it to build ADO and ADO.NET clients that can consume data from SQL Server as XML, and you can build XSD schemas that you annotate so that you can map your XML schema to your relational model and have it automatically translated for you.
The third option is to add an XML data type to the relational database engine itself. This is what Microsoft has done with SQL Server 2005.
The rest of this article focuses mainly on the XML data type, but makes some references to the translation mechanism in order to clarify the differences and examine the pros and cons of both mechanisms. However, it is important to realise that they are not simply variations on a theme: they are fundamentally different. In addition, the existing translation process should certainly not be seem as a temporary fix – a kludge added to SQL Server until the XML data type was ready. Both are important ways of handling XML, and both have pros and cons.
Using the XML data type has important implications. Relational databases impose order on data. Database people are (or should be) very meticulous about the way in which data is stored in a relational database. They are very picky about things like primary and foreign keys, and referential integrity. There is, however, a reason for all this pedantry. If you do follow the rules, then your reward is a database that behaves in a highly predictable way. Indeed, facets of its behaviour are guaranteed. Break the rules and you lose one or more of the guarantees.
One of those rules concerns the atomicity of data. A typical relational table might look like this:
CustomerID
LName
FName
MaritalStatus
Gender
DOB
1
Albertson
Sally
M
F
16/03/1967
2
Jones
Bill
M
M
12/08/1976
3
Smith
Jenny
M
F
2/09/1963
4
Bobson
Mike
M
M
12/02/1978
5
Weller
Sarah
S
F
16/02/1975
6
Waller
Richard
S
M
16/08/1978
The problem with XML in this context is that an XML document is useful precisely because it can have a complex internal structure. Furthermore it can contain many repeated elements, which is also a no-no for a field in a relational database.
As soon as we put XML data into a field within a database table then we have broken a rule and we lose some of the predictability that relational databases can give us. As a trivial example, we can use queries to sort data in a table. With atomic data we know that all meaningful sorts can be performed using a query of the general form:
FROM table
ORDER BY fieldname;
When implemented properly, the relational model guarantees a high level of control and predictability. For this to survive the introduction of an XML data type, Microsoft is going to have to develop (and we are going to have to learn) new and different ways of controlling the data.
Please note that I am most definitely not saying that providing an XML data type is bad, wicked or even wrong. I happen to think it is a great idea as it’s going to let us do lots of clever stuff that will directly benefit the business world. But what I am trying to point out is that introducing an XML data type is not a trivial change - it isn’t like adding a new type of integer field or a new char type.
Introducing an XML data type is non-trivial, so there has to be some major gain somewhere along the line otherwise no-one will bother using it. That gain is best expressed in terms of the data that will most benefit from the new data type:
As you might imagine, the XML data type in SQL Server 2005 works really well with ‘semi-structured’ data. A good example might be a note field in a customer table where you aren’t really sure what the structure of the data will be. Each customer would have a standard relational row of data, with date-of-birth data in a date field, name in a VarChar and so on. You could then use an XML data type field to keep notes about conversations you’ve had with them, which could contain semi-structured or unstructured data.
You send out a questionnaire that has 100 questions. Many of the questions are only relevant to some of the people, so on average 25 questions are answered on each of the 100,000 that are returned. In a conventional relational database you would end up with a 100,000 row table with 100 columns, and three quarters of the table would contain null values. If you store the data as XML, you only need represent the data that is actually returned.
Unlike relational data, the data within an XML document is ordered and stays ordered. If X came before Y when you entered the data, then six months later their relative positions are exactly the same – something which cannot be relied upon in a relational database unless you add extra fields. If you are importing data where the position of an item of information relative to its peers is vital, then you should seriously consider storing it in an XML data type. In this case it might well be worth converting relational data into XML simply because that format makes the storage of such information easier.
The XML data type is also well suited to building what amounts to your own document management system. According to my file system, I have written 4,683 Word documents over the last 20 years or so, of which 506 mention SQL Server. I can use the file system to search them, but that is about as far is it goes - and 506 is a huge number of files to trawl through, looking for those that might be relevant.
Suppose I convert all of those documents into XML format and then store them in SQL Server as an XML data type. I can then use XQuery, which allows the querying of XML documents, to narrow the search down to those that are relevant. Yes, I could try it today with full text support, but XQuery gives me so much more control over which parts of the document I am querying and what I am trying to find.
You can apply an optional schema to data that is in XML format. That schema can be used to provide internal type information. Of course, that schema can be different from the schema that someone else, in another database, might apply to the same data. So consider XML for data that needs to change type as it moves between systems.
Once your brain is wrapped around the concept of multiple schemas, it doesn’t take long to realise that you could apply multiple schemas to the same document within a single database, which essentially amounts to late binding of meaning to data.
XML separates the data from the way in which it is presented so, for example, if you have data that may be printed on paper, published to the web and/or sent to mobile devices, then storing it as XML will make that easier.
Assume that we want to store some information about orders as XML. In a table called Order we have a column called ‘Details’ declared to be of type XML. For the sake of argument, the table also has a couple of ‘normal’ columns called ‘ShippingDate’ and ‘ShippingCode’.
A sample of the data in the ‘Details’ field might look like this:
<description>business</description>
<item>
<product>herring</product>
<quantity>2</quantity>
</item>
<item>
<product>haddock</product>
<quantity>4</quantity>
</item>
<price>21.12</price>
</order>
CREATE PRIMARY XML INDEX ixDetails on Order (Details)
FROM Order
WHERE Details.exist ('/order[@orderyear = "2004"]') = 1
CREATE XML INDEX ixDetailsPath on Order (Details)
USING XML INDEX ixDetails FOR PATH
If you don’t want to use the XML data type, then you can store it in relational format and still have it delivered to users and/or applications as XML.This is done through the FOR XML clause that can added to the end of a SELECT statement in SQL Server 2000:
From tablename
WHERE conditions
FOR XML
The database engine supports three modes of translation (if you install the client side components, you also get a fourth NESTED mode):
A normal SELECT statement always returns a single answer table where the structure of each row is identical. RAW mode generates an XML view of just such an answer table. You don’t get any clever hierarchical structuring: you essentially just get a table of data in XML format. Furthermore, the fact that RAW mode returns essentially a single row set means that it is capable of supporting a GROUP BY clause in the query.
Auto essentially generates an XML representation of relational data. It produces attribute-centric or element-centric XML. It will generate hierarchical XML from JOINs where the parent-child relationship is determined by the order of columns in the SELECT clause. You should always use an ORDER BY clause to ensure the correct nesting as AUTO mode does not support GROUP BY.
Explicit generates an XML representation of a universal table, which is essentially a tabular representation of an XML Document.
SQL Server and its adherence to the relational model is about to be seriously shaken up by the addition of an XML data type. We can complain that it offend the purity of the relational model (which it does); we can ignore the addition and hope it goes away (which it won’t); or we can embrace it as not only fun but also useful to the business community. I’ll go for the fun and useful option.
Click here for our Privacy Statement. Copyright © Matt Publishing. All rights reserved. No part of this site may be reproduced without the prior consent of the copyright holder.