Lesson 11 - XML




XML, a formal recommendation from the World Wide Web Consortium (W3C), is similar to the language of today's Web pages, the Hypertext Markup Language (HTML). Both XML and HTML contain markup symbols to describe the contents of a page or file. HTML, however, describes the content of a Web page (mainly text and graphic images) only in terms of how it is to be displayed and interacted with. For example, the letter "p" placed within markup tags starts a new paragraph. XML describes the content in terms of what data is being described. For example, the word "phonenum" placed within markup tags could indicate that the data that followed was a phone number. This means that an XML file can be processed purely as data by a program or it can be stored with similar data on another computer or, like an HTML file, that it can be displayed. For example, depending on how the application in the receiving computer wanted to handle the phone number, it could be stored, displayed, or dialed.
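As a rough illustration of processing XML purely as data, the sketch below uses Python's standard xml.etree.ElementTree module to pull a phone number out of a small document and then store, display, or "dial" it. Only the phonenum tag name comes from the text above; the contact wrapper, the sample number, and the handle_phone_number function are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the <phonenum> tag name comes from the lesson text.
DOCUMENT = "<contact><phonenum>555-0100</phonenum></contact>"


def handle_phone_number(xml_text, action):
    """Extract the phone number and decide what to do with it."""
    root = ET.fromstring(xml_text)
    number = root.findtext("phonenum")
    if action == "display":
        return "Phone: " + number
    if action == "dial":
        # Keep only the digits, as a dialer might.
        return "".join(ch for ch in number if ch.isdigit())
    # Any other action: hand back the raw value so the caller can store it.
    return number


print(handle_phone_number(DOCUMENT, "display"))
print(handle_phone_number(DOCUMENT, "dial"))
```

The same document serves all three uses because the markup says what the data is, not how it should look.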

XML is "extensible" because, unlike HTML, the markup symbols are unlimited and self-defining. XML is actually a simpler and easier-to-use subset of the Standard Generalized Markup Language (SGML), the standard for how to create a document structure. It is expected that HTML and XML will be used together in many Web applications. XML markup, for example, may appear within an HTML page.

XML is a set of rules for encoding documents that can be read by machines and people. XML’s design goals emphasize simplicity, generality, and usability over the Internet. It is a textual data format, with strong support via Unicode for the languages of the world. Although XML’s design focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

There are a variety of programming interfaces which software developers may use to access XML data, and several schema systems designed to aid in the definition of XML-based languages.

As of 2009, hundreds of XML-based languages had been developed.

XML-based formats have become the default for most office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork.

XML is:

XML was not designed to be a replacement for HTML. They serve different purposes. XML was designed to structure, store, and transport data, with a focus on what the data is, while HTML was designed to display data, with a focus on how the data looks. HTML is about displaying information, while XML is about carrying information.

XML separates data from HTML. With XML, data can be stored in separate XML files. This way you can concentrate on using HTML for layout and display, and be sure that changes in the underlying data will not require any changes to the HTML. With a few lines of JavaScript, you can read an external XML file and update the data content of your HTML.

Since XML is saved in text format, it allows data to be exchanged between incompatible systems and allows system upgrades to occur without data loss. With XML, your data can be available to all kinds of "reading machines" (handheld computers, voice machines, news feeds, etc.), which also makes it more accessible to blind people and people with other disabilities.


XML Rules

XML custom tags themselves have no meaning and you should not think that you are creating custom HTML tags. It is important to understand that the sole purpose of these XML tags is to contain data. It is the value assigned to the tag, and not the tag, that is important.

An XML document that obeys the syntax rules is said to be well-formed. If the rules are not obeyed, you will get error messages. Fortunately, these rules are very simple.

  1. An XML tag name must start with a letter or an underscore (_). It cannot begin with the character combination "xml" (in any mix of upper and lower case), which is reserved. XML tags are case-sensitive: <Fish> and <fish> are two different tags.
  2. There must be exactly one root tag:

    <?xml version="1.0" encoding="UTF-8"?>
    <address_book>
        <record>
            <name>
                <first_name>first name goes here</first_name>
                <middle_name>middle name goes here</middle_name>
                <last_name>last name goes here</last_name>
                <nick_name>nick name goes here</nick_name>
            </name>
        </record>
    </address_book>
  3. All of your custom XML tags must be closed:

    <first_name>first name goes here</first_name> (nested model)
    
        or:
    
    <employer id="1" /> (empty model)
  4. Inner nested tags must be closed before you close the outer nested tags:

    <?xml version="1.0" encoding="UTF-8"?>
    <address_book>
        <record>
            <name>
                <first_name>first name goes here</first_name>
                <middle_name>middle name goes here</middle_name>
                <last_name>last name goes here</last_name>
                <nick_name>nick name goes here</nick_name>
            </name>
        </record>
    </address_book>
  5. The attribute value must be enclosed by a pair of single or double quotes.

     <record id="1">
  6. Comments in XML begin with <!-- and end with -->:

    <!-- You can comment XML code in this manner -->
  7. XML has five built-in entity references:

    &lt;    <
    &gt;    >
    &amp;   &
    &apos;  '
    &quot;  "

    These entity references are derived from SGML, hence their appearance in HTML.

    <?xml version="1.0" encoding="UTF-8"?>
    <quiz>
        <question type="multiple" number="1">
            Which is (are) the root element of an HTML document?
            <answers>
                <answer choice="a">&lt;HTML&gt;</answer>
                <answer choice="b">&lt;/HTML&gt;</answer>
                <answer choice="c">&lt;HTML&gt; &amp; &lt;body&gt;</answer>
                <answer choice="d" correct="true">&lt;HTML&gt; &amp; &lt;/HTML&gt;</answer>
            </answers>
        </question>
    </quiz>
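
One way to see the rules above in action is to hand small fragments to a parser and watch which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree module, which checks well-formedness only; the is_well_formed helper and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Each sample obeys or breaks one of the rules listed above.
samples = {
    "single root, all tags closed": "<record><name>Ann</name></record>",
    "two root tags": "<record></record><record></record>",
    "unclosed tag": "<record><name>Ann</record>",
    "wrong nesting order": "<record><name>Ann</record></name>",
    "unquoted attribute": "<record id=1></record>",
    "mismatched case": "<Fish>cod</fish>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))

# Entity references are decoded by the parser before the program sees the text.
answer = ET.fromstring("<answer>&lt;HTML&gt; &amp; &lt;/HTML&gt;</answer>")
print(answer.text)
```

Only the first sample is accepted. The final line prints the answer text with the entity references already replaced by the characters they stand for.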

XML Tag Data

So what can be placed within a tag? Here we come across PCDATA and CDATA. By default, any text placed within tags is Parsed Character DATA (PCDATA): the XML parser scans it for markup and entity references. Character DATA (CDATA), in contrast, is text inside a CDATA section, which the parser passes through without interpreting it as markup.

Recall that in the entity example we had to write &lt; and &gt; so that the parser would deliver the literal characters < and > rather than treat them as the start and end of a tag. Text inside a CDATA section is not parsed, so there is no need for entity references there. White space inside a CDATA section is likewise passed through exactly as written. A CDATA section starts with <![CDATA[ and ends with ]]>, as in the following example:

<question type="multiple" number="2">
    Which is not a form of living matter?
    <answers>
        <answer choice="a"><![CDATA[animal]]></answer>
        <answer choice="b"><![CDATA[plant]]></answer>
        <answer choice="c"><![CDATA[bacteria]]></answer>
        <answer choice="d" correct="true"><![CDATA[minerals]]></answer>
    </answers>
</question>
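
To make the difference concrete, the following sketch (again using Python's standard xml.etree.ElementTree module) writes the same answer text twice, once with entity references and once inside a CDATA section. The variable names are arbitrary; the answer text is borrowed from choice "c" of the entity example.

```python
import xml.etree.ElementTree as ET

# The same content, written with entity references...
with_entities = ET.fromstring("<answer>&lt;HTML&gt; &amp; &lt;body&gt;</answer>")
# ...and inside a CDATA section, where < and & need no escaping.
with_cdata = ET.fromstring("<answer><![CDATA[<HTML> & <body>]]></answer>")

print(with_entities.text)
print(with_cdata.text)
print(with_entities.text == with_cdata.text)
```

Both forms hand the application the same string, so the choice between them is mostly about how readable the source file is.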


Validation

In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.

XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor which discovers a validity error must be able to report it, but may continue normal processing.

A DTD is an example of a schema or grammar. Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and the allowable parent/child relationships.

DTD:

The oldest schema language for XML is the Document Type Definition (DTD), inherited from SGML.

DTDs have the following benefits:

DTDs have the following limitations:

Two peculiar features that distinguish DTDs from other schema types are the syntactic support for embedding a DTD within XML documents and for defining entities, which are arbitrary fragments of text and/or markup that the XML processor inserts in the DTD itself and in the XML document wherever they are referenced, like character escapes.

DTD technology is still used in many applications because of its ubiquity.

Here is an example of XML using an external DTD for validation. It assumes that we can identify the DTD with the relative URI reference "example.dtd". The "people_list" after "!DOCTYPE" tells us that the root element of the document is called "people_list":

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
  <person>
    <name>Fred Bloggs</name>
    <birthdate>27/11/2008</birthdate>
    <gender>Male</gender>
  </person>
</people_list>

Here is the DTD, example.dtd:

<!ELEMENT people_list (person*)>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>

These declarations read as follows:

  1. people_list is a valid element name, and an instance of such an element contains any number of person elements. The * denotes that there can be zero or more person elements within the people_list element.
  2. person is a valid element name, and an instance of such an element contains one element named name, followed by one named birthdate (optional), then gender (also optional) and socialsecuritynumber (also optional). The ? indicates that an element is optional. The name element has no ?, so a person element must contain a name element.
  3. name is a valid element name, and an instance of such an element contains "parsed character data" (#PCDATA).
  4. birthdate is a valid element name, and an instance of such an element contains parsed character data.
  5. gender is a valid element name, and an instance of such an element contains parsed character data.
  6. socialsecuritynumber is a valid element name, and an instance of such an element contains parsed character data.
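
A content model such as (name, birthdate?, gender?, socialsecuritynumber?) behaves much like a regular expression over child element names. Python's standard library does not validate documents against a DTD, so the sketch below hand-translates just the person rule into a regular expression. The matches_person_model helper and the second, deliberately invalid person are hypothetical. The check covers only the order and presence of children, and it is no substitute for a validating parser.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of: (name, birthdate?, gender?, socialsecuritynumber?)
# Each child tag name is followed by a comma so names cannot run together.
PERSON_MODEL = re.compile(r"name,(birthdate,)?(gender,)?(socialsecuritynumber,)?")


def matches_person_model(person):
    """Check the order and presence of a <person> element's children."""
    child_names = "".join(child.tag + "," for child in person)
    return PERSON_MODEL.fullmatch(child_names) is not None


people = ET.fromstring(
    "<people_list>"
    "<person><name>Fred Bloggs</name><birthdate>27/11/2008</birthdate>"
    "<gender>Male</gender></person>"
    "<person><gender>Male</gender><name>No Name First</name></person>"
    "</people_list>"
)

for person in people.findall("person"):
    print(person.findtext("name"), "->", matches_person_model(person))
```

The first person follows the declared order and is accepted. The second puts gender before name, so it is rejected.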

Schema:

A newer schema language, described by the W3C as the successor of DTDs, is XML Schema, often referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system and allow for more detailed constraints on an XML document's logical structure. XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them.


3 Data Models

  1. Nested Elements (tags only).

    Nested elements are used to hold child elements or large blocks of text. Humans tend to code with (nested) elements.

    In the example below we have elements holding data and other elements referred to as child elements. The data for the <first_name> element is Robert. The element <contact> holds 4 other elements, <first_name>, <last_name>, <nick_name>, and <email_address>.


    <contact>
        <first_name>Robert</first_name>
        <last_name>Cormia</last_name>
        <nick_name>Carbon Bob</nick_name>
        <email_address>rdcormia@earthlink.net</email_address>
    </contact>
    

  2. Empty Elements (attributes)

    Elements may contain attributes which are name/value pairs. You have seen them in the <meta> and the <img> tags. Attributes add 'granularity' to the definition (or description) of data (see mixed elements below) and are usually written by machines. Notice that empty elements close themselves.

    <meta name="" value="" />
    
    <meta name="description" value="description of the document" />
    <meta name="keywords" value="keywords in the document" />
    <meta name="author" value="author of the document" />
    <meta name="copyright" value="copyright of the document" />
    <meta name="fears" value="spiders, snakes, insects" />
    <meta name="aptitude" value="interpersonal, instruction, counseling" />
    
    <img src="../../images/notes_xml.jpg" width="100" height="100" alt="XML" />
          

    When using empty elements, keep in mind the following drawbacks of storing data in attributes:

    • cannot contain multiple values (nested elements can)
    • not easily expandable (for future changes)
    • cannot describe structures (nested elements can)
    • more difficult to manipulate by program code
    • attribute values are not easy to test against a DTD

  3. Mixed Elements (tags and attributes)

    Elements may contain either attributes and text or attributes and other elements. Notice that mixed elements do NOT close themselves.

    <name language="English">Cat</name>
    <name language="Latin">Cattus</name>
    
    <weight units="pounds">150</weight>
    <weight units="kilograms">68.2</weight>      

There are examples of each of these models in the Example Files section.
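
From a program's point of view, the three models differ mainly in where the data is found. The sketch below reads one sample of each with Python's standard xml.etree.ElementTree module. The fragments are shortened copies of the samples above, and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

# Nested elements: the data lives in child elements.
contact = ET.fromstring(
    "<contact><first_name>Robert</first_name><last_name>Cormia</last_name></contact>"
)
print(contact.findtext("first_name"))

# Empty element: the data lives only in attributes, so there is no text.
meta = ET.fromstring('<meta name="keywords" value="keywords in the document" />')
print(meta.get("name"), meta.text)

# Mixed element: an attribute qualifies the text content.
weight = ET.fromstring('<weight units="kilograms">68.2</weight>')
print(float(weight.text), weight.get("units"))
```

Child elements are reached by name, attributes through get(), and an empty element reports its text as None.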


XML Document Deconstructed

The simple.xml file contains information about a person: their contact information, what they do, their education, and any general comments. Note, first, that this XML document starts with an XML declaration that states both the XML version and the character encoding used within the document. (The declaration is optional in XML 1.0, but including it is recommended.) In this case the version is 1.0 and the character encoding is UTF-8 (8-bit UCS/Unicode Transformation Format, a variable-length character encoding for Unicode). There is only one root node, and in this case it is named simple. Nested within it are the nodes contact, general, and comments. Each of these nodes has nested nodes within it as well. The following is an example of simple.xml with data.

  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <simple>
      <contact>
          <first_name>Jane</first_name>
          <last_name>Doe</last_name>
          <nick_name></nick_name>
          <email_address>janedoe@gmail.com</email_address>
      </contact>
      <general>
          <occupation>Web Designer</occupation>
          <education>BA Fine Arts</education>
          <goals>Start Web Design Company</goals>
      </general>
      <comments>
          <comment>Must take Dreamweaver</comment>
      </comments>
  </simple>
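
The nesting described above can also be inspected programmatically. The sketch below parses the same document with Python's standard xml.etree.ElementTree module and builds an indented outline of its nodes. The outline function is a hypothetical helper written for this example, and the document is embedded as a string rather than read from simple.xml.

```python
import xml.etree.ElementTree as ET

SIMPLE_XML = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<simple>
    <contact>
        <first_name>Jane</first_name>
        <last_name>Doe</last_name>
        <nick_name></nick_name>
        <email_address>janedoe@gmail.com</email_address>
    </contact>
    <general>
        <occupation>Web Designer</occupation>
        <education>BA Fine Arts</education>
        <goals>Start Web Design Company</goals>
    </general>
    <comments>
        <comment>Must take Dreamweaver</comment>
    </comments>
</simple>"""


def outline(element, depth=0):
    """Return one indented line per node: tag name plus any text it holds."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (": " + text if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# Parse from bytes so the encoding named in the declaration is honored.
root = ET.fromstring(SIMPLE_XML.encode("utf-8"))
print("\n".join(outline(root)))
```

The outline starts at the single root node, simple, and indents each level of nesting by two spaces.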

XML will not be displayed like HTML pages in the browser. The XML document will be displayed with color-coded root and child elements. A plus (+) or minus sign (-) to the left of the elements can be clicked to expand or collapse the element structure. To view the raw XML source (without the + and - signs), select "View Page Source" or "View Source" from the browser menu.

Note: In some browsers (Chrome, Opera, and Safari at the time of writing), only the element text will be displayed. To view the raw XML, right-click the page and select "View Source".


Example Files

