
XML, a formal recommendation from the World Wide Web Consortium (W3C), is similar to the language of today's Web pages, the Hypertext Markup Language (HTML). Both XML and HTML contain markup symbols to describe the contents of a page or file. HTML, however, describes the content of a Web page (mainly text and graphic images) only in terms of how it is to be displayed and interacted with. For example, the letter "p" placed within markup tags starts a new paragraph. XML describes the content in terms of what data is being described. For example, the word "phonenum" placed within markup tags could indicate that the data that follows is a phone number. This means that an XML file can be processed purely as data by a program, stored with similar data on another computer, or, like an HTML file, displayed. For example, depending on how the application on the receiving computer wants to handle the phone number, it could be stored, displayed, or dialed.
XML is "extensible" because, unlike HTML, the markup symbols are unlimited and self-defining. XML is actually a simpler and easier-to-use subset of the Standard Generalized Markup Language (SGML), the standard for how to create a document structure. It is expected that HTML and XML will be used together in many Web applications. XML markup, for example, may appear within an HTML page.
XML is a set of rules for encoding documents that can be read by machines and people. XML’s design goals emphasize simplicity, generality, and usability over the Internet. It is a textual data format, with strong support via Unicode for the languages of the world. Although XML’s design focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.
There are a variety of programming interfaces which software developers may use to access XML data, and several schema systems designed to aid in the definition of XML-based languages.
As of 2009, hundreds of XML-based languages have been developed, including:
XML-based formats have become the default for most office-productivity tools, including Microsoft Office (Office Open XML), OpenOffice.org (OpenDocument), and Apple's iWork.
XML is:
XML was not designed to be a replacement for HTML; the two serve different purposes. XML was designed to structure, store, and transport data, with a focus on what the data is, while HTML was designed to display data, with a focus on how the data looks. In short, HTML is about displaying information, while XML is about carrying information.
XML separates data from HTML. With XML, data can be stored in separate XML files. This way you can concentrate on using HTML for layout and display, and be sure that changes in the underlying data will not require any changes to the HTML. With a few lines of JavaScript, you can read an external XML file and update the data content of your HTML.
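The separation described above can be sketched in a few lines. The text mentions JavaScript in the browser; this sketch uses Python's standard library instead, purely for illustration. The `PHONEBOOK_XML` data, the `render_rows` helper, and the element names `phonebook`, `entry`, and `name` are assumptions made up for the example (only `phonenum` echoes the earlier text).

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data. In practice this would live in its own .xml file
# and be loaded with ET.parse(...) instead of being embedded here.
PHONEBOOK_XML = """<phonebook>
  <entry><name>Jane Doe</name><phonenum>555-0100</phonenum></entry>
  <entry><name>Fred Bloggs</name><phonenum>555-0101</phonenum></entry>
</phonebook>"""


def render_rows(xml_text: str) -> str:
    """Turn each <entry> element into one HTML table row."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("entry"):
        # Escape the values so the data cannot break the HTML layout.
        name = escape(entry.findtext("name", default=""))
        phone = escape(entry.findtext("phonenum", default=""))
        rows.append(f"<tr><td>{name}</td><td>{phone}</td></tr>")
    return "\n".join(rows)


html_rows = render_rows(PHONEBOOK_XML)
print(html_rows)
```

Changing the data only means editing the XML; the function that produces the HTML layout stays the same.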
Since XML is saved in text format, it allows data to be exchanged between incompatible systems and allows system upgrades to occur without data loss. With XML, your data can be available to all kinds of "reading machines" (handheld computers, voice machines, news feeds, etc.) and can be made more accessible to blind people or people with other disabilities.
XML Rules
XML custom tags themselves have no meaning and you should not think that you are creating custom HTML tags. It is important to understand that the sole purpose of these XML tags is to contain data. It is the value assigned to the tag, and not the tag, that is important.
An XML document that obeys the syntax rules is said to be well-formed. If the rules are not obeyed, you will get error messages. Fortunately, these rules are very simple.
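Every XML document has exactly one root element (here <address_book>), and elements must be properly nested, so each child element closes before its parent does:
<?xml version="1.0" encoding="UTF-8"?>
<address_book>
  <record>
    <name>
      <first_name>first name goes here</first_name>
      <middle_name>middle name goes here</middle_name>
      <last_name>last name goes here</last_name>
      <nick_name>nick name goes here</nick_name>
    </name>
  </record>
</address_book>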
Every element must be closed, either with a matching end tag or as a self-closing empty element:
<first_name>first name goes here</first_name> (nested model)
or:
<employer id="1" /> (empty model)
Attribute values must always be quoted:
<record id="1">
<!-- You can comment XML code in this manner -->
These entity references are derived from SGML, hence their appearance in HTML.
&lt;    <    less than
&gt;    >    greater than
&amp;   &    ampersand
&apos;  '    apostrophe
&quot;  "    quotation mark
<?xml version="1.0" encoding="UTF-8"?>
<quiz>
<question type="multiple" number="1">
Which is (are) the root element of an HTML document?
<answers>
<answer choice="a">&lt;HTML&gt;</answer>
<answer choice="b">&lt;/HTML&gt;</answer>
<answer choice="c">&lt;HTML&gt; &amp; &lt;body&gt;</answer>
<answer choice="d" correct="true">&lt;HTML&gt; &amp; &lt;/HTML&gt;</answer>
</answers>
</question>
</quiz>
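A parser enforces the well-formedness rules for you. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the `is_well_formed` helper and the sample strings are made up for illustration and are not part of the lesson files.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    # Quoted attribute, every element closed and properly nested.
    "good": '<record id="1"><first_name>Jane</first_name></record>',
    # Attribute value without quotes.
    "unquoted": '<record id=1><first_name>Jane</first_name></record>',
    # <first_name> is never closed.
    "unclosed": '<record id="1"><first_name>Jane</record>',
    # A literal < in the text is read as the start of a new tag.
    "raw_markup": '<answer choice="a"><HTML></answer>',
    # The same text written with entity references.
    "escaped": '<answer choice="a">&lt;HTML&gt;</answer>',
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'error'}")

# The parser turns the entity references back into the original characters.
decoded = ET.fromstring(samples["escaped"]).text
print(decoded)
```

Only the "good" and "escaped" samples parse; the other three raise a parse error, which is the kind of error message the rules above are meant to prevent.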
XML Tag Data
So what can be placed within a tag? Here we come across PCDATA and CDATA. By default, any text placed within tags is Parsed Character DATA (PCDATA), which means the XML parser scans it for markup and entity references. In contrast, Character DATA (CDATA) is text that the parser passes through without interpreting it as markup.
As you may remember, in the entity example we had to use &lt; and &gt; to encode the text <HTML> so that the parser would replace them with < and >. A CDATA section is not parsed for markup, so there is no need to use entity references inside it. If you look closely, you will also notice some white space between HTML and BODY in choice "c". White space inside a CDATA section is passed through exactly as written.
<question type="multiple" number="2">
Which is not a form of living matter?
<answers>
<answer choice="a"><![CDATA[animal]]></answer>
<answer choice="b"><![CDATA[plant]]></answer>
<answer choice="c"><![CDATA[bacteria]]></answer>
<answer choice="d" correct="true"><![CDATA[minerals]]></answer>
</answers>
</question>
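To see that entity references and a CDATA section are two spellings of the same text, the sketch below parses one answer written both ways with Python's standard `xml.etree.ElementTree` module. The two sample strings are assumptions built from the quiz's choice "c"; they are not taken from the lesson files.

```python
import xml.etree.ElementTree as ET

# Choice "c" written with entity references (PCDATA).
escaped = '<answer choice="c">&lt;HTML&gt; &amp; &lt;body&gt;</answer>'

# The same answer written as a CDATA section: no escaping needed.
cdata = '<answer choice="c"><![CDATA[<HTML> & <body>]]></answer>'

escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text)
print(cdata_text)
print(escaped_text == cdata_text)
```

An application reading the parsed document cannot tell which form the author used; CDATA is only a convenience for writing text that contains many markup characters.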
Validation
In addition to being well-formed, an XML document may be valid. This means that it contains a reference to a Document Type Definition (DTD), and that its elements and attributes are declared in that DTD and follow the grammatical rules for them that the DTD specifies.
XML processors are classified as validating or non-validating depending on whether or not they check XML documents for validity. A processor which discovers a validity error must be able to report it, but may continue normal processing.
A DTD is an example of a schema or grammar. Since the initial publication of XML 1.0, there has been substantial work in the area of schema languages for XML. Such schema languages typically constrain the set of elements that may be used in a document, which attributes may be applied to them, the order in which they may appear, and the allowable parent/child relationships.
DTD:
The oldest schema language for XML is the Document Type Definition (DTD), inherited from SGML.
DTDs have the following benefits:
DTDs have the following limitations:
Two peculiar features that distinguish DTDs from other schema types are the syntactic support for embedding a DTD within XML documents and for defining entities, which are arbitrary fragments of text and/or markup that the XML processor inserts in the DTD itself and in the XML document wherever they are referenced, like character escapes.
DTD technology is still used in many applications because of its ubiquity.
Here is an example of XML using an external DTD for validation. It assumes that we can identify the DTD with the relative URI reference "example.dtd". The "people_list" after "!DOCTYPE" tells us that the root element of the document is called "people_list":
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE people_list SYSTEM "example.dtd">
<people_list>
<person>
<name>Fred Bloggs</name>
<birthdate>27/11/2008</birthdate>
<gender>Male</gender>
</person>
</people_list>
Here is the DTD, example.dtd:
<!ELEMENT people_list (person*)>
<!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT birthdate (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT socialsecuritynumber (#PCDATA)>
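A DTD content model is essentially a regular expression over the names of an element's children. Python's standard library does not validate against DTDs, so the sketch below is not a DTD validator; it hand-translates the two structural declarations from example.dtd into regular expressions to show what a validating processor checks. `CONTENT_MODELS` and `check_content` are hypothetical names introduced for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# The two structural declarations from example.dtd, rewritten by hand as
# regular expressions over the space-separated names of the child elements.
CONTENT_MODELS = {
    # <!ELEMENT people_list (person*)>
    "people_list": re.compile(r"(?:person(?: person)*)?"),
    # <!ELEMENT person (name, birthdate?, gender?, socialsecuritynumber?)>
    "person": re.compile(
        r"name(?: birthdate)?(?: gender)?(?: socialsecuritynumber)?"
    ),
}


def check_content(root):
    """Return one message per element whose children break its content model."""
    problems = []
    for node in root.iter():
        model = CONTENT_MODELS.get(node.tag)
        if model is None:
            continue  # (#PCDATA) elements are not checked in this sketch
        child_names = " ".join(child.tag for child in node)
        if not model.fullmatch(child_names):
            problems.append(f"<{node.tag}>: unexpected children [{child_names}]")
    return problems


valid_doc = ET.fromstring("""<people_list>
  <person>
    <name>Fred Bloggs</name>
    <birthdate>27/11/2008</birthdate>
    <gender>Male</gender>
  </person>
</people_list>""")

# Well-formed, but <gender> comes before <name>, which the DTD does not allow.
invalid_doc = ET.fromstring("""<people_list>
  <person>
    <gender>Male</gender>
    <name>Fred Bloggs</name>
  </person>
</people_list>""")

valid_problems = check_content(valid_doc)
invalid_problems = check_content(invalid_doc)
print(valid_problems)
print(invalid_problems)
```

The second document illustrates the difference between the two levels of correctness: it parses without error, so it is well-formed, yet it would fail validation against the DTD.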
Schema:
A newer schema language, described by the W3C as the successor of DTDs, is XML Schema, often referred to by the initialism for XML Schema instances, XSD (XML Schema Definition). XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system and allow for more detailed constraints on an XML document's logical structure. XSDs also use an XML-based format, which makes it possible to use ordinary XML tools to help process them.
3 Data Models
Nested elements are used to hold child elements or large blocks of text. Humans tend to code with (nested) elements.
In the example below we have elements holding data and other elements referred to as child elements. The data for the <first_name> element is Robert. The element <contact> holds 4 other elements, <first_name>, <last_name>, <nick_name>, and <email_address>.
<contact>
<first_name>Robert</first_name>
<last_name>Cormia</last_name>
<nick_name>Carbon Bob</nick_name>
<email_address>rdcormia@earthlink.net</email_address>
</contact>
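Reading data back out of nested elements is a matter of walking from parent to child. A minimal sketch with Python's standard `xml.etree.ElementTree` module, using the contact record above as input:

```python
import xml.etree.ElementTree as ET

contact = ET.fromstring("""<contact>
  <first_name>Robert</first_name>
  <last_name>Cormia</last_name>
  <nick_name>Carbon Bob</nick_name>
  <email_address>rdcormia@earthlink.net</email_address>
</contact>""")

# Iterating over an element yields its direct children in document order.
child_names = [child.tag for child in contact]

# findtext() returns the text content of the first matching child.
first_name = contact.findtext("first_name")
full_name = f"{first_name} {contact.findtext('last_name')}"

print(child_names)
print(full_name)
```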
Elements may contain attributes which are name/value pairs. You have seen them in the <meta> and the <img> tags. Attributes add 'granularity' to the definition (or description) of data (see mixed elements below) and are usually written by machines. Notice that empty elements close themselves.
<meta name="" value="" />
<meta name="description" value="description of the document" />
<meta name="keywords" value="keywords in the document" />
<meta name="author" value="author of the document" />
<meta name="copyright" value="copyright of the document" />
<meta name="fears" value="spiders, snakes, insects" />
<meta name="aptitude" value="interpersonal, instruction, counseling" />
<img src="../../images/notes_xml.jpg" width="100" height="100" alt="XML" />
When using empty elements here is a list of some of the drawbacks to keep in mind:
Elements may contain either attributes and text or attributes and other elements. Notice that mixed elements do NOT close themselves.
<name language="English">Cat</name>
<name language="Latin">Cattus</name>
<weight units="pounds">150</weight>
<weight units="kilograms">68.2</weight>
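From a program's point of view, attributes and element text are read through different accessors. The sketch below uses Python's standard `xml.etree.ElementTree` module. The `<animal>` wrapper element is an assumption added only so the fragments form a single well-formed document.

```python
import xml.etree.ElementTree as ET

# <animal> is a hypothetical root element wrapping the fragments.
animal = ET.fromstring("""<animal>
  <name language="English">Cat</name>
  <name language="Latin">Cattus</name>
  <weight units="pounds">150</weight>
  <weight units="kilograms">68.2</weight>
  <img src="../../images/notes_xml.jpg" width="100" height="100" alt="XML" />
</animal>""")

# Mixed elements: the attribute qualifies the text the element holds.
names_by_language = {n.get("language"): n.text for n in animal.findall("name")}
weights_by_unit = {w.get("units"): float(w.text) for w in animal.findall("weight")}

# Empty element: all of its data lives in attributes, and it has no text.
img = animal.find("img")
img_attributes = dict(img.attrib)
img_has_text = img.text is not None

print(names_by_language)
print(weights_by_unit)
print(img_attributes["alt"], img_has_text)
```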
There are examples of each of these models in the Example Files section.
XML Document Deconstructed
The simple.xml file contains information about a person: their contact information, what they do, their education, and any other general comments. Note, first, that this XML document starts with an XML declaration that defines both the XML version and the character encoding used within the document. (The declaration is optional in XML 1.0, but including it is good practice.) In this case the version is 1.0 and the character encoding is set to UTF-8 (8-bit UCS/Unicode Transformation Format, a variable-length character encoding for Unicode). There is only one root node, and in this case it is named simple. Nested within it are the nodes contact, general, and comments. Each of these nodes has nested nodes of its own. The following is an example of simple.xml with data.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<simple>
<contact>
<first_name>Jane</first_name>
<last_name>Doe</last_name>
<nick_name></nick_name>
<email_address>janedoe@gmail.com</email_address>
</contact>
<general>
<occupation>Web Designer</occupation>
<education>BA Fine Arts</education>
<goals>Start Web Design Company</goals>
</general>
<comments>
<comment>Must take Dreamweaver</comment>
</comments>
</simple>
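The structure just described (one root, three nested nodes, each with children of its own) can be made visible by walking the tree. This is a minimal sketch with Python's standard `xml.etree.ElementTree` module; the `outline` helper is a made-up name, and the document is embedded as a byte string here rather than read from the simple.xml file.

```python
import xml.etree.ElementTree as ET

SIMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<simple>
  <contact>
    <first_name>Jane</first_name>
    <last_name>Doe</last_name>
    <nick_name></nick_name>
    <email_address>janedoe@gmail.com</email_address>
  </contact>
  <general>
    <occupation>Web Designer</occupation>
    <education>BA Fine Arts</education>
    <goals>Start Web Design Company</goals>
  </general>
  <comments>
    <comment>Must take Dreamweaver</comment>
  </comments>
</simple>"""


def outline(element, depth=0):
    """Return one indented line per element: its name and any text it holds."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag + (f": {text}" if text else "")
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SIMPLE_XML)
lines = outline(root)
print("\n".join(lines))
```

The indentation in the printed outline mirrors the nesting depth of each node, with the empty nick_name element showing up as a name without a value.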
XML will not be displayed like HTML pages in the browser. The XML document will be displayed with color-coded root and child elements. A plus (+) or minus sign (-) to the left of the elements can be clicked to expand or collapse the element structure. To view the raw XML source (without the + and - signs), select "View Page Source" or "View Source" from the browser menu.
Note: In Chrome, Opera, and Safari, only the element text will be displayed. To view the raw XML, you must right click the page and select "View Source".