Tim Jansen's blog


2004/08/04
EXML (Eek's XML)
I am talking a lot about XML support in Eek, but I don't actually mean XML as defined by the W3C. I am rather talking about a simplified data model that is (mostly) compatible with XML, based on the 10 Things I Hate About XML entry from last December. So in order to prevent unnecessary confusion, I decided to call it EXML (Eek's XML).

EXML Data Model

EXML is mainly a data model. Its purpose is similar to XML Infosets or the XPath 2.0 Data Model. The data model uses a class hierarchy to describe the entities. Everything in EXML is a Node, but EXML has only two types of nodes: Elements and Values.

Elements are like elements in XML: they have a name and can contain a sequence of other nodes as children. Additionally, an Element may have an unlimited number of unordered attributes, which are named values. Attributes are not Nodes themselves, only their Values are. The names of elements and attributes are QNames, so they have a local name and an optional URI.

Value is an abstract class. Every Value is typed, using subclasses like String, Int, Float, Date, Blob, QName and so on. In the Eek API, the Value subclasses are Eek's core types, so the int object that you get when you write "int i = 0" inherits from Value and can be put directly in an EXML tree.

Only Elements know their parent. It is not possible to find out the parent of a Value.

EXML has no separate document node type. A document is a tree of Nodes, which must have exactly one root element. The root is simply an Element that has no parent.

EXML Serialization Format

Serializing EXML is easy: just return it as XML, UTF-8 encoded and without a prolog (yes, that's legal XML). Values are printed in their canonical representation, as defined by XML Schema Part 2.

Reading it is a little bit more difficult. There are two problems.

The first is whitespace. In XML documents, whitespace is frequently used to improve readability, especially for indentation.
However, most of that whitespace is not significant at all, and it makes good XML applications more complicated and bad ones less reliable. So I feel that a solution is needed, and it goes like this: if an element contains only a value, the value's whitespace is significant. Otherwise, if the element contains mixed content (values and elements), all pure whitespace values are ignored, unless the 'xml:space' attribute is set to 'preserve'. 'xml:space' is part of the regular XML spec, but there it only serves as a hint. EXML makes honoring it mandatory.

The second problem with reading is the representation of values. Unfortunately, for whatever reason, XML Schema Part 2 defines a single canonical representation for every type, but requires applications to read a variety of formats. For example, the canonical representation of the number 5 as an integer is '5', but with XML Schema the app must also read '+0005'. The EXML parser can't know whether the document's author just wanted to use a really complicated way to write 5, or whether the strange notation needs to be kept. So EXML considers all XML text to be Strings, unless it is a canonical representation of at least one other type. In that case it takes the 'preferred' type. For example, the preferred type for the number 5 is 'int', and not 'long'. At a later time EXML may get an advanced parser that uses a schema to convert text to the right type, but until then the user can only rely on having the right Value type when the source is an original EXML document with canonical representations for Values.

Removed Features

EXML removes the following features from XML:
  • The prolog at the beginning. There is only one encoding, and that is UTF-8. (If really needed, UTF-16 could be added, but that can be detected at the UTF level)
  • DTDs and <!DOCTYPE> and all that crap that has long been replaced by schemas
  • Entities, except the built-ins (&lt;, &amp;, &quot; and &#number;)
  • Processing instructions
  • CData sections
  • xml:lang
  • Comments in the data model (they are still allowed in the serialization format, where they are simply ignored)
I don't think that anybody who's using modern XML infrastructure would miss any of them. And they are what makes XML so complicated. In case someone needs them for obscure reasons, there should be an option that allows representing old stuff like processing instructions as elements in a special namespace. This way all the original XML features can be accessed while keeping the data model simple. But this would be completely optional, because it can break schemas and applications.

XML Compatibility

All, or at least most, modern XML standards can still be used with EXML. I hope to be able to use it with SOAP and the other WS standards, XPath, XSLT, XQuery, RelaxNG and (if I can't avoid it) XML Schema. In some cases limitations may be needed. For example, XPath will need minor modifications, because not all EXML nodes can retrieve their parent node: EXML Values cannot. But in general this should not be a problem.
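
To make the data model section a bit more concrete, here is a minimal sketch of it in Python (Eek itself does not exist yet, so this is only an approximation). The class and attribute names, the example namespace URI, and the mapping of Values to plain Python types are all assumptions made for illustration, not part of any EXML specification.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional, Union


@dataclass(frozen=True)
class QName:
    """A qualified name: a local name plus an optional namespace URI."""
    local: str
    uri: Optional[str] = None


# Assumption: Values are plain Python core types, mirroring the idea that
# Eek's core types can be put directly into an EXML tree.
Value = Union[str, int, float, date, bytes, QName]


class Element:
    """An Element has a name, unordered attributes and ordered children."""

    def __init__(self, name: QName) -> None:
        self.name = name
        self.parent: Optional[Element] = None      # only Elements know their parent
        self.attributes: dict[QName, Value] = {}   # attributes are not nodes
        self.children: list[Union[Element, Value]] = []

    def append(self, child: Union[Element, Value]) -> None:
        """Add a child node; only Element children get a parent link."""
        if isinstance(child, Element):
            if child.parent is not None:
                raise ValueError("element already has a parent")
            child.parent = self
        self.children.append(child)

    @property
    def is_root(self) -> bool:
        """The root is simply an Element that has no parent."""
        return self.parent is None


# A tiny document: one root element with one child element.
doc = Element(QName("order", "http://example.org/shop"))  # hypothetical URI
item = Element(QName("item"))
item.attributes[QName("quantity")] = 3   # a typed attribute value
item.append("Widget")                    # a String value as child
doc.append(item)
```

Note that the value "Widget" has no way to find `item`, while `item.parent` leads back to `doc`. That is the asymmetry that would force the small XPath changes mentioned above.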

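
The whitespace rule from the serialization section can also be sketched in a few lines. This version uses Python's standard `xml.etree.ElementTree` instead of a real EXML parser, and the function name is hypothetical. It also assumes that 'xml:space' is inherited by descendant elements, as in regular XML; the rule as described above does not say so explicitly.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined 'xml' prefix to its namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
XML_WHITESPACE = " \t\r\n"


def strip_insignificant_whitespace(elem, preserve=False):
    """Drop whitespace-only text between child elements, in place."""
    # Assumption: an xml:space setting applies to descendants until overridden.
    setting = elem.get(XML_SPACE)
    if setting is not None:
        preserve = (setting == "preserve")

    children = list(elem)
    if children and not preserve:
        # The element has child elements: pure whitespace values are ignored.
        if elem.text is not None and not elem.text.strip(XML_WHITESPACE):
            elem.text = None
        for child in children:
            if child.tail is not None and not child.tail.strip(XML_WHITESPACE):
                child.tail = None
    # An element without child elements holds only a value, so its
    # whitespace is significant and left untouched.

    for child in children:
        strip_insignificant_whitespace(child, preserve)
    return elem


doc = ET.fromstring(
    "<a>\n  <b>  padded  </b>\n  <c xml:space='preserve'> <d/> </c>\n</a>"
)
strip_insignificant_whitespace(doc)
```

After the call, the indentation inside `a` is gone, the padding inside `b` survives because `b` contains only a value, and the spaces inside `c` survive because of 'xml:space'.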

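
The second reading rule (text is a String unless it is the canonical representation of another type) might look roughly like this. The sketch only handles integers; the function name, the choice of 32-bit and 64-bit ranges for 'int' and 'long', and the treatment of '-0' as non-canonical are assumptions of mine, not something defined above.

```python
import re

# Canonical integers: no '+' sign and no leading zeros.
# Assumption: '-0' is treated as non-canonical.
_CANONICAL_INT = re.compile(r"0|-?[1-9][0-9]*")

INT_RANGE = range(-2**31, 2**31)    # assumed range of 'int'
LONG_RANGE = range(-2**63, 2**63)   # assumed range of 'long'


def infer_value(text):
    """Return a (type name, value) pair for a piece of XML text."""
    if _CANONICAL_INT.fullmatch(text):
        number = int(text)
        if number in INT_RANGE:
            return "int", number    # preferred over 'long' when it fits
        if number in LONG_RANGE:
            return "long", number
    # Anything else, including non-canonical numbers, stays a String.
    return "string", text


examples = [infer_value(t) for t in ("5", "+0005", "3000000000")]
```

With this rule '+0005' stays a String, so writing the document back out reproduces the author's notation instead of silently turning it into '5'.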
 

This blog is my dumping ground for thoughts and ideas about Eek. Someday Eek will be a programming language and system, somewhat comparable to Java in scope. It is my attempt to bring sanity to the world of computing.
At least I hope so. Right now it is far from being finished and I can't guarantee that it ever will be. I am still working on the specification, but I won't release anything before I have my first prototype running. The world does not need more vapourware and unusable beta software. All publicly available information about Eek is contained in this blog. You can find the latest summary here.
This work is licensed under a Creative Commons License.