Tim Jansen's blog


2004/08/27
'yield', generators and coroutines
One of Eek's planned features is coroutines. I recently discovered this blog entry that shows the use of the 'yield' statement in C#, and it is quite unusual. C#'s 'yield' statement seems to be limited to implementing functions that return the IEnumerable type (C#'s iterator interface). This makes sense in many cases, as every 'yielding' method returns a sequence of values, and the number of values may be limited. But it makes 'yield' less useful for other purposes, such as using it just to move variables from the class into the method. Microsoft's implementation is also quite unusual and uses a state machine; more can be found here.
The following class shows a co-routine in Eek, the way I want to implement them. The class has a method that adds its argument to the sum of the previous arguments and returns the result. 'yield' allows implementing it without any member properties:
class YieldTest
        int accumulate(int a)
                int sum = 0
                while true
                        sum += a
                        yield sum
end
This is the 'yield' that I want for Eek. It works exactly like 'return' and returns its argument, except that the next invocation on the same instance continues with the statement after 'yield' instead of restarting the method. In this case the 'while' loop continues with its next iteration after yielding the value. The method is a co-routine only because it contains 'yield'. There is no special declaration, as co-routines are just a special case of regular subroutines, aka methods.
The coroutine ends when the method returns using the regular 'return' statement. This 'return' must always return a value, like a regular 'return'. If the method does not have any return values (or the return values have default values), it is also possible to end the method by executing the last statement. Once a coroutine has ended, it must not be invoked again. Any attempt will throw an exception.
The state of the coroutine is stored in the instance if the coroutine is an instance method, in the class if the coroutine is a static method, and in the delegate if the coroutine is a closure. So you can have several 'accumulate()' methods running simultaneously, as long as they run on different instances.
Co-routines need, as far as I can see, two limitations. The first one is that a 'yield' must not be allowed in a block with a 'finally' section or in a 'using' statement. These constructs are supposed to guarantee a cleanup, but with a 'yield' inside them there is no way to guarantee it. The cleanup can't be postponed until the co-routine continues after the 'yield', because there is no guarantee that the co-routine is ever called again. And cleaning up after every 'yield' does not make any sense.
The second limitation is that recursion must not be allowed. Recursion in a co-routine would be quite a mess and I would not know how to implement it. Unfortunately, this restriction means that a co-routine must maintain a flag to prevent recursion. This flag needs to be checked when the co-routine is entered to find out whether it is already running. If yes, an exception is thrown. Otherwise the flag is set, the co-routine is executed and then the flag is cleared. This solution is easy, but it has a performance impact that makes co-routines slower than regular methods. In some cases it may be possible for the compiler to eliminate the check, though.
Finally, there is one open point remaining: multi-threading. I am not sure what should happen when two threads try to enter a co-routine simultaneously. Because it may mess up the heap, it's probably necessary to prevent this from happening. Unfortunately this requires every co-routine to be locked with a mutex, making it even slower.
To get back to the C# implementation, it has one advantage: it makes it easy to implement an IEnumerable, which is probably a very common case. But I expect that with some simple magic and the help of closures, Eek will still be able to return iterators quite easily. This is what an iterator-returning method could look like:
class YieldGeneratorTest
        Iterator countTo5()
                // Generator is an Iterator implementation that takes a coroutine closure.
                return Generator({
                        yield 1
                        yield 2
                        yield 3
                        yield 4
                        return 5
                })
end
I think that example is acceptable, and still cleaner than implementing the Iterator directly like C# does.



2004/08/04
EXML (Eek's XML)
I am talking a lot about XML support in Eek, but I don't actually mean XML as defined by the W3C. I am rather talking about a simplified data model that is (mostly) compatible with XML, based on the 10 Things I Hate About XML entry from last December. So in order to prevent unnecessary confusion, I decided to call it EXML (Eek's XML).
EXML Data Model
EXML is mainly a data model. Its purpose is similar to XML Infosets or the XPath 2.0 Data Model. The data model uses a class hierarchy to describe the entities. Everything in EXML is a Node. But EXML has only two types of nodes: Elements and Values. Elements are like Elements in XML: they have a name and can contain a sequence of other nodes as children. Additionally, an Element may have an unlimited number of unordered attributes, which are named values. Attributes are not Nodes themselves, only their Values are. The names of elements and attributes are QNames, thus they have a local name and an optional URI.
Value is an abstract class. Each Value is typed and uses sub-classes like String, Int, Float, Date, Blob, QName and so on. In the Eek API the Value subclasses are Eek's core types, so the int object that you get when you write "int i = 0" inherits from Value and can be put directly in an EXML tree. Only Elements know their parent. It is not possible to find out the parent of a Value.
EXML does not know a document type. A document is a tree of Nodes, which must have exactly one root element. The root is simply an Element that has no parent.
EXML Serialization Format
Serializing EXML is easy: just return it as XML, UTF-8 encoded and without prolog (yes, that's legal XML). Values are printed in their canonical representation, as defined by XML Schema Part 2. Reading it is a little bit more difficult. There are two problems.
The first is whitespace. In XML documents whitespace is frequently used to increase readability, especially for indenting. However, most of that whitespace is not significant at all, makes good XML applications more complicated and bad ones less reliable. So I feel that a solution is needed, and it goes like this: if an element contains only a value, the value's whitespace is significant. Otherwise, if the element contains mixed content (values and elements), all pure whitespace values are ignored, unless the 'xml:space' attribute is set to 'preserve'. 'xml:space' is part of the regular XML spec, but there it only serves as a hint. EXML makes its use mandatory.
The second problem with reading is the representation of values. Unfortunately, for whatever reason, XML Schema Part 2 defines a single canonical representation for every type, but requires applications to read a variety of formats. For example the canonical representation of the number 5 as integer is '5', but with XML Schema the app must also read '+0005'. The EXML parser can't know whether the document's author just wanted to use a really complicated way to write 5, or whether the strange notation needs to be kept. So EXML considers all XML text to be Strings, unless it is a canonical representation of at least one other type. Then it takes the 'preferred' type. For example the preferred type for the number 5 is 'int', and not 'long'. At a later time EXML may get an advanced parser that uses a schema to convert text to the right type, but until then the user can only rely on having the right Value type when the source is an original EXML document with canonical representation for Values.
Removed Features
EXML removes the following features from XML:
  • The prolog at the beginning. There is only one encoding, and that is UTF-8. (If really needed, UTF-16 could be added, but that can be detected at the UTF level)
  • DTDs and <!DOCTYPE> and all that crap that has long been replaced by schemas
  • Entities, except the built-ins (&lt;, &amp;, &quot; and &#number;)
  • Processing instructions
  • CData sections
  • xml:lang
  • Comments in the data model (but not in the serialization format, where they are simply ignored)
I don't think that anybody who's using modern XML infrastructure would miss any of them. And they are what makes XML so complicated. In case someone needs them for obscure reasons, there should be an option that allows representing old stuff like processing instructions as elements in a special namespace. This way all the original XML features can be accessed while keeping the data model simple. But this would be completely optional, because it can break schemas and applications.
XML Compatibility
All or at least most modern XML standards can still be used with EXML. I hope to be able to use it with SOAP and the other WS standards, XPath, XSLT, XQuery, RelaxNG and (if I can't avoid it) XML Schema. In some cases limitations may be needed. For example XPath will need minor modifications, because not all EXML nodes can retrieve their parent node: EXML Values cannot. But in general this should not be a problem.
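To make the data model and the two reading rules a bit more tangible, here is a small Python sketch (Python only because Eek is not available). Everything in it is an assumption made for illustration: the class layout, the function names infer_value and significant_children, the use of plain str and int as stand-ins for the Value subclasses, the simplified integer pattern, and the decision to look at 'xml:space' only on the element itself.

```python
import re
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class QName:
    local: str
    uri: Optional[str] = None    # the namespace URI is optional


# Stand-in for the Value subclasses (String, Int, ...); only two are modelled.
Value = Union[str, int]

XML_SPACE = QName("space", "http://www.w3.org/XML/1998/namespace")


@dataclass
class Element:
    name: QName
    attributes: dict = field(default_factory=dict)   # QName -> Value, unordered
    children: list = field(default_factory=list)     # ordered Elements and Values
    parent: Optional["Element"] = field(default=None, repr=False, compare=False)

    def append(self, node):
        """Add a child node; only Elements record their parent."""
        if isinstance(node, Element):
            node.parent = self
        self.children.append(node)
        return node


# Simplified canonical integer form: no '+' sign and no leading zeros.
_CANONICAL_INT = re.compile(r"0|-?[1-9][0-9]*")


def infer_value(text: str) -> Value:
    """Text stays a string unless it is the canonical form of another type."""
    if _CANONICAL_INT.fullmatch(text):
        return int(text)
    return text


def significant_children(element: Element) -> list:
    """Drop pure-whitespace values from mixed content unless xml:space is 'preserve'."""
    has_child_elements = any(isinstance(c, Element) for c in element.children)
    preserve = element.attributes.get(XML_SPACE) == "preserve"
    if not has_child_elements or preserve:
        return list(element.children)
    return [c for c in element.children
            if not (isinstance(c, str) and c.strip(" \t\r\n") == "")]


# Roughly what a parser might build for "<order>\n  <item>5</item>\n</order>".
order = Element(QName("order"))
order.append("\n  ")
item = order.append(Element(QName("item")))
item.append(infer_value("5"))
order.append("\n")
```

With this tree, significant_children(order) contains only the item element, while infer_value("+0005") stays the string '+0005' because it is not a canonical integer.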



 

This blog is my dumping ground for thoughts and ideas about Eek. Someday Eek will be a programming language and system, somewhat comparable to Java in scope. It is my attempt to bring sanity to the world of computing.
At least I hope so. Right now it is far from being finished and I can't guarantee that it ever will be. I am still working on the specification, but I won't release anything before I got my first prototype running. The world does not need more vapourware and unusable beta-software. All publicly available information about Eek is contained in this blog. You can find the latest summary here.
This work is licensed under a Creative Commons License.