Serialisation – Comparing XML, SDL, TOML, JSON

Script files are an important asset for a lot of games.

There are a lot of different uses for script files. In general, we use the term to describe two different kinds of file: those containing scripts – as in code – and those containing data.

In this post we will look at the second kind. Specifically we will look at a number of representations of actual objects we might encounter in our game environment.

We will further only look at human readable formats. Binary formats certainly have their place, but for most purposes having files that can be opened and changed in any text editor has huge advantages that I do not want to miss.

I will introduce four formats, chosen somewhat arbitrarily. The goal is to see some of the different approaches people take, and to explore the advantages and disadvantages of each using an example.

While I will give a recommendation in the end, there is no one right format for all purposes. One should always look at the specific requirements of a given problem, and choose the technology most applicable.

XML

The first format I want to look at is XML.

XML is old, well known, and many (de)serialisation libraries exist for almost any programming language.

Example

Before we look at the actual XML, let us first set up an example.

Below you see a small class serving as a template for units in, for example, a strategy game. It has simple properties, a list of string identifiers, and a list of parametrised objects.

While I would never write this code in a production setting, it serves as a good representation for the kind of data we will try to express with our script files.
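As a rough illustration of the shape of such a class, here is a minimal sketch in Python rather than the C# used for the game itself. The names `UnitTemplate`, `Ability`, and all field names and values are assumptions made up for this example, not the original listing.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Ability:
    """A parametrised object attached to a unit (hypothetical)."""
    type_name: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class UnitTemplate:
    """Simple properties, a list of string identifiers, a list of objects."""
    name: str
    health: int
    speed: float
    tags: List[str] = field(default_factory=list)
    abilities: List[Ability] = field(default_factory=list)


# One concrete template, built directly in code.
knight = UnitTemplate(
    name="knight",
    health=100,
    speed=2.5,
    tags=["melee", "armoured"],
    abilities=[
        Ability("charge", {"damage": 30, "cooldown": 8.0}),
        Ability("block", {"chance": 0.25}),
    ],
)
```

The point of the script file is to move the values in the last statement out of the source code and into a text file.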

If we represent the data of this object in an XML file, it might look something like this:
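A hypothetical sketch of a tag-only XML representation, embedded in a short Python snippet so it can be parsed with the standard library. The element names and values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Every piece of data gets its own element; no attributes are used.
VERBOSE_XML = """
<unit>
  <name>knight</name>
  <health>100</health>
  <speed>2.5</speed>
  <tags>
    <tag>melee</tag>
    <tag>armoured</tag>
  </tags>
  <abilities>
    <ability>
      <type>charge</type>
      <damage>30</damage>
      <cooldown>8.0</cooldown>
    </ability>
    <ability>
      <type>block</type>
      <chance>0.25</chance>
    </ability>
  </abilities>
</unit>
"""

root = ET.fromstring(VERBOSE_XML.strip())

name = root.findtext("name")
# XML content is always text, so numbers have to be converted by hand.
health = int(root.findtext("health"))
tags = [tag.text for tag in root.find("tags")]
ability_types = [a.findtext("type") for a in root.iter("ability")]
```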

Overall, this is certainly not terrible. It is very clear and the meaning of the different tags, attributes and their content is unambiguous.

It is however quite verbose. Most of the file is taken up by tags, instead of our data.

We can improve on this to some degree by making better use of attributes and closed tags as follows.
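A hypothetical sketch of what an attribute-based version could look like, again wrapped in Python so it can be checked with the standard library parser. The element and attribute names are assumptions.

```python
import xml.etree.ElementTree as ET

# Simple values become attributes, and elements without content are
# written as self-closing tags.
COMPACT_XML = """
<unit name="knight" health="100" speed="2.5">
  <tag name="melee" />
  <tag name="armoured" />
  <ability type="charge" damage="30" cooldown="8.0" />
  <ability type="block" chance="0.25" />
</unit>
"""

root = ET.fromstring(COMPACT_XML.strip())

health = int(root.get("health"))  # attribute values are strings as well
tags = [tag.get("name") for tag in root.findall("tag")]
abilities = [dict(a.attrib) for a in root.findall("ability")]
```

Note that the reading code changes from looking up child elements to looking up attributes, even though the data itself is the same.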

This is a much more concise solution. However, we still have a lot of tags that seem somewhat redundant.

Here is an outline of the semantic data we want to represent, with as little formatting and as few syntactic necessities as possible:
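One way to produce such an outline is to print only names and values, using indentation for nesting. The following is a small sketch; the helper `outline` and the sample data are assumptions made up for this example.

```python
def outline(value, indent=0):
    """Return indented lines holding only names and values, no delimiters."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, (dict, list)):
                lines.append(f"{pad}{key}")
                lines.extend(outline(child, indent + 1))
            else:
                lines.append(f"{pad}{key} {child}")
    elif isinstance(value, list):
        for child in value:
            lines.extend(outline(child, indent))
    else:
        lines.append(f"{pad}{value}")
    return lines


unit = {
    "unit": {
        "name": "knight",
        "health": 100,
        "tags": ["melee", "armoured"],
        "abilities": {
            "charge": {"damage": 30, "cooldown": 8.0},
            "block": {"chance": 0.25},
        },
    }
}

print("\n".join(outline(unit)))
```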

There must be a way to represent this data in a concise and readable form, without as many tags as XML requires us to use.

Apart from the verbosity, there are other problems with XML, which prevent it from being suitable for our purposes.

One of them is the distinction between attributes and tags. While for its originally intended usage – representing documents (think of the related HTML) – this makes sense, for us, there is no difference.

Having two syntactic options to represent a single meaning can be confusing and lead to inconsistent usage. This may make editing XML files by hand significantly harder.

SDL

One alternative approach to XML is SDL, the Simple Declarative Language.

Translating our XML from above into SDL results in the following:

Now, this certainly is concise. There is hardly anything here not corresponding directly to our data.

Also, something I personally like is that SDL makes a clear distinction between numbers and strings.

I am still mixing attributes and content just like above. However, we can easily change that without making the script much longer:

This is maybe even more readable.

However, we are still left with the same ambiguity as above: when to use attributes, and when to use properties.

Further, while this format is very readable, writing valid files may be more difficult. Note how some identifiers are wrapped in quotes and others are not (and having both for the same property is valid as well).

While the syntax of SDL defines how to handle the different cases, having to keep these rules in mind may be very confusing and lead to enough syntax errors to make writing SDL by hand impractical.

Also note how there is no real difference between lists with objects as elements and objects with properties, similar to XML. This again does not necessarily result in an intuitive representation of our data.

TOML

Another format we could consider is TOML, Tom’s Obvious, Minimal Language.

Expressing our data in TOML might result in this:
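A hypothetical sketch of such a TOML file, parsed with Python's `tomllib` module (part of the standard library from Python 3.11; on older interpreters the snippet simply skips the parsing step). The table and key names are assumptions.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    tomllib = None  # older interpreter: parsing is skipped below

UNIT_TOML = """
[unit]
name = "knight"
health = 100
speed = 2.5
tags = ["melee", "armoured"]

# Double brackets append a table to the list unit.abilities.
[[unit.abilities]]
type = "charge"
damage = 30
cooldown = 8.0

[[unit.abilities]]
type = "block"
chance = 0.25
"""

data = tomllib.loads(UNIT_TOML) if tomllib else None

if data is not None:
    unit = data["unit"]
    # Numbers arrive as int/float, strings as str, without manual conversion.
    print(unit["health"], unit["speed"], unit["tags"])
    print([ability["type"] for ability in unit["abilities"]])
```

The nesting here is carried entirely by the dotted names in the headers rather than by indentation or brackets.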

As we can see, TOML is also able to express our data very concisely.

We also have a clear difference between strings and numbers, and lists of objects are represented differently than objects with properties.

There are no attributes, only properties, removing another source of ambiguity.

Overall, I think TOML is a neat format, but I still doubt whether it is the right thing to represent the kind of data in question.

My main criticism is that it feels somewhat unstructured to me. While the nesting is clearly defined and for the most part easy to understand and write, it is not necessarily obvious at first glance.

It is however a format that seems well suited to simpler cases, like settings or configuration files, which is in fact its stated purpose.

JSON

JSON – JavaScript Object Notation – is, as the name implies, based on JavaScript's syntax for object literals.

Our data represented in JSON might look like this:
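A hypothetical sketch of a fully expanded JSON file, parsed with Python's standard `json` module. The property names and values are assumptions chosen for illustration.

```python
import json

UNIT_JSON = """
{
    "unit": {
        "name": "knight",
        "health": 100,
        "speed": 2.5,
        "tags": [
            "melee",
            "armoured"
        ],
        "abilities": [
            {
                "type": "charge",
                "damage": 30,
                "cooldown": 8.0
            },
            {
                "type": "block",
                "chance": 0.25
            }
        ]
    }
}
"""

# Objects become dicts, lists become lists, numbers stay numbers.
unit = json.loads(UNIT_JSON)["unit"]
ability_types = [ability["type"] for ability in unit["abilities"]]
```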

We again have something that looks a bit more verbose. This is mostly caused by JSON being very explicit with nesting. Every object must be surrounded with { } while every list is surrounded with [ ].

On the upside, this makes the relation between different properties very clear. With most programmers being used to C-style languages, this notation is likely to feel intuitive to the majority of them.

JSON also does not have a concept of attributes. There are only named properties of different types.

We are not stuck with the above verbose form of JSON, however, should we decide to use it.

For example, here is a small pattern I like to use when considering a list of similar objects that have different properties and are identified by a string identifier:
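One plausible reading of this pattern is that each list entry is an object with a single property, whose name is the identifier and whose value holds the remaining properties. The sketch below assumes that reading; the names `ABILITIES_JSON` and `read_abilities` and all values are made up for this example.

```python
import json

ABILITIES_JSON = """
{
    "abilities": [
        { "charge": { "damage": 30, "cooldown": 8.0 } },
        { "block": { "chance": 0.25 } }
    ]
}
"""


def read_abilities(text):
    """Return (type name, properties) pairs from the abilities list."""
    result = []
    for entry in json.loads(text)["abilities"]:
        # Each entry must hold exactly one key: the type of the object.
        # Unpacking raises ValueError if an entry has more or fewer keys.
        [(type_name, properties)] = entry.items()
        result.append((type_name, properties))
    return result


for type_name, properties in read_abilities(ABILITIES_JSON):
    print(type_name, properties)
```

Because the type name is known before the properties are read, a loader can pick the matching class first and then fill it in.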

This makes a clear distinction between the type and properties of the objects, and saves us some typing at the same time.

It can also significantly simplify deserialisation, since we are bound to read the name – and here type – of the object before reading its properties.

Note how I already used the same pattern in the first example for the entire object itself.

Further, like XML – but unlike the other two formats – JSON is whitespace-insensitive. This allows us to format our file in a much more compact form, while still keeping it just as readable:
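As a small check of that claim, the sketch below parses the same hypothetical data in a fully expanded and in a compact layout and compares the results. The property names and values are assumptions.

```python
import json

EXPANDED = """
{
    "unit": {
        "name": "knight",
        "tags": [
            "melee",
            "armoured"
        ],
        "abilities": [
            {
                "charge": {
                    "damage": 30,
                    "cooldown": 8.0
                }
            }
        ]
    }
}
"""

COMPACT = """
{ "unit": {
    "name": "knight",
    "tags": ["melee", "armoured"],
    "abilities": [
        { "charge": { "damage": 30, "cooldown": 8.0 } }
    ]
} }
"""

# Whitespace between tokens carries no meaning, so both parse identically.
same = json.loads(EXPANDED) == json.loads(COMPACT)
print(same)
```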

In fact, note how closely this resembles our original C# code:

The conversion is virtually one to one.

While this is not surprising, given the origins of JSON, it shows how well it is suited to represent the kind of data we are dealing with.

On a last note, in many applications, each script file is likely to contain only a single object – in this case a unit template.

In that case, we can of course simplify our code even further, making it contain only the essentials:

Further, when shipping our game, we could compress the files by removing all unnecessary white-space, turning it into a single line. That will both save space, and slightly improve parsing performance.
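A minimal sketch of such a build step in Python, assuming it is acceptable to round-trip the file through a parser (which may re-render numbers, for example `1e2` becomes `100.0`, and escapes non-ASCII characters by default). The helper name `minify` is made up for this example.

```python
import json


def minify(text):
    """Re-serialise JSON on a single line without optional whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"))


PRETTY = """
{
    "name": "knight",
    "tags": ["melee", "armoured"],
    "abilities": [
        { "charge": { "damage": 30, "cooldown": 8.0 } }
    ]
}
"""

minified = minify(PRETTY)
print(minified)
print(len(PRETTY), "->", len(minified), "characters")
```

The readable files stay in the repository for editing; only the shipped copies are minified.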

Comparison

As I am sure is obvious from my comments above, I strongly dislike the distinction between attributes and content/properties.

I further consider grouping objects and lists using brackets a positive feature, since it leaves nesting unambiguous, and clearly maps onto data representations in source code.

That is why JSON is my clear favourite of the above – and in fact any other format I have come across so far.

The only thing I do not like about JSON is that it allows any string as name for properties. Consequently, property names have to be wrapped in quotes, just like string values.

Were I to define my own clear-text data storage format, I would take JSON, remove those quotes, and only allow alphanumeric identifiers as property names.

Other than this, I have not found any fault with JSON, despite now using it heavily for several years.

Conclusion

Above I highlighted some of the differences between clear-text data storage formats XML, JSON, and the lesser known SDL and TOML.

I gave some arguments for why I consider them more or less suitable for different uses.

While any of them, and any number of other formats, can be used to represent the same data, I hope I gave a coherent explanation for why I prefer JSON.

In any case, let me know what you think!
Do you agree with my opinions and arguments?
What formats or languages do you use, and why?

Make sure to leave a comment and feel free to share this post with anyone who may be interested.

Next week I will continue to expand on this topic and look into serialising and deserialising JSON in C#, using Json.NET.

One comment

  1. Mark McHenry says:

    Nice comparison.

    I think your criticism of TOML is a little too subjective:
    “My main criticism is that it feels somewhat unstructured to me. … not necessarily obvious at a first glance.”

    Actually, I prefer the dot notation in TOML over braces in JSON. Dot notation is pretty standard in many OO languages. In TOML you can immediately determine the parents, but in JSON you have to visually scan the file looking for and recording the braces to find all the parents. This can be error prone if there are a lot of parents, a big file, or poorly formatted. (Your test file was pretty small and nicely formatted.) Also, TOML allows comments, whereas JSON does not.

    Between the two, I think it’s a toss up, but your specific needs may drive you to use one over the other.