The purpose of this document is to introduce a series of upcoming tools for data exchange and the reasons for their existence. First published 2010-08-01.
Experience shows that the problem of exchanging, storing and evolving data would benefit from being split into independent subproblems.
Although data usually have one canonical way of being thought of, a variety of technical constraints call for a variety of data formats and implementations.
The first problem is therefore to come up with a common language for describing data types without relying on features that are specific to a particular programming language or data format.
The second problem is to allow any programming language to join the party without having to redefine and reimplement the tools for describing the data types.
The third problem is to allow different serialization formats to be used to represent the same data.
Instead of pushing for a supposedly best combination of programming language and data format, we acknowledge that real-world data will be represented in different ways by different tools no matter what.
After a little bit of thinking and not so much tweaking, all data can be described using:
These are the core features allowed by the ATD syntax (ATD = "Adjustable Type Definitions"). In practice, each programming language may have a choice of different representations of a particular ATD type. The idea here is to allow annotations in the ATD file that can be used to specify language-specific options.
Although the syntax of ATD is strongly based on the syntax of OCaml type definitions, and although the current tools using ATD are implemented in OCaml, the target languages can be anything. OCaml shines when it comes to processing code and producing more code. The language of the input code here is ATD but the output can be anything: OCaml code, Java code, JSON, HTML documentation, etc.
It is of course possible to reimplement the tools of the atd package in another programming language at a reasonable cost, maybe a few weeks to a few months of work for a clean job, but this is not expected to happen anytime soon.
Using the atd library to build code generators for languages other than OCaml makes perfect sense and is something OCaml is well suited for.
To date at MyLife we have been using the following tools, all based on ATD type specifications:
atdgen
and using
yojson
for runtime support,
A variety of data formats (JSON, XML, etc.) can be used and data types can be specified using ATD. The ATD syntax allows for annotations in various places which make it possible to adjust the basic ATD type definition to the idioms of the target language.
Several tools that make the ATD language relevant will be released around the same time. This is a list of features that these tools offer.
atd is the OCaml library providing a parser for the ATD language and various utilities. ATD stands for Adjustable Type Definitions, in reference to its main property of supporting annotations that allow a good fit with a variety of data formats.
(* This is an ATD file *)

type 'a tree = [
    Node of ('a tree * 'a tree)
  | Leaf of 'a
]

type record = {
  name : string;
    (* Required field *)

  ~friends : string list <ocaml repr="array">;
    (* Optional field with a default value, by default the empty list.
       <ocaml repr="array"> is an annotation for the OCaml code
       generators that only they need to interpret. *)

  ?descr : string option;
    (* Optional field without a default value. *)

  tree : int tree;
}
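To make the ATD example more concrete, here is a hand-written sketch of OCaml types that could correspond to it. It is not the output of any tool. It assumes that ATD sum types map to polymorphic variants and that the `<ocaml repr="array">` annotation turns the list into an array. The `make_record` and `count_leaves` helpers are hypothetical names introduced only for illustration.

```ocaml
(* Hand-written OCaml counterpart of the ATD definitions above.
   Assumption: sum types become polymorphic variants, and
   <ocaml repr="array"> selects an array instead of a list. *)
type 'a tree = [ `Node of ('a tree * 'a tree) | `Leaf of 'a ]

type record = {
  name : string;
  friends : string array;
  descr : string option;
  tree : int tree;
}

(* Hypothetical helper: optional arguments stand in for the
   ~friends field (default: empty) and the ?descr field (default: absent). *)
let make_record ~name ?(friends = [||]) ?descr ~tree () =
  { name; friends; descr; tree }

(* Count the leaves of a tree, whatever the type of its elements. *)
let rec count_leaves (t : 'a tree) : int =
  match t with
  | `Leaf _ -> 1
  | `Node (l, r) -> count_leaves l + count_leaves r

let example =
  make_record ~name:"Alice"
    ~tree:(`Node (`Leaf 1, `Node (`Leaf 2, `Leaf 3)))
    ()

let () =
  (* Omitted optional fields take their default values. *)
  assert (example.friends = [||]);
  assert (example.descr = None);
  Printf.printf "%s has %d leaves\n" example.name (count_leaves example.tree)
```

Only the required fields `name` and `tree` have to be supplied; the two optional fields fall back to an empty array and to `None` respectively.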
The atd package provides:
atdcat, a program that reads type definitions, optionally applies inheritance or monomorphization, and prints them back.
An interesting use of ATD annotations is that <doc text=...> nodes can be used to specify comments applicable to the generated code. Such comments can be interpreted by the code generators and converted into ocamldoc-compliant or javadoc-compliant comments, allowing the production of quality HTML documentation.
Yojson is an optimized parser/printer and pretty-printer for JSON. It addresses a few limitations of json-wheel and provides a number of low-level runtime functions that the code generated by atdgen relies on.
The main differences with json-wheel are:
Biniou (pronounced "be new") is a new binary format largely equivalent to JSON since it has the following properties:
Field names and variant names are represented using 31-bit hashes like method names and polymorphic variants in OCaml.
Strings have no encoding requirement and are stored without any escaping.
Arrays of records can be represented using a specific representation called tables. A table does not repeat field information shared by all its records, resulting in space gains.
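The 31-bit hashing mentioned above can be sketched as follows. The function below follows the algorithm the OCaml compiler uses for polymorphic variant tags; that biniou applies exactly this function to field and variant names is an assumption here, and the name `hash_name` is chosen for illustration.

```ocaml
(* Hash a name into a signed 31-bit integer, using the same algorithm
   as the OCaml compiler's hash for polymorphic variant tags.
   Assumption: biniou hashes field and variant names the same way. *)
let hash_name (s : string) : int =
  let accu = ref 0 in
  for i = 0 to String.length s - 1 do
    accu := (223 * !accu) + Char.code s.[i]
  done;
  (* Keep only the low 31 bits. *)
  accu := !accu land ((1 lsl 31) - 1);
  (* Reinterpret as a signed 31-bit value so that the result is the
     same on 32-bit and 64-bit platforms. *)
  if !accu > 0x3FFFFFFF then !accu - (1 lsl 31) else !accu

let () =
  List.iter
    (fun name -> Printf.printf "%-8s -> %d\n" name (hash_name name))
    [ "name"; "friends"; "descr"; "tree" ]
```

Because names are reduced to 31 bits, two distinct names can in principle collide, so a writer and a reader still need to agree on the set of names in use.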
Biniou data typically take 25-30% less space than their JSON equivalent.
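The table idea described above, storing shared field information once rather than once per record, can be illustrated with a small in-memory model. This is only a sketch of the principle, not the biniou wire format; all type and function names are hypothetical.

```ocaml
(* A record modelled as a list of (field name, value) pairs. *)
type value = Int of int | String of string
type fields = (string * value) list

(* A table stores the shared field names once, followed by one row
   of values per record. *)
type table = { header : string list; rows : value list list }

(* Returns None when the records do not all have the same fields in
   the same order, in which case a table does not apply. *)
let table_of_records (records : fields list) : table option =
  match records with
  | [] -> Some { header = []; rows = [] }
  | first :: _ ->
      let header = List.map fst first in
      if List.for_all (fun r -> List.map fst r = header) records then
        Some { header; rows = List.map (List.map snd) records }
      else None

(* Rebuild the individual records from a table. *)
let records_of_table (t : table) : fields list =
  List.map (fun row -> List.combine t.header row) t.rows

let people =
  [ [ ("name", String "Alice"); ("age", Int 30) ];
    [ ("name", String "Bob"); ("age", Int 25) ] ]

let () =
  match table_of_records people with
  | None -> print_endline "records are not uniform"
  | Some t ->
      (* The conversion loses no information. *)
      assert (records_of_table t = people);
      Printf.printf "field names stored: %d instead of %d\n"
        (List.length t.header)
        (List.length (List.concat people))
```

The saving grows with the number of records, since the header cost is paid once per table rather than once per record.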
biniou is the OCaml package that provides optimized readers, writers and pretty-printers for the biniou format. The library also provides the runtime functions used by the code generated by atdgen, as well as the buffer types used by yojson.
atdgen is a program that generates optimized OCaml code for reading and writing either biniou or JSON data. Generated code directly converts between byte buffers and the desired OCaml representation without going through a generic tree as json-static does.
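The difference between going through a generic tree and converting directly can be sketched for the writing direction. The code below is hand-written for illustration and is not what atdgen or json-static actually emit; the `point` type and all function names are hypothetical, and field names are assumed to need no JSON escaping.

```ocaml
type point = { x : int; y : int }

(* Generic route: build an intermediate tree, then print it. *)
type json = [ `Int of int | `Assoc of (string * json) list ]

let rec write_json buf (j : json) =
  match j with
  | `Int i -> Buffer.add_string buf (string_of_int i)
  | `Assoc fields ->
      Buffer.add_char buf '{';
      List.iteri
        (fun i (k, v) ->
          if i > 0 then Buffer.add_char buf ',';
          (* Assumption: keys are plain identifiers, so no escaping. *)
          Printf.bprintf buf "\"%s\":" k;
          write_json buf v)
        fields;
      Buffer.add_char buf '}'

let json_of_point p : json = `Assoc [ ("x", `Int p.x); ("y", `Int p.y) ]

(* Direct route: write the bytes without allocating a tree. *)
let write_point buf p =
  Buffer.add_string buf "{\"x\":";
  Buffer.add_string buf (string_of_int p.x);
  Buffer.add_string buf ",\"y\":";
  Buffer.add_string buf (string_of_int p.y);
  Buffer.add_char buf '}'

(* Run a writer on a fresh buffer and return the resulting string. *)
let to_string write v =
  let buf = Buffer.create 32 in
  write buf v;
  Buffer.contents buf

let () =
  let p = { x = 1; y = -2 } in
  let via_tree = to_string write_json (json_of_point p) in
  let direct = to_string write_point p in
  (* Both routes produce the same bytes; only the work done differs. *)
  assert (via_tree = direct);
  print_endline direct
```

The direct writer knows the shape of the data at compile time, so it avoids allocating and traversing the intermediate tree.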
Benchmarks performed on an amd64-Linux machine for combined reading and writing show that:
atdgen-json produces code that is 3 times faster than json-static but 4 times slower than atdgen-biniou.
atdgen-biniou is 2-3 times slower than OCaml's Marshal.
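For readers unfamiliar with the baseline in this comparison, here is a minimal round trip through OCaml's standard Marshal module. The `record` type and the `roundtrip` helper are hypothetical and no timing is implied.

```ocaml
type record = { name : string; friends : string array }

(* Serialize a value with Marshal and read it back.
   The type annotation on the result is the only thing telling OCaml
   what to expect: Marshal does not check that the data matches it. *)
let roundtrip (v : record) : record =
  let bytes = Marshal.to_string v [] in
  Marshal.from_string bytes 0

let () =
  let v = { name = "Alice"; friends = [| "Bob" |] } in
  assert (roundtrip v = v);
  Printf.printf "%d bytes\n" (String.length (Marshal.to_string v []))
```

Marshal copies OCaml's in-memory representation, which helps explain its speed, but the result is specific to OCaml and is not checked against the expected type when read back.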