# [ANN] stdio-0.2.0.0

Today I’m very happy to announce that a new beta version of stdio has been released on Hackage. There are some large additions in this version:

• Add UDP module.
• Add JSON module.
• Add ToText class to Std.Data.TextBuilder module.

The largest addition is the new JSON module, which features full RFC 8259 compliance and 2-3x faster performance. Trying to bring big improvements over a very mature library like aeson is not an easy task; part of the improvement comes from the faster Parser and Builder infrastructure in stdio, but we also made a few other improvements:

• To improve building lookup tables during JSON parsing (and parsing many other formats as well), an ordered map module based on sorted vectors and binary search has been added. There are actually four modules providing functions similar to containers:

• Std.Data.Vector.FlatMap
• Std.Data.Vector.FlatIntMap
• Std.Data.Vector.FlatSet
• Std.Data.Vector.FlatIntSet

These data structures have different time and memory characteristics from tree-based ones and suit some situations better, e.g. intermediate lookup tables.
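As a quick illustration, a flat map fits the build-once, query-many pattern (the function names below are my guesses at the API; check the modules for the real names):

```haskell
import qualified Std.Data.Vector.FlatMap as FM

-- A flat map is a sorted vector of pairs: packing is a sort,
-- lookup is a binary search. (Hypothetical function names.)
table :: FM.FlatMap String Int
table = FM.pack [("foo", 1), ("bar", 2), ("baz", 3)]

answer :: Maybe Int
answer = FM.lookup "bar" table
```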

• To improve JSON string parsing, we wrote a UTF-8 validation state machine in C with the JSON unescaping logic built in. Combined with the ability to scan a whole chunk of input (provided by our new Parser, namely scanChunks), we now have faster JSON string parsing than some native code (JSON.parse from Node.js).

• Numeric parsing is also improved: e.g. we changed the Integer decoding loop to use machine words when the range permits, converting the result to Integer at the end, rather than using Integer all the way through.
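The idea behind the Integer loop can be sketched in plain Haskell (this is my own sketch of the technique, not stdio's actual code):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Char (ord)

-- Accumulate digits in a machine Int when the digit count guarantees
-- no overflow, converting to Integer once at the end; fall back to
-- Integer arithmetic only for long inputs.
decDecimal :: String -> Integer
decDecimal s
    | length s <= 18 = fromIntegral (goInt 0 s)  -- 18 decimal digits always fit in Int64
    | otherwise      = goInteger 0 s
  where
    goInt :: Int -> String -> Int
    goInt !acc []     = acc
    goInt !acc (c:cs) = goInt (acc * 10 + (ord c - ord '0')) cs

    goInteger :: Integer -> String -> Integer
    goInteger !acc []     = acc
    goInteger !acc (c:cs) = goInteger (acc * 10 + fromIntegral (ord c - ord '0')) cs
```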

There are also lots of other improvements and simplifications contributing to faster performance. Overall this is good proof that our new Parser and Builder infrastructures work well.

Besides the JSON module, another important addition is the ToText class in Std.Data.TextBuilder, which is derivable via generics. The format is chosen to follow the Show class as much as possible. This is a very handy class when you need to quickly turn something into a builder or text.
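A sketch of what generic deriving looks like (the exact deriving syntax and module exports may differ from this):

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Std.Data.TextBuilder (ToText (..))

-- ToText comes for free via Generic; the rendered text is intended
-- to follow what Show would produce.
data Point = Point { px :: Int, py :: Int }
    deriving (Generic, ToText)

-- toText (Point 1 2) then renders much like show would:
-- Point {px = 1, py = 2}
```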

The UDP module was added quite easily, since the IO manager part is already in place. Due to UDP's message-based nature, the API is quite different from TCP's; design suggestions are welcome!

Happy hacking!
Cheers

# Aeson's mysterious lazy parsing

Today I was hacking on a new JSON parser for my stdio package, and of course I reused quite a lot of code from aeson. Kudos, Bryan O’Sullivan! But this time I really hacked into the core, and I learned something I had never really thought about.

Aeson provides two flavors of parsing entry points: decode and decode' (and similarly eitherDecode and eitherDecode'). The documentation for decode says: “This function parses immediately, but defers conversion.” I had never actually thought about it carefully, because it looks quite obvious to a Haskell veteran like me. Defers conversion? The cost of converting numbers, strings, arrays, etc. must be too high, so during parsing we suspend the conversion and return a result containing thunks. Then if our JSON instances only use some of the keys, we can save some time.

OK, that was my interpretation of those docs for a long time, and I always preferred decode over decode', since thunking those conversions made sense to me. But when I started to write a JSON value parser from the ground up, I suddenly realized things are far more complex than they first look; a question was begging me to answer it:

How can this deferring of conversion be possible, when parsing should tell us whether the JSON is valid?


There must be something wrong here. After parsing we should have something like Either String a; if we defer the conversions, how can we know the result is Right? If a deferred conversion somehow hits a problem after we’ve already given a Right result, will a bottom be produced? No, that’s just unacceptable: we must do all the conversions before we can be sure the result is Right. But then what’s the point of this lazy vs. strict arrangement?

Reading the source quickly leads us to the difference between the json and json' parsers: it’s the conversion between JSON bytes and Value, aeson’s intermediate JSON representation, that makes the difference. In the json case, thunks are produced during parsing. But where are those thunks? What makes things more mysterious is that Value uses strict fields all the way:
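For reference, aeson defines Value like this (with Object and Array being synonyms for HashMap Text Value and Vector Value):

```haskell
-- aeson's intermediate representation: note the strict fields.
data Value
    = Object !Object      -- type Object = HashMap Text Value
    | Array  !Array       -- type Array  = Vector Value
    | String !Text
    | Number !Scientific
    | Bool   !Bool
    | Null
```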

Look at those strict fields! They mean that if we force a Value to WHNF, e.g. to learn whether it’s a Number or a String, we also force the Scientific or Text payload. The only possible places left for thunks are the payloads of Object and Array, i.e. a HashMap and a Vector respectively. So while building these HashMaps and Vectors, we must be writing thunks instead of WHNF data structures. But why doesn’t this laziness bring bottom problems? And if we have parsed the whole JSON input without packing the results into a HashMap or Vector, where are they? Reading the source of the internal lazy parsers reveals the answer:
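The lazy array parser's element loop looks roughly like this (simplified from aeson's internals; helper names abbreviated):

```haskell
-- Parse array elements into an accumulator list, and only *lazily*
-- convert that list to a Vector when handing the result on.
arrayValues :: Parser Value -> Parser (Vector Value)
arrayValues val = loop [] 1
  where
    loop acc !len = do
        v  <- val <* skipSpace
        ch <- satisfy $ \w -> w == comma || w == closeBracket
        if ch == comma
            then skipSpace >> loop (v : acc) (len + 1)
            else return (Vector.reverse (Vector.fromListN len acc))
            -- ^ in the lazy json this return does NOT force the vector;
            --   the strict json' uses `return $!` here instead
```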

Parser is the classic CPS parser from the attoparsec package. In the CPS world, each parser parses its own part and calls the next parser if it succeeds. To get a successful parsing result, we have to run through all the parsers in a line, converting everything into a proper result and passing it on. Look at the loop above: it parses all the elements of a JSON array into a list before calling the next continuation. Now comes the important part:

Since aeson builds the list in accumulator style, a reverse is necessary, but this monadic return is not strict! So we build a thunk to defer the vector construction, while the list of elements is already there in memory, built as the parsing loop went. So yes, all the parsing work is done, and we don’t have to worry about bottoms coming from deferred conversions: the only deferred conversion here is the conversion from a list of elements to a vector of elements, which can never fail.

Similarly, in the HashMap case we have already parsed all the key-value pairs into a list. The loop body of the Object parser looks like this:
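Roughly (again simplified from aeson's internals):

```haskell
-- Key-value pairs are accumulated in a list; the HashMap is built
-- inside a (lazy) return at the very end.
objectValues :: Parser Text -> Parser Value -> Parser (HashMap Text Value)
objectValues str val = loop []
  where
    loop acc = do
        k  <- str <* skipSpace <* char ':'
        v  <- val <* skipSpace
        ch <- satisfy $ \w -> w == comma || w == closeBrace
        let acc' = (k, v) : acc
        if ch == comma
            then skipSpace >> loop acc'
            else return (HashMap.fromList acc')  -- the deferred conversion
```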

Now things are clear: we never miss anything during parsing, and we defer only two conversions when using aeson’s lazy parsing. So the question becomes: how does this laziness help? Well, if we modify the benchmarks in aeson’s repo and change the json parser to the strict json', the performance suddenly drops by ~40%.

This big performance gap lies in the Vector and HashMap building; some further benchmarks indicate it’s the HashMap building that costs: calculating hashes and building HAMT trees is not an easy job. If everything stops at this stage and we never access the fields of the HashMap, then we can say laziness has improved our performance. But this is just a false positive like many others: as soon as we access a single field of the HashMap, the thunk is entered and the building cost is paid, even if we don’t want every field. The Values nested inside the fields may still contain thunks, of course, but we’re not actually saving time in the common case, where the final step of JSON parsing is to convert a Value into a Haskell record.

From the memory perspective, lazy parsing is even more problematic: we are holding lists instead of packed Vectors or HashMaps, which means many indirection pointers and boxes that cannot be GCed. All the nodes of the JSON document are parsed anyway, and they are held by Haskell lists, which are in turn held by the conversion thunks. That’s just awful!

My conclusion: avoid lazy parsing in aeson if possible. It doesn’t work the way I expected, and it brings no benefit in most cases.

# stdio - A simple and high-performance IO toolkit for Haskell

Yesterday my friend Tao He and I wrote a short release message at a local cafe: we have released a new IO library for GHC, based on our previous work on combining libuv and GHC. Here I’d like to spend some time explaining what’s inside the new library besides the libuv binding part: what the motivation is, and what our solution looks like.

# A new Bytes type

ByteString has been used as the packed bytes type for years: it serves as the buffer type in IO, serialization, parsing, etc. Because it uses pinned memory, everything that works on Ptr ()s works on it. But this is really not ideal: pinned memory is only suitable for large blocks, and if lots of small ByteStrings are created at runtime, memory fragmentation can happen. To address this problem, in stdio we do a couple of things:

• To solve the system FFI use case, we add a separate CBytes type:
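Its shape is roughly this (see the Std.Data.CBytes module for the real definition):

```haskell
-- A null-terminated string type for passing to system FFI; a literal
-- can live in static memory as a plain Addr#.
data CBytes
    = CBytesOnHeap  {-# UNPACK #-} !(PrimArray Word8)  -- ^ null-terminated, heap-allocated
    | CBytesLiteral {-# UNPACK #-} !CString            -- ^ string literal in static memory
```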

It always wraps an immutable null-terminated string, and it’s used as the file path type, domain name type, etc. We use a rewrite rule so that a literal CBytes is constructed directly from a literal Addr# in O(1).

• We build a ByteArray#-based vector type on top of PrimArray from the primitive package.
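The definition is roughly just an array plus slice bounds:

```haskell
-- A PrimVector is a slice of a PrimArray, whose payload is an
-- (unpinned, if small) ByteArray# under the hood.
data PrimVector a = PrimVector
    {-# UNPACK #-} !(PrimArray a)   -- ^ payload
    {-# UNPACK #-} !Int             -- ^ offset in elements
    {-# UNPACK #-} !Int             -- ^ length in elements
```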

With this definition, the GHC RTS can freely choose to allocate in the nursery or in pinned blocks depending on the vector size, so creating small vectors is much cheaper. When you want to pass this new vector to a safe FFI call, you have to check whether it’s pinned; if not, a pinned copy is made (in the unsafe FFI case, you can pass the ByteArray# directly).

• Now we make Bytes an alias for PrimVector Word8.

There’s one key function where ByteString beats vector: break. It uses memchr when possible, which is SIMD enabled. This is very important because a high-performance Parser needs a fast takeWhile (== x), which in turn relies on finding the first occurrence of a specific byte quickly. We leverage GHC rewrite rules to solve this problem: use memchr whenever the type matches Bytes.
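The trick looks roughly like this (identifiers are illustrative, not stdio's exact code):

```haskell
-- The polymorphic break falls back to an element-by-element loop,
-- while a rewrite rule redirects the common `break (== w)` shape on
-- Bytes to a memchr-backed version:
{-# RULES "break/breakByte" forall w. break (w ==) = breakByte w #-}

breakByte :: Word8 -> Bytes -> (Bytes, Bytes)
breakByte w bs = case elemIndex w bs of   -- elemIndex calls memchr via FFI
    Nothing -> (bs, empty)
    Just i  -> splitAt i bs
```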

# Unified Vector

As we suggested in the last blog post, we built a unified interface covering both boxed and unboxed vectors. I didn’t provide separate typeclasses for slicing, indexing, etc.; instead, I made it very clear that a vector is just a slice of an array:
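The class is roughly just that slicing relation (see the vector base module in stdio for the real definition):

```haskell
-- A vector type v is anything that can be viewed as, and rebuilt
-- from, (array, offset, length); MArr/IArr are the associated mutable
-- and immutable array types.
class Arr (MArr v) (IArr v) a => Vec v a where
    toArr   :: v a -> (IArr v a, Int, Int)
    fromArr :: IArr v a -> Int -> Int -> v a
```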

Both the boxed Vector and the unboxed PrimVector definitions are as simple as they can be:
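For instance, the boxed one is just the following (the unboxed PrimVector swaps SmallArray for PrimArray):

```haskell
data Vector a = Vector
    {-# UNPACK #-} !(SmallArray a)  -- ^ payload
    {-# UNPACK #-} !Int             -- ^ offset
    {-# UNPACK #-} !Int             -- ^ length
```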

In stdio, vector combinators all work under a Vec v a constraint, so things like type-changing map, traverse, zipWith', etc. are possible:
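For example, map's signature leaves the result vector type open:

```haskell
-- Input and output vector types are independent parameters, so map
-- can change the vector flavor as well as the element type:
map :: forall u v a b. (Vec u a, Vec v b) => (a -> b) -> u a -> v b

-- e.g. turning Bytes (a PrimVector Word8) into a boxed vector:
-- map show bs :: Vector String
```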

I even implemented the KMP search algorithm on top of this unified vector interface:
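The search entry point is roughly this (the real stdio signature may carry extra options):

```haskell
-- Return the starting indices of every occurrence of needle in
-- haystack; KMP needs only Eq on elements, so this works for Bytes,
-- PrimVector Int, boxed Vectors, and so on.
indices :: (Vec v a, Eq a)
        => v a    -- ^ needle
        -> v a    -- ^ haystack
        -> [Int]
```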

Now enjoy searching an Int vector or a boxed vector in O(n + k) time!

# Builder and Parser

Our new Bytes requires a whole new machinery for construction and deconstruction: traditional Builders and Parsers all work on Ptr ()s, which is no longer suitable for our ByteArray#-based Bytes. When we first tried to design a solution for the new Bytes type back in 2017, we hit a problem: we couldn’t do unaligned access on ByteArray#s. For example, we couldn’t read or write 4 bytes as a Word32 at an arbitrary index of a ByteArray# the way we can with a Ptr (); the primitives GHC provided only allowed reads and writes at aligned indexes. So the work on stdio stalled, until new prim-ops were added in GHC 8.6.1:
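The relevant additions are the unaligned *AsXXX# families, e.g. for Word32 (in this era of GHC, sized values are passed as plain Word#; the Int# argument is a byte index):

```haskell
indexWord8ArrayAsWord32# :: ByteArray# -> Int# -> Word#
readWord8ArrayAsWord32#  :: MutableByteArray# s -> Int# -> State# s -> (# State# s, Word# #)
writeWord8ArrayAsWord32# :: MutableByteArray# s -> Int# -> Word# -> State# s -> State# s
-- ...with similar families for Word16/Word64, the Int variants,
-- Char, Float, Double and Addr#.
```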

This is very important: if we can’t do unaligned access on ByteArray#, then we’re forced to break a primitive value down into individual bytes during serialization, or reassemble it from several bytes when parsing, which is much slower than the old Storable way. In stdio we provide an UnalignedAccess type class to leverage these new prim-ops, which should probably be picked up in the next version of the primitive package.

Now that the missing piece is ready, we can have our Builder and Parser. In stdio we designed something different: the Builder monad supports not only building lists of Bytes chunks, but also building short strict Bytes. That is, we support different AllocateStrategys:
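The strategy type looks roughly like this (simplified from stdio's internals):

```haskell
-- Either keep doubling one buffer to produce a single strict Bytes,
-- or cut fixed-size chunks into a list, or hand each full chunk to an
-- action (e.g. a socket write).
data AllocateStrategy s
    = DoubleBuffer                        -- ^ one growing buffer, one strict result
    | InsertChunk {-# UNPACK #-} !Int     -- ^ chunk size for a list of Bytes
    | OneShotAction (Bytes -> ST s ())    -- ^ stream each chunk to an action
```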

This is an internal type used to implement the different allocation strategies, and the Builder monad uses this information to handle writes across buffer boundaries. The final API exposed to the user is very natural:
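Roughly, one runner per strategy (names as I recall them; check the Builder module's haddocks for the exact signatures):

```haskell
buildBytes     :: Builder a -> Bytes            -- double the buffer until done
buildBytesList :: Builder a -> [Bytes]          -- insert chunks into a list
buildAndRun    :: (Bytes -> IO ()) -> Builder a -> IO ()  -- stream chunks
```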

As for the Parser type, we chose a very simple CPS formulation:
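Roughly (simplified; the real error type is richer than String): a parser takes a failure continuation and a success continuation, and a ParseStep consumes input chunks, possibly asking for more.

```haskell
data Result a
    = Success !Bytes a          -- ^ leftover input and result
    | Failure !Bytes String     -- ^ leftover input and error
    | NeedMore (Bytes -> Result a)

type ParseStep r = Bytes -> Result r

newtype Parser a = Parser
    { runParser :: forall r. (String -> ParseStep r)  -- failure k
                -> (a -> ParseStep r)                 -- success k
                -> ParseStep r
    }
```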

After I implemented this Parser, I found it’s exactly the same as the one in the scanner package. That’s not a coincidence, though: in my own benchmarks this formulation is as fast as the non-resumable one from the store package!

Unlike what Yuras claimed in the scanner package, it’s actually quite easy to provide an Alternative instance for this formulation. And if a user doesn’t use it, there is no backtracking cost at all.

We provide high-quality numeric Builders and Parsers as well; the IEEE floating-point formatting is based on my grisu3 patch, which is much faster.

Some extra words on Builder: why did we choose a monadic interface instead of a monoidal one? The reasons in my mind are:

• A monadic interface is more powerful: you can do things like accumulating a verification value (e.g. a checksum) while building bytes, or adding an extra ReaderT layer to inject building constants.

• A monadic interface is more symmetrical to Parser; e.g. we can have something like:

• A monadic interface works extremely well with Control.Monad, while control structures over monoids are a bit lacking.
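The symmetry point above can be sketched like this (encodeWord32LE and friends are hypothetical names): a record's builder and parser read almost as mirror images.

```haskell
buildFoo :: Foo -> Builder ()
buildFoo (Foo a b) = do
    encodeWord32LE a
    encodeWord64LE b

parseFoo :: Parser Foo
parseFoo = do
    a <- decodeWord32LE
    b <- decodeWord64LE
    return (Foo a b)
```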

# Buffered IO

Buffered IO is fundamental to any IO toolkit, since it can greatly boost IO performance. In base we have the Handle type, which is powerful but very complex. In stdio we designed a rather simple solution based on an idea from the io-streams package: a buffered input device should support a push-back operation, so that running a resumable Parser which consumes only part of the buffer is not a problem. The buffered IO API is as follows:
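Roughly, the surface looks like this (names as I recall them from stdio's buffered IO module; check the haddocks for exact signatures):

```haskell
newBufferedInput  :: Input i  => i -> Int -> IO (BufferedInput i)
newBufferedOutput :: Output o => o -> Int -> IO (BufferedOutput o)

readBuffer   :: BufferedInput i -> IO Bytes        -- empty Bytes on EOF
unReadBuffer :: Bytes -> BufferedInput i -> IO ()  -- push back unconsumed bytes
readParser   :: Parser a -> BufferedInput i -> IO (Either String a)

writeBuffer  :: BufferedOutput o -> Bytes -> IO ()
flushBuffer  :: BufferedOutput o -> IO ()
```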

Nothing fancy; every function should work as you’d expect. Note that we don’t provide thread-safety guarantees here: to use a BufferedInput or BufferedOutput from multiple threads, use an MVar to protect against concurrent access.

# More to come

By now, stdio has taken shape. We have put a lot of time into handcrafting it over the last two years: lots of experiments, lots of tests. We hope this little project can start to make some of your jobs easier, but of course it’s still at an early stage. There are a lot of functionalities we want to add eventually; here is some of the roadmap:

• Implement a binary serialization module similar to the binary and cereal packages.
• Implement a JSON module.
• Support UDP device.
• Add DNS functionality, i.e. a getaddrinfo binding.
• Add HTTP protocol support.
• Add TLS (transport layer security) support.

Both Tao He and I know that maintaining a project of this size is beyond our capacity alone, so we’d like to ask for help from the Haskell community. How can we grow this project further? In which direction does the community need it to go? We’d like to hear your opinions, and of course your user experience reports with stdio!

# A unified array interface

Recently I’ve been trying to merge some array code from my research project stdio into primitive, one of the core Haskell libraries. I hope the review process can be finished soon so that people can start using it. This post is a summary of current GHC array support (up to 8.2.1; things may have changed by the time you read this) and of what my patch is about.