Aeson's mysterious lazy parsing

Today I’m hacking on a new JSON parser for my stdio package, and of course I reused quite a lot of code from aeson. Kudos, Bryan O’Sullivan! But this time I really hacked into the core, and I learned something I had never really thought about.

Aeson provides two flavors of parsing entrances: decode and decode' (and similarly eitherDecode and eitherDecode'). The documentation for decode says that “This function parses immediately, but defers conversion.” I had never actually thought about this carefully, because it looks quite obvious to a Haskell veteran like me. Defers conversion? The cost of converting numbers, strings, arrays, etc. must be too high, so during parsing we suspend the conversion and return a result containing thunks. Then if our JSON instances only use some of the keys, we can save some time.
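
For reference, the two entry points share the same type, and only their strictness differs (these signatures are from aeson, where ByteString is the lazy flavor):

decode  :: FromJSON a => ByteString -> Maybe a  -- parses immediately, defers conversion
decode' :: FromJSON a => ByteString -> Maybe a  -- parses and converts immediately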

OK, that was my interpretation of those docs for a long time, and I always preferred decode to decode', since thunking those conversions made sense to me. But when I started to hack a JSON value parser from the ground up, I suddenly realized this is far more complex than it first looks, and one question begged to be answered:

How can this `conversion deferring` be possible, when parsing should tell us whether the JSON is valid?

There must be something wrong here. After parsing we should have something like Either String a; if we defer the conversions, how can we know the result is Right? If a deferred conversion somehow hits a problem after we’ve already given a Right result, a bottom will be produced. No, that’s just unacceptable: we must do all the conversions before we can be sure the result is Right. But then what’s the point of this lazy vs. strict arrangement?

Reading the source quickly leads us to the difference between the json and json' parsers: it’s the conversion from JSON bytes to Value, aeson’s intermediate JSON representation, that makes the difference. In the json case, thunks are produced during parsing. But where are those thunks? What makes things more mysterious is that Value uses strict fields all the way down:

type Object = HashMap Text Value
type Array = Vector Value

-- | aeson's intermediate JSON representation
data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Scientific
           | Bool !Bool
           | Null
    deriving (Eq, Read, Show, Typeable, Data, Generic)

Look at those strict fields! They mean that if we force a Value to WHNF, e.g. to find out whether it’s a Number or a String, we also force the Scientific or Text payload. The only possible places for thunks are the payloads of the payloads of Objects and Arrays, i.e. whatever the HashMap or the Vector holds inside. So while building these HashMaps and Vectors, we write thunks instead of evaluated data structures. But why doesn’t this laziness bring a bottom problem? And if we have parsed the whole JSON input without packing the results into a HashMap or Vector, where are they? Reading the source of the internal lazy parsers reveals the answer:

arrayValues :: Parser Value -> Parser (Vector Value)
arrayValues val = do
    skipSpace
    w <- A.peekWord8'
    if w == CLOSE_SQUARE
        then A.anyWord8 >> return Vector.empty
        else loop [] 1
  where
    loop acc !len = do
        v <- val <* skipSpace
        ch <- A.satisfy $ \w -> w == COMMA || w == CLOSE_SQUARE
        if ch == COMMA
            then skipSpace >> loop (v:acc) (len+1)
            else return (Vector.reverse (Vector.fromListN len (v:acc)))

Parser is the classic CPS parser from the attoparsec package. In a CPS parser’s world, each parser parses its own part and calls the next parser if it succeeds. To get a successful parsing result, we have to run through all the parsers in a line, converting everything into a proper result and passing it on. Look at the loop above: it parses all the elements of a JSON array into a list before calling the next continuation. Now comes the important part:

return (Vector.reverse (Vector.fromListN len (v:acc)))

Since aeson uses an accumulator to build the list, a reverse is necessary, but this monadic return is not strict! So we build a thunk that defers the vector construction, while the list of elements is already there in memory, built as the parsing loop went. So yes, all the parsing work is done, and we don’t have to worry about bottoms coming from deferred conversions: the only deferred conversion here is the conversion from a list of elements to a vector of elements, which can never fail.
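
Here is a minimal illustration of the effect (my own demo, not aeson code): the list is fully evaluated, but the list-to-vector conversion stays a thunk until someone inspects the vector.

import qualified Data.Vector as Vector

main :: IO ()
main = do
    let xs = [1 :: Int, 2, 3]                        -- already in memory, like the parser's accumulator
        v  = Vector.reverse (Vector.fromListN 3 xs)  -- a thunk: the conversion is deferred
    -- forcing v to WHNF is what finally pays the conversion cost
    print (Vector.head v)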

Similarly, in the HashMap case, we have already parsed all the key-value pairs into a list. The loop body of the Object parser looks like this:

loop acc = do
    k <- str <* skipSpace <* char ':'
    v <- val <* skipSpace
    ch <- A.satisfy $ \w -> w == COMMA || w == CLOSE_CURLY
    let acc' = (k, v) : acc
    if ch == COMMA
        then skipSpace >> loop acc'
        else return (H.fromList acc')

Now things are clear: we never miss a thing during parsing, and we only defer two conversions when using aeson’s lazy parsing. The question becomes: how does this laziness help? Well, if we modify the benchmarks in aeson’s repo and change the json parser to the strict json', performance suddenly drops by ~40%:

# changed to use the strict parser
json-data/twitter1.json :: 60000 times
  0.8 KB: 68836 msg/sec (56.1 MB/sec)
json-data/twitter10.json :: 13000 times
  6.4 KB: 11712 msg/sec (73.7 MB/sec)
json-data/twitter20.json :: 7500 times
  11.8 KB: 5568 msg/sec (64.1 MB/sec)
json-data/twitter50.json :: 2500 times
  31.2 KB: 2045 msg/sec (62.3 MB/sec)
json-data/twitter100.json :: 1000 times
  61.5 KB: 927 msg/sec (55.7 MB/sec)
json-data/jp10.json :: 4000 times
  14.6 KB: 6087 msg/sec (87.0 MB/sec)
json-data/jp50.json :: 1200 times
  44.1 KB: 1751 msg/sec (75.4 MB/sec)
json-data/jp100.json :: 700 times
  82.9 KB: 820 msg/sec (66.3 MB/sec)

# now the lazy parser
json-data/twitter1.json :: 60000 times
  0.8 KB: 97350 msg/sec (79.4 MB/sec)
json-data/twitter10.json :: 13000 times
  6.4 KB: 17675 msg/sec (111.1 MB/sec)
json-data/twitter20.json :: 7500 times
  11.8 KB: 8692 msg/sec (100.0 MB/sec)
json-data/twitter50.json :: 2500 times
  31.2 KB: 3171 msg/sec (96.6 MB/sec)
json-data/twitter100.json :: 1000 times
  61.5 KB: 1392 msg/sec (83.6 MB/sec)
json-data/jp10.json :: 4000 times
  14.6 KB: 8435 msg/sec (120.5 MB/sec)
json-data/jp50.json :: 1200 times
  44.1 KB: 2430 msg/sec (104.6 MB/sec)
json-data/jp100.json :: 700 times
  82.9 KB: 1119 msg/sec (90.6 MB/sec)

This big performance gap lies in Vector and HashMap building; further benchmarks indicate it’s the HashMap building that really costs: calculating hashes and building a HAMT tree is not an easy job. If everything stopped at this stage and we never accessed any field of the HashMap, we could say laziness had improved our performance. But this is just a false positive like many others: as soon as we access a single field of the HashMap, the thunk is entered and the building cost is paid, even if we don’t want every field. Of course, the Values nested inside the fields may still contain thunks, but we’re not actually saving time in the common case, where the final step of JSON parsing is to convert a JSON Value into a Haskell record.
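
A minimal demo of that false positive (my own illustration, not aeson code):

import qualified Data.HashMap.Strict as H

demo :: Maybe Int
demo =
    let pairs = [("a", 1), ("b", 2)]  -- like the Object parser's accumulator
        m     = H.fromList pairs      -- a thunk: no hashing has happened yet
    in  H.lookup "a" m                -- entering the thunk builds the whole HAMT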

From the memory perspective, lazy parsing is even more problematic: we hold lists instead of packed Vectors or HashMaps, which means many indirections and boxes that cannot be GCed. All the nodes of the JSON document are parsed anyway, and they are held in Haskell lists, which are in turn held by conversion thunks. That’s just awful!

My conclusion: avoid lazy parsing in aeson if possible. It doesn’t work the way I expected, and it brings no benefit in most cases.

stdio - A simple and high-performance IO toolkit for Haskell

Yesterday my friend Tao He and I wrote a short release message at a local cafe: we released a new IO library for GHC based on our previous work on combining libuv and GHC. Here I’d like to spend some time explaining what’s inside our new library besides the libuv binding part: what the motivation is, and what our solution looks like.

A new Bytes type

ByteString has been the packed bytes type for years; it’s used as the buffer type in IO, serialization, parsing, etc. Because it uses pinned memory, everything that works on Ptr () works on it too. But this is really not ideal: pinned memory is only suitable for large blocks, and if lots of small ByteStrings are created at runtime, memory fragmentation may happen. To address these problems, in stdio we do a couple of things:

  • To solve the system FFI use case, we add a separate CBytes type:
data CBytes
    = CBytesOnHeap  {-# UNPACK #-} !(PrimArray Word8)  -- ^ On-heap pinned 'PrimArray'
    | CBytesLiteral {-# UNPACK #-} !CString            -- ^ String literals with static address

It always wraps an immutable null-terminated string, and it’s used as the file path type, domain name type, etc. We use a rewrite rule so that a literal CBytes is constructed directly from a literal Addr# in O(1).

  • We build a ByteArray#-based vector type, based on PrimArray from the primitive package:
-- from primitive >= 0.6.4.0
data PrimArray a = PrimArray ByteArray#

data PrimVector a = PrimVector
    {-# UNPACK #-} !(PrimArray a) -- ^ payload
    {-# UNPACK #-} !Int           -- ^ offset in elements of type a rather than in bytes
    {-# UNPACK #-} !Int           -- ^ length in elements of type a rather than in bytes

With this definition, the GHC RTS can freely choose to allocate in the nursery or in pinned blocks depending on the vector size, so creating small vectors is much cheaper. When you want to pass this new vector to a safe FFI call, you have to check whether it’s pinned; if not, a pinned copy is made (in the unsafe FFI case, you can pass the ByteArray# directly). A sketch of this check follows the list below.

  • Finally, we make Bytes an alias for PrimVector Word8.
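
Here is a minimal sketch of the pin check mentioned above, written against the primitive package (the helper name is mine, not stdio’s actual API):

import Control.Monad.Primitive (touch)
import Data.Primitive.ByteArray
import Data.Word (Word8)
import Foreign.Ptr (Ptr)

-- Pass the payload's address to a safe foreign call, copying to pinned
-- memory first only when the array is movable by the GC.
withPinnedContents :: ByteArray -> (Ptr Word8 -> IO a) -> IO a
withPinnedContents ba f = do
    pinned <- if isByteArrayPinned ba
        then return ba
        else do
            let n = sizeofByteArray ba
            mba <- newPinnedByteArray n
            copyByteArray mba 0 ba 0 n
            unsafeFreezeByteArray mba
    r <- f (byteArrayContents pinned)
    touch pinned  -- keep the array alive across the foreign call
    return r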

There’s one key function where ByteString beats vector: break. It uses memchr where possible, which is SIMD-enabled. This is very important because a high-performance Parser needs a fast takeWhile (== x), which in turn relies on finding the first occurrence of a byte quickly. We leverage GHC rewrite rules to solve this problem: use memchr when the type matches Bytes.
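
The trick looks roughly like this (the names and the list-based bodies are illustrative stand-ins, not stdio’s internals): expose a generic search function, and let a rewrite rule swap in a byte-specialized version whenever the element type is Word8.

{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)

-- generic element-by-element search
elemIndexGen :: Eq a => a -> [a] -> Maybe Int
elemIndexGen x = go 0
  where
    go !_ []     = Nothing
    go !i (y:ys) = if x == y then Just i else go (i + 1) ys
{-# NOINLINE [1] elemIndexGen #-}  -- stay intact until the rule has had a chance to fire

-- byte-specialized version; in stdio the body would call memchr through
-- the FFI on the underlying ByteArray# instead of walking a list
elemIndexW8 :: Word8 -> [Word8] -> Maybe Int
elemIndexW8 x = go 0
  where
    go !_ []     = Nothing
    go !i (y:ys) = if x == y then Just i else go (i + 1) ys

{-# RULES "elemIndexGen/Word8" elemIndexGen = elemIndexW8 #-}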

Unified Vector

As suggested in the last blog post, we built a unified vector interface for both boxed and unboxed vectors. I didn’t provide separate typeclasses for slicing, indexing, etc.; instead, I made it very clear that a vector is just a slice of an array:

class (Arr (MArray v) (IArray v) a) => Vec v a where
    type MArray v = (marr :: * -> * -> *) | marr -> v
    type IArray v = (iarr :: * -> *) | iarr -> v
    toArr :: v a -> (IArray v a, Int, Int)
    fromArr :: IArray v a -> Int -> Int -> v a

The definitions of both the boxed Vector and the unboxed PrimVector are as simple as they can be:

data Vector a = Vector
    {-# UNPACK #-} !(SmallArray a) -- ^ payload
    {-# UNPACK #-} !Int            -- ^ offset
    {-# UNPACK #-} !Int            -- ^ length

instance Vec Vector a where
    type MArray Vector = SmallMutableArray
    type IArray Vector = SmallArray
    toArr (Vector arr s l) = (arr, s, l)
    fromArr = Vector

-- for PrimVector defined above
instance Prim a => Vec PrimVector a where
    type MArray PrimVector = MutablePrimArray
    type IArray PrimVector = PrimArray
    toArr (PrimVector arr s l) = (arr, s, l)
    fromArr = PrimVector

In stdio, vector combinators all work under a Vec v a constraint, so type-changing map, traverse, zipWith', etc. are possible:

-- We don't require the result vector type to be the same as the input one
map :: forall u v a b. (Vec u a, Vec v b) => (a -> b) -> u a -> v b
traverse :: (Vec v a, Vec u b, Applicative f) => (a -> f b) -> v a -> f (u b)
zipWith' :: (Vec v a, Vec u b, Vec w c) => (a -> b -> c) -> v a -> u b -> w c
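
For example, here is a usage sketch assuming the instances above: mapping a boxed vector to an unboxed one in a single pass, with the result type picking the PrimVector instance.

-- the Vec-polymorphic map above, not Prelude's map
lengths :: Vector [Char] -> PrimVector Int
lengths = map length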

I even implemented the KMP search algorithm on top of this unified vector interface:

indices :: (Vec v a, Eq a) => v a -> v a -> Bool -> [Int]

Now enjoy searching an Int vector or a boxed vector in O(n + k) time!

Builder and Parser

Our new Bytes requires a whole new machinery for construction and deconstruction: traditional Builders and Parsers all work on Ptr ()s, which doesn’t suit our new ByteArray#-based Bytes anymore. When we first wanted to design a solution for the new Bytes type back in 2017, we hit a problem: we couldn’t do unaligned access on ByteArray#s, e.g. we couldn’t read or write 4 bytes as a Word32 at an arbitrary index of a ByteArray# the way we can with a Ptr (); the primitives GHC provided only allowed reads and writes at aligned indexes. So the work on stdio stalled, until new prim-ops were added in GHC 8.6.1:

indexWord8ArrayAsWord16# :: ByteArray# -> Int# -> Word#
indexWord8ArrayAsWord32# :: ByteArray# -> Int# -> Word#
indexWord8ArrayAsWord64# :: ByteArray# -> Int# -> Word#
...

This is very important: if we can’t do unaligned access on a ByteArray#, we’re forced to break a primitive value down into bytes during serialization, or reassemble it from several bytes during parsing, which is much slower than the old Storable way. In stdio we provide an UnalignedAccess typeclass to leverage these new prim-ops; it should probably be picked up in a future version of the primitive package.
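
As a sketch, wrapping one of these prim-ops looks like this (the function name is mine, and the code assumes GHC 8.6’s primop types, where the result is a plain Word#; later GHC versions changed the boxed-word representation):

{-# LANGUAGE MagicHash #-}

import Data.Primitive.ByteArray (ByteArray (..))
import GHC.Exts (Int (..), indexWord8ArrayAsWord32#)
import GHC.Word (Word32 (..))

-- read a Word32 at an arbitrary byte offset, no alignment required
indexWord32Unaligned :: ByteArray -> Int -> Word32
indexWord32Unaligned (ByteArray ba#) (I# i#) = W32# (indexWord8ArrayAsWord32# ba# i#)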

Now the missing piece is ready, and we can have our Builder and Parser. In stdio we designed something different: the Builder monad supports not only building Bytes chunks, but also building short, strict Bytes. That is, we support different AllocateStrategys:

data AllocateStrategy s
    = DoubleBuffer                       -- Double the buffer and continue building
    | InsertChunk {-# UNPACK #-} !Int    -- Insert a new chunk and continue building
    | OneShotAction (V.Bytes -> ST s ()) -- Freeze current chunk and perform action with it

This is an internal type for implementing the different allocation strategies; the Builder monad uses this information to handle writes across buffer boundaries. The final API exposed to the user is very natural:

buildBytes :: Builder a -> V.Bytes
buildBytesList :: Builder a -> [V.Bytes]
buildAndRun :: (V.Bytes -> IO ()) -> Builder a -> IO ()
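
A usage sketch (the vector module name and the qualified style are my assumption, mirroring the imports shown later in this post; B.encodePrim and B.double appear below as well):

import Data.Word (Word16)
import qualified Std.Data.Builder as B
import qualified Std.Data.Vector as V

-- one Builder rendered into a single strict Bytes
payload :: V.Bytes
payload = B.buildBytes $ do
    B.encodePrim (0xCAFE :: Word16)  -- a fixed-width primitive
    B.double 3.14                    -- a numeric builder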

As for the Parser type, we chose a very simple CPS formula:

data Result a
    = Success !V.Bytes a
    | Failure !V.Bytes String
    | Partial (V.Bytes -> Result a)

type ParseStep r = V.Bytes -> Result r

newtype Parser a = Parser { runParser :: forall r . (a -> ParseStep r) -> ParseStep r }

After I implemented this Parser, I found it’s exactly the same as the one in the scanner package. That’s no coincidence: in my own benchmarks this formula is as fast as the non-resumable one from the store package!
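
To see how this formula runs, here is a sketch of a top-level entry point (the name is mine): we hand the parser a final continuation that wraps the parsed value, together with the leftover input, in Success.

-- run a Parser against one input chunk
parseChunk :: Parser a -> V.Bytes -> Result a
parseChunk (Parser p) = p (\a rest -> Success rest a)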

Unlike what Yuras claimed in the scanner package, it’s actually quite easy to provide an Alternative instance for this formula, and if a user doesn’t use it, there is no backtracking cost at all.

We provide high-quality numeric Builders and Parsers as well; the IEEE floating-point formatting is based on my grisu3 patch, which is much faster.

Some extra words on Builder: why did we choose a monadic interface instead of a monoidal one? The reasons in my mind are:

  • A monadic interface is more powerful: you can do things like accumulating a verification value while building bytes, or adding an extra ReaderT layer to inject building constants.

  • A monadic interface is more symmetrical to Parser; e.g. we can have something like:

import qualified Std.Data.Builder as B
import qualified Std.Data.Parser as P

buildFoo :: Foo -> Builder ()
buildFoo (Foo x y z) = do
    B.bytes x
    B.encodePrim magic
    B.encodePrim y
    B.double z

parseFoo :: Parser Foo
parseFoo = do
    x <- P.takeWhile (/=magic)
    P.word8 magic
    y <- P.decodePrim
    z <- P.double
    return (Foo x y z)
  • A monadic interface works extremely well with Control.Monad, while control structures over a monoid are a bit lacking.

Buffered IO

Buffered IO is a fundamental module of any IO toolkit, since it can greatly boost IO performance. In base we have the Handle type, which is powerful but very complex. In stdio we designed a rather simple solution based on an idea from the io-streams package: a buffered input device should support a push-back operation, so that running a resumable Parser that consumes only part of the buffer is not a problem. The buffered IO API looks like this:

readBuffer :: (HasCallStack, Input i) => BufferedInput i -> IO V.Bytes
readExactly :: (HasCallStack, Input i) => Int -> BufferedInput i -> IO V.Bytes
readToMagic :: (HasCallStack, Input i) => Word8 -> BufferedInput i -> IO V.Bytes
readParser :: (HasCallStack, Input i) => P.Parser a -> BufferedInput i -> IO (ReadResult a)
...
writeBuffer :: (Output o) => BufferedOutput o -> V.Bytes -> IO ()
writeBuilder :: (Output o) => BufferedOutput o -> B.Builder a -> IO ()
flushBuffer :: Output f => BufferedOutput f -> IO ()
...

Nothing fancy; every function should work as you expect. Note that we don’t provide thread-safety guarantees here: to use a BufferedInput or BufferedOutput from multiple threads, protect concurrent access with an MVar.
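
For example, a minimal sketch of that locking pattern (the wrapper name is mine), built on the writeBuffer function above:

import Control.Concurrent.MVar (MVar, withMVar)

-- serialize writers: each write holds the lock for the whole operation
writeLocked :: Output o => MVar (BufferedOutput o) -> V.Bytes -> IO ()
writeLocked lock bs = withMVar lock $ \bo -> writeBuffer bo bs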

More to come

For now, stdio has taken shape. We have put a lot of time into handcrafting it over the last two years: lots of experiments, lots of tests. We hope this little project can start to make some of your jobs easier, but of course it’s still at an early stage. There are a lot of features we want to add eventually; here is some of the roadmap:

  • Implement a binary serialization module similar to the binary and cereal packages.
  • Implement a JSON module.
  • Support UDP devices.
  • Add DNS functionality, i.e. a getaddrinfo binding.
  • Add HTTP protocol support.
  • Add TLS (transport layer security) support.

Both Tao He and I know that maintaining a project of this size will be beyond our own capacity, so we’d like to ask for help from the Haskell community. How can we grow this project further? In which directions does the community need it to grow? We’d like to hear your opinions, and of course your experience reports using stdio!