Today I’m hacking on a new JSON parser for my stdio package; of course I reused quite a lot of code from aeson. Kudos, Bryan O’Sullivan! This time I really hacked into the core, and I learned something I had never really thought about.
Aeson provides two flavors of parsing entrances: `decode` and `decode'` (and similarly `eitherDecode` and `eitherDecode'`). The documentation on `decode` says that “This function parses immediately, but defers conversion.” I had never actually thought about it carefully, because it looks quite obvious to a Haskell veteran like me.
Defers conversion? It must be that the costs of converting numbers, strings, arrays, etc. are too high, so during parsing we suspend the conversions and return a result containing thunks. Then, if our JSON instances only use part of the keys, we can save some time.
OK, that was my interpretation of those docs for a long time, and I always preferred `decode`, since thunking those conversions made sense to me. But when I started to hack on a JSON value parser from the ground up, I suddenly realized this is far more complex than it first looks. A question was begging me to answer it:
How could this `conversion deferring` be possible, when parsing should tell us whether the JSON is valid?
There must be something wrong here. After parsing we should have something like `Either String a`; if we defer the conversions, how could we know the result is a `Right`? If a deferred conversion somehow hit a problem after we had already given a `Right` result, a bottom such as `undefined` would be produced. No, that’s just unacceptable: we must do all the conversions before we can be sure the result is `Right`. But what’s the point of this lazy vs. strict arrangement then?
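To make the danger concrete, here is a toy sketch (`badDecode` is my own illustrative name, not anything in aeson):

```haskell
-- A hypothetical decoder that returns Right *before* validating the
-- payload: the error hides inside the Right as a thunk.
badDecode :: String -> Either String Int
badDecode s = Right (read s)   -- 'read' is only evaluated on demand

demo :: IO ()
demo = case badDecode "not a number" of
  Right n -> print n           -- matching on Right succeeds...
  Left e  -> putStrLn e
-- ...but forcing n throws "Prelude.read: no parse": a bottom
-- smuggled inside a Right, which is exactly what a correct
-- parser must never produce.
```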
Reading the source quickly leads us to the difference between the `json` and `json'` parsers: it’s the conversion between JSON bytes and `Value`, aeson’s intermediate JSON representation, that makes the difference. In the `json` case, thunks are produced during parsing. But where are those thunks? What makes things more mysterious is that `Value` uses strict fields all the way:
```haskell
type Object = HashMap Text Value
```
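For reference, `Value` looks roughly like this (quoted from memory of aeson’s `Data.Aeson.Types.Internal`; newer aeson versions replace `HashMap` with `KeyMap`):

```haskell
data Value
    = Object !Object     -- strict: forcing the Value forces the HashMap
    | Array  !Array      -- strict: forcing the Value forces the Vector
    | String !Text
    | Number !Scientific
    | Bool   !Bool
    | Null

type Array = Vector Value
```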
Look at those strict fields! They mean that if we force a `Value` to WHNF, e.g. to learn whether it’s a `Number` or a `String`, we also force the `Text` (or `Scientific`) payload. The only possible places for thunks are the payloads of the payloads of `Object`s and `Array`s, i.e. the `Value` elements sitting inside a `HashMap` or a `Vector` respectively. So while building these `HashMap`s and `Vector`s, we can write thunks instead of WHNF data structures. But why doesn’t this laziness bring back the bottom problem? And if we have parsed the whole JSON bytes without packing the results into a `HashMap` or a `Vector`, where are they? Reading the source of the internal lazy parsers reveals the answer:
```haskell
arrayValues :: Parser Value -> Parser (Vector Value)
```
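The body looks roughly like this (a paraphrase from memory of aeson’s internals; `skipSpace`, `COMMA` and `CLOSE_SQUARE` stand in for the obvious helpers and word8 constants):

```haskell
arrayValues val = do
  skipSpace
  w <- peekWord8'
  if w == CLOSE_SQUARE
    then anyWord8 *> pure Vector.empty
    else loop [] 1
  where
    -- accumulate parsed elements in a list, counting as we go
    loop acc !len = do
      v <- val <* skipSpace
      ch <- satisfy $ \w -> w == COMMA || w == CLOSE_SQUARE
      if ch == COMMA
        then skipSpace *> loop (v:acc) (len + 1)
        else return (Vector.reverse (Vector.fromListN len (v:acc)))
```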
`Parser` is the classic CPS parser from the attoparsec package. In a CPS parser’s world, a parser parses its own part and calls the next parser if it succeeds; to get a successful parsing result, we have to run through all the parsers in a line, converting everything into a proper result and passing it on. Look at the loop above: it parses all the elements of a JSON array into a list before calling the next continuation. Now the important part comes:
```haskell
return (Vector.reverse (Vector.fromListN len (v:acc)))
```
Since aeson uses an accumulator style to build the list, a `reverse` is necessary, but this monadic `return` is not strict! So we build a thunk that defers the vector-building step, while the list of elements is already there in memory, built as the parsing loop went. So yes, all parsing work is done, and we don’t have to worry about bottoms coming from deferred conversions: the only deferred conversion here is the conversion from a list of elements to a vector of elements, which can never fail.
Similarly, in the `HashMap` case, we have already parsed all the key–value pairs into a list by the time the map is built. The loop body of the `Object` parser looks like this:

```haskell
loop acc = do
```
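In full (again a paraphrase; details differ in aeson’s source), the loop accumulates the pairs and defers only the final `HashMap` construction:

```haskell
loop acc = do
  k  <- str <* skipSpace <* char ':'
  v  <- val <* skipSpace
  ch <- satisfy $ \w -> w == COMMA || w == CLOSE_CURLY
  let acc' = (k, v) : acc
  if ch == COMMA
    then skipSpace *> loop acc'
    else return (H.fromList acc')   -- 'return' is lazy: this is the thunk
```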
Now things are clear: we never miss a thing during parsing, and we defer only two conversions when using aeson’s lazy parsing. The question becomes: how does this laziness help? Well, if we modify the benchmarks in aeson’s repo, changing the lazy `json` parser to the strict `json'`, the performance suddenly drops by ~40%.
```
# change to use a strict parser
```
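For a rough standalone comparison in the same spirit (my own sketch, not aeson’s actual benchmark harness; `sample.json` is a placeholder input file), one could use criterion:

```haskell
import Criterion.Main (bench, defaultMain, nf, whnf)
import Data.Aeson.Parser (json, json')       -- lazy and strict Value parsers
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as B

main :: IO ()
main = do
  bs <- B.readFile "sample.json"             -- placeholder test document
  defaultMain
    [ -- WHNF: the lazy parser looks faster, but only because the
      -- HashMap/Vector-building thunks are never entered.
      bench "json  (WHNF)" $ whnf (A.parseOnly json)  bs
    , bench "json' (WHNF)" $ whnf (A.parseOnly json') bs
      -- NF: every thunk is forced, so the deferred cost is paid too.
    , bench "json  (NF)"   $ nf   (A.parseOnly json)  bs
    , bench "json' (NF)"   $ nf   (A.parseOnly json') bs
    ]
```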
This big performance gap lies in `HashMap` building: some further benchmarks indicate it’s the `HashMap` construction that is costly, since calculating hashes and building HAMT trees is not an easy job. If everything stopped at this stage, and we never accessed any field of the `HashMap`, we could say laziness had improved our performance. But this is just a false positive like many others: as soon as we access a single field of the `HashMap`, the thunk is entered and the building cost is paid, even if we don’t want every field. Of course, the `Value`s nested inside the fields may still contain thunks, but we’re not actually saving time in the common case, where the final step of JSON parsing is to convert a JSON `Value` into a Haskell record.
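To see the false positive concretely, here is a tiny sketch (assuming an aeson version where `Object` wraps a `HashMap Text Value`):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Value(..), decode)   -- decode uses the lazy json parser
import qualified Data.HashMap.Strict as H

main :: IO ()
main = do
  let Just (Object o) = decode "{\"a\":1,\"b\":2,\"c\":3}"
  -- Matching on Object forces the Value to WHNF; because its field is
  -- strict, the deferred H.fromList thunk runs right here, hashing every
  -- key, even though we only ever ask for "a".
  print (H.lookup "a" o)
```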
From the memory perspective, lazy parsing is even more problematic: we are holding lists instead of packed `HashMap`s, which means lots of indirection pointers and boxes that cannot be GCed. All the nodes of the JSON document are parsed anyway, and they are held inside Haskell lists, which are in turn held by the conversion thunks. That’s just awful!
My conclusion: avoid aeson’s lazy parsing if possible. It doesn’t work the way I expected, and it brings no benefit in most cases.