TL;DR: Don’t normalize pretty
’s Doc
s. This library relies on laziness in a foundational way.
If you want to know what happens when you DeepSeq.force
a Doc
, and why the world then collapses on you, read on.
I worked on the problem in Agda that warnings would not print in color on the terminal (as opposed to errors), even though they were produced as annotated Doc
s and there was some renderer that turned the annotations into ANSI codes when printing.
It turned out that this renderer was not used for warnings as they were communicated through a data structure with just String
messages and thus got render
ed too early, losing the annotations.
My strategy to tackle this problem was to replace the String
by Doc
in the data structure and let GHC point me to the places I had to update.
There was suprisingly little to do, and I was quite happy, having expected a large refactoring ordeal.
The bill came when I tried to run the testsuite.
It always stalled after a short while, having to be killed, and I found in the Activity Monitor that each of the running agda
processes took 5GB.
With 8 processes spawned by the testsuite, my machine was soon swapping itself to a halt.
Once I even saw agda
eating 90GB before it crashed and took the Terminal App down with it.
Crying for help in my pull request Amélia analysed the situation as the normal form of a Doc
is horrible and should never be stored.
Her analysis was backed by some colored heap profiles.
Indeed I had turned a DeepSeq.force
on a rendered Doc
(so, a String
) into one on the Doc
itself.
(The purpose of the normalization was to bring exceptions to the surface and handle them, in order to not make internal errors in (debug) printing crash the whole process—but that detail is not important.)
Commonly space problems are caused by retaining thunks and can be fixed by normalization, but here I encountered the opposite problem, so it took some time to understand it.
Amélia was (of course) right and I implemented her suggested fix, to render annotated Doc
s into simple annotated trees (DocTree
) so as to preserve the annotations but prevent the memory explosion of normalized Doc
s.
pretty
What is going on? Why cannot we normalize Doc
s?
The problem is the Union
constructor in Doc
that holds two documents that render to the same text (with maybe different layout).
Relying on laziness, usually the second alternative is only used when the first does not work out well.
However, NFData
’s rnf
will normalize both alternatives.
Also, the way that the pretty
DSL is designed, common Doc
constructors are implemented shallowly (“smart constructors”), so they distribute over Union
. For instance, take beside
which is ubiquitous because it is invoked whenever you concatenate documents horizontally (such as in <>
, <+>
, hcat
, hsep
…).
beside (p1 `Union` p2) g q = beside p1 g q `union_` beside p2 g q
You can see that the arguments g
and q
are duplicated when beside
distributes itself over a Union
. And unions are also ubiquitous as they are introduced by sep
s (excerpt):
-- Specification: sep ps = oneLiner (hsep ps) `union` vcat ps ... sepNB g Empty k ys = oneLiner (nilBeside g (reduceDoc rest)) `mkUnion` nilAboveNest False k (reduceDoc (vcat ys))
So the normal forms of Doc
are quadratic and laziness is essential to make the pretty
DSL work.
Having a normalizer like NFData
is contrary to the whole design of this library.
NFData
instance for Doc
?The blameful NFData Doc
instance exists since 2014, while the library is much older than that, going back to articles by John Hughes from 1992 and 1995.
Originally, the library was internal to GHC, coded by Simon Peyton Jones in 1996.
It was first publicly released to Hackage by Ross Paterson in 2007.
The genesis of the NFData Doc
instance as recorded in issue #12 can be paraphrased as follows:
Contributor: “Could we have a NFData
instance? Could be useful.”
Maintainer: “Yeah, why not?”
This is not the original wording, but summarizes well the (publicly available) communication on the topic.
Let’s be fair. I can see myself in both roles. I could easily have been the contributor seeing a gap in the API and wanting to fill it. I could easily have been the maintainer seeing no fault in the proposal and waving it through.
We are confronted by a fundamental problem in software engineering here (amplified in the resource-starved Haskell open-source community): The original authors of the software leave, taking their deep understanding and unspoken dos and don’ts with them, and second generation maintainers have to take over without being equipped with the expertise and insights of the original authors.
In the case of pretty
s Doc
s, filling the gap of a missing NFData
instance must be considered a bad idea, in hindsight at least.
But I guess the original author would immediately have reacted with concerns to the proposal.
The main competitor on the pretty-printing market is likely prettyprinter
, an heir of the Wadler/Leijen pretty printer.
Neither of these equips their Doc
type with a NFData
instance (see prettyprinter
’s Doc
and wl-pprint
s Doc
).
In the former case, Doc
embeds functions, so normalization wouldn’t get far anyway.
The NFData
instance for pretty
s Doc
should be deprecated, warning the user of the unexpected fatal behavior.
I opened an issue suggesting this.
If one wanted a Doc
type that supports normalization without blowing up, one would need a deeper embedding, making all combinators that now duplicate documents into proper constructors of Doc
.
This would however be a major rewrite of the library, and there might be unforseen obstacles on this path.
Thanks to Amélia for finding the problem and outlining a solution.
Thanks to the developers of pretty
the other pretty-printing libraries, part of the army of unpaid volunteers in the Haskell community.
Write a comment by opening an issue in the issue tracker!