From troff to HTML5

Writing a parser to convert troff man pages to HTML5

By Jackson Pauls

ManKier is now about 3 years old, but its core has changed substantially in the last year.

Originally, HTML man pages were created using doclifter, which outputs DocBook XML that was then converted to the HTML used on mankier.com. This toolchain worked to some extent, but its limitations and corner cases kept growing, and workarounds started to pile up. I tried working with the doclifter source, but it is all one massive file, and even after splitting it up into modules I struggled to properly grok it. Finally, doclifter outputs DocBook XML, and I needed HTML.

So I wrote a troff parser.

Sure, I could just run something like groff -m mandoc -Thtml ls.1 > ls.html, but that just translates presentational troff macros into a rough HTML equivalent. The great thing doclifter does is translate the presentational troff into semantic markup. Semantic markup, as well as being better for accessibility, makes it possible to do more with the renderings. For example, if we can identify an option definition, we can make that option an anchor (e.g. rg -F), and provide a way to quickly explain what a bunch of options do.
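To make the anchor idea concrete, here is a minimal Python sketch. The function name option_term_html, the OPTION pattern, and the scheme of using the first option as the element id are all assumptions made for illustration, not a description of how mankier.com builds its anchors.

```python
import html
import re

# Hypothetical pattern: one or two dashes followed by an option name.
OPTION = re.compile(r"--?[A-Za-z0-9][A-Za-z0-9-]*")

def option_term_html(term):
    # Wrap a plain-text option term in a <dt>; if it contains an option,
    # use the first one as the id so the definition can be linked to.
    options = OPTION.findall(term)
    escaped = html.escape(term)
    if not options:
        return f"<dt>{escaped}</dt>"
    anchor = html.escape(options[0], quote=True)
    return f'<dt id="{anchor}"><a href="#{anchor}">{escaped}</a></dt>'

print(option_term_html("-F, --fixed-strings"))
```

The pattern is deliberately naive: it would also pick up the tail of a hyphenated word, so a real parser would want to know that it is looking at an option list before applying it.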

Writing a troff parser has some easy wins, as most man pages use the man macros. For instance:

.TH LS 1
.SH OPTIONS
.TP
\fB\-a\fR, \fB\-\-all\fR
do them all!
...

translates nicely to this HTML5:

<title>LS.1</title>
<main>
  <h1>LS.1</h1>
  <section>
    <h2>OPTIONS</h2>
    <dl>
      <dt><strong>-a</strong>, <strong>--all</strong></dt>
      <dd>do them all!</dd>
      ...
    </dl>
  </section>
</main>
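A rough Python sketch of that easy case follows. It is not ManKier's parser: the names inline, man_to_html and FONT_TAGS are made up for this example, it only understands .TH, .SH, .TP, the \fB, \fI, \fR and \fP font escapes and \-, and it silently ignores every other request.

```python
import html
import re

# Illustrative mapping from troff font names to HTML elements.
FONT_TAGS = {"B": "strong", "I": "em"}

def inline(text):
    # Convert font escapes and the \- escape in one line of text to HTML.
    out, open_tag = [], None
    # re.split with a capture group yields text, font letter, text, ...
    for i, part in enumerate(re.split(r"\\f([BIRP])", text)):
        if i % 2 == 0:
            out.append(html.escape(part.replace("\\-", "-")))
            continue
        if open_tag:
            out.append(f"</{open_tag}>")
        open_tag = FONT_TAGS.get(part)  # R and P map to no element
        if open_tag:
            out.append(f"<{open_tag}>")
    if open_tag:
        out.append(f"</{open_tag}>")
    return "".join(out)

def man_to_html(source):
    out, desc = [], []          # desc collects the lines of the current <dd>
    in_section = in_dl = expect_term = False

    def close_dd():
        if desc:
            out.append(f"<dd>{' '.join(desc)}</dd>")
            desc.clear()

    def close_dl():
        nonlocal in_dl
        close_dd()
        if in_dl:
            out.append("</dl>")
            in_dl = False

    for line in source.splitlines():
        if line.startswith(".TH "):
            name, section = line.split()[1:3]
            title = html.escape(f"{name}.{section}")
            out += [f"<title>{title}</title>", "<main>", f"<h1>{title}</h1>"]
        elif line.startswith(".SH "):
            close_dl()
            if in_section:
                out.append("</section>")
            in_section = True
            out += ["<section>", f"<h2>{inline(line[4:])}</h2>"]
        elif line.startswith(".TP"):
            close_dd()
            if not in_dl:
                out.append("<dl>")
                in_dl = True
            expect_term = True   # the next text line is the term
        elif line.startswith("."):
            continue             # any other request is ignored in this sketch
        elif expect_term:
            out.append(f"<dt>{inline(line)}</dt>")
            expect_term = False
        elif in_dl:
            desc.append(inline(line))
        else:
            out.append(f"<p>{inline(line)}</p>")
    close_dl()
    if in_section:
        out.append("</section>")
    out.append("</main>")
    return "\n".join(out)

page = "\n".join([".TH LS 1", ".SH OPTIONS", ".TP",
                  r"\fB\-a\fR, \fB\-\-all\fR", "do them all!"])
print(man_to_html(page))
```

Run on the five-line page above, this prints the same elements as the HTML shown, without the indentation. The point of the sketch is that .TP carries enough structure to produce a definition list directly, which is what makes the man macros an easy win.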

But troff has been around for decades, and can also get pretty gnarly. wireshark(1), for instance, does this:

.ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.ds #V .6m
.ds #F 0
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
Jo\*:rg Mayer

All of that is just to write “Jörg Mayer”. I think somebody had a lot of fun with that one. I should perhaps have paid more heed to this comment before trying to disentangle it:

.\" Fear. Run. Save yourself. No user-serviceable parts.

Fortunately those pages also give an alternate string definition “for low resolution devices”, which just gives us “Joerg Mayer”...
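The string mechanism behind that fallback is simple enough to sketch in Python. This is a simplified illustration, not ManKier's code: expand_strings and STRING_REF are made-up names, the definition ".ds : e" is my assumption about what the low-resolution variant looks like, and the sketch ignores conditionals, long string names and escapes inside the definition.

```python
import re

# Matches \*(xx (two-character name) or \*x (one-character name).
STRING_REF = re.compile(r"\\\*(?:\((..)|(.))")

def expand_strings(source):
    # Record .ds definitions and interpolate string references in text lines.
    strings = {}
    out = []
    for line in source.splitlines():
        if line.startswith(".ds "):
            parts = line.split(None, 2)
            # A .ds with no value defines an empty string.
            strings[parts[1]] = parts[2] if len(parts) > 2 else ""
            continue
        # Undefined strings expand to nothing in this sketch.
        out.append(STRING_REF.sub(
            lambda m: strings.get(m.group(1) or m.group(2), ""), line))
    return "\n".join(out)

# Assumed low-resolution definition of the ":" string.
print(expand_strings(".ds : e\n" + r"Jo\*:rg Mayer"))
```

With that assumed definition the name comes out as “Joerg Mayer”. The high-resolution definition is harder because its value is itself full of motion escapes that a converter has to either interpret or recognise as a known accent idiom.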

And if you like puzzles, check out the recursive .Xf macro defined in rc(1):

.if t .ds Cf C
.if n .ds Cf R
.\" Rc - Alternate Roman and Courier
.de Rc
.Xf R \\*(Cf \& "\\$1" "\\$2" "\\$3" "\\$4" "\\$5" "\\$6"
..
.de Xf
.ds Xi
.if "\\$1"I" .if !"\\$5"" .ds Xi \^
.if !"\\$4"" .Xf \\$2 \\$1 "\\$3\\f\\$1\\$4\\*(Xi" "\\$5" "\\$6" "\\$7" "\\$8" "\\$9"
.if "\\$4"" \\$3\fR\s10
..
.Rc ( & ).

This will simply output “(&)”, with the ampersand set in Courier if your formatter is troff or in Roman if it is nroff.
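For anyone who would rather not trace the recursion by hand, here is a Python transliteration of the two macros as I read them. The function names xf and rc are mine, and the sketch returns the troff text that the base case would emit instead of rendering it.

```python
def xf(args):
    # Mimic .Xf: $1 and $2 are the two fonts, $3 the text built so far,
    # $4 onwards the words still to be added.
    a = list(args) + [""] * (9 - len(args))   # missing arguments are empty
    f1, f2, acc, word = a[0], a[1], a[2], a[3]
    if word == "":
        # Base case: emit the accumulated text and reset font and size.
        return acc + "\\fR\\s10"
    # Italic correction only after an italic word that is followed by another.
    xi = "\\^" if (f1 == "I" and a[4] != "") else ""
    # Recurse with the fonts swapped and the next word appended in font f1.
    return xf([f2, f1, acc + "\\f" + f1 + word + xi] + a[4:9])

def rc(cf, *words):
    # Mimic .Rc: alternate Roman with the font held in the Cf string.
    return xf(["R", cf, "\\&"] + list(words[:6]))

print(rc("C", "(", "&", ")."))   # troff: Cf is C
print(rc("R", "(", "&", ")."))   # nroff: Cf is R
```

Each call consumes one word and swaps the two fonts, so the words come out alternating between Roman and whatever Cf holds. For the call above, the only difference between the two results is the font selected just before the ampersand.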

If you want to find out more, groff(7) is the “short” reference which can elucidate the two gnarly examples above.

ManKier's parser can now generate HTML5 for 99.98% of man pages. If you spot any pages that render horribly on mankier.com, do drop me an email. Getting this far has certainly been challenging, but also of course rather rewarding. Next step is to get that up to 100% — there are still a handful of esoteric troff requests to implement!