The core of the system is html2man.rb, a Ruby utility script that converts HTML pages to nroff/troff* man pages. The utility has been in use for several years, operating on files that were at times 80 pages long. (There were 5 distinct formats in those files: some had TOCs and some didn't, and there was a variety of header styles, for example. So the tool is nothing if not robust.)
Wrapped around that utility is a Rake script that does the builds, along with some convenience scripts:
- html2man, which sets environment variables and invokes html2man.rb,
- man, which (once configured) uses the local man page processor to display the results, and
- zap, which removes an output file so that running the script causes it to be regenerated.
* The file format is "roff", but nobody calls it that. The program that converts into PostScript and the like for sophisticated display devices is troff. The program that converts for character-display devices is nroff. Since man pages are converted into a character-display format, the files are generally called "nroff files".
Standard DITA Open Toolkit (DITA-OT) processing for man pages has been reported to be problematic in several respects:
- Generated nroff files have an initial blank line and are missing the required initial header, both of which prevent a man page processor from rendering them properly.
- Not all reference-topic elements show up in the output.
- Tables are not supported (put more bluntly, tables are ignored). This matters most when tables provide what is, in effect, a third level of nesting, since the standard man page format only has an H1 title and H2 heads.
- Output files have a .cli extension.
- Section headers are duplicated.
- Multiple newlines at end of file cause the page to scroll out of view when displayed.
- Cross-references fail.
If the DITA-OT is customized to overcome such problems, the customizations must be carried forward each time the OT is updated, unless the XSLT customizations are contributed back to the toolkit, which is a very good idea. But, to date, that hasn't been done, so everyone generating man pages faces the same set of issues.
In addition, the man page processor uses a language (Ruby) that is expressive enough to implement the column-weighting heuristics needed for tables, as described in the Issues section below.
- Incoming HTML pages must adhere to standard man page format with H2 section heads, at most one H1 title, and no H3 or lower subheads. (The project includes an HTML template and test pages that can be used as a guide for formatting.)
- Man pages often have a small table of hieroglyphics at the end with pointers to obscure modules. (I'm sure they make sense to someone, but they're Greek to me.) Such man-page-specific tables, if needed, must be handled in the DITA processing. (Normal web pages generated from the DITA files would not include those tables, but output generated for man pages should.) The generation process needs to use conditional metadata or a composition strategy to take that output difference into account.
- With very small modifications, the program can be made to change some of its default behaviors:
- Copyright date is expected to be found in a footer table at the end of the HTML file. The program looks for a cell with the word "Copyright" in it, and the date is extracted from that cell.
- The copyright owner is defined in the write_header method.
- Similarly, any special license text (if needed) is defined in the write_header method.
- By default, the program expects to operate on clean XHTML, as produced by the DITA-OT. To operate on manually edited HTML files, run them through an instance of tidy (available from the W3C) to clean them up first. Alternatively, the path to the tidy utility can be specified in the program (constant=TIDY). When the program is invoked with the -c option, that utility is run before the file is processed.
- In this post, Don Day provides two good references for man page formats, including the Cover pages and the DocBook refentry markup.
- ToDo: Add entries for the table-macro processor and troff reference manuals that I found on the web, once upon a time.
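The copyright lookup described in the default behaviors above (find the footer cell containing the word "Copyright", then pull the date out of it) might look roughly like the sketch below. It is not taken from html2man.rb: the helper name `copyright_year`, the regular-expression scan of `<td>` cells, and the assumption that the date is a four-digit year are all illustrative guesses.

```ruby
# Hypothetical helper, not the actual html2man.rb code.
# Returns the first four-digit year found in the last table cell that
# mentions "Copyright", or nil if there is no such cell.
def copyright_year(html)
  # Collect the inner markup of every <td> cell in document order.
  cells = html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten

  # The footer table is at the end of the file, so search from the back.
  cell = cells.reverse.find { |c| c =~ /copyright/i }
  return nil unless cell

  # Drop any nested tags, then look for something shaped like a year.
  text = cell.gsub(/<[^>]+>/, " ")
  text[/\b(?:19|20)\d{2}\b/]
end
```

A real implementation would probably use whatever parser the program already relies on rather than regular expressions. The scan above is only reasonable because the input is expected to be clean XHTML.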
The existing html2man wrapper script was written to run on Solaris. Environment variables like RUBY_HOME will need to be specified, but the script should run pretty much without change on any Unix-heritage system, including Solaris, Linux, and OS X. A similar version is needed for Windows systems, since that environment is so radically different.
Currently, the standard troff table-macro processor is used to generate tables. That processor makes it easy to define tables, but it produces generally horrific results. (Unexpectedly, given the length of time it has been in existence.) It tends to make excessively wide columns for cells with little or nothing in them, while other columns with a lot of information in the cells are too narrow. And it frequently makes tables that are 120 columns wide, or wider, which isn't very helpful for man pages.
What's needed is a smart table processor that does "column weighting" such that:
- The columns with the most text get most width.
- Every column gets the minimum width it needs.
- The maximum table width is 80 columns.
- Column weighting is done with a two-pass algorithm, so all of the cells in the column are inspected to find out how much data the column contains. (The standard table-macro processor appears to base its decision on the first row of cells, which is frequently misleading.)
In outline, the processing should work like this:
- Identify the maximum width for each column.
- Identify maximum table width (=sum of max. column widths plus boundary spaces and lines).
- If maximum table width is <= (80 - margin_indent), output the table in the normal indented position, using maximum column widths.
- Determine the minimum width for each column by finding out where each cell in it can be broken, breaking on spaces if possible, or hyphens or slashes, if available.
- If sum of minimum column widths (+boundaries) > (80 - margin_indent), start the table in column 1 and output using those widths.
- Otherwise, start the table in the normal indented position and weight the columns:
  - Sum the number of characters in all of the column's cells to get the column's weight.
  - Take column-spanning cells into account. (The "weight" is the number of characters in the cell. For a cell that spans multiple columns, the weight is divided among those columns.)
  - Identify the amount of available space = (80 - margin_indent) - (sum of minimum column widths + boundaries).
  - Set each column width to the minimum value for that column.
  - Allocate remaining space proportionally to the columns, based on their weighting.
- Output the results as nroff text.
- These are design notes, captured here for convenience. Once implemented, they need to move into the source file, leaving only the summary behind ("Smart table-generation").
- These notes reflect thinking that went into a program I wrote once upon a time, to display HTML tables on ASCII-character terminals in government offices. So I know the strategy works. (And now that I've recovered the design, I'm confident that the algorithms could be implemented in a couple of weeks.)
- The existing code that uses table macros should be saved, in case an html2troff processor ever seems like a good idea. It may be that the table-macros work just great when processed by troff, rather than by nroff.
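The processing outline above could be sketched in Ruby roughly as follows. This is not code from html2man.rb: the names `column_widths` and `min_cell_width`, the 7-column `margin_indent`, and the single-space column separator are assumptions made for illustration. Column-spanning cells are left out to keep the sketch short, and the input is assumed to be a non-empty array of rows.

```ruby
PAGE_WIDTH = 80 # maximum table width from the design notes

# Narrowest usable width for a cell: its longest unbreakable chunk.
# Breaks are allowed after spaces, hyphens, and slashes.
def min_cell_width(text)
  text.split(%r{(?<=[\s/\-])}).map { |chunk| chunk.rstrip.length }.max || 0
end

# rows: array of rows, each an array of cell strings (no spanning cells).
# margin_indent and separator are assumed values, not taken from html2man.rb.
def column_widths(rows, margin_indent: 7, separator: 1)
  ncols = rows.map(&:length).max
  cols = (0...ncols).map { |i| rows.map { |row| row[i].to_s } }
  boundaries = separator * (ncols - 1)
  usable = PAGE_WIDTH - margin_indent

  # If every cell fits on one line, use the maximum widths as they are.
  max_w = cols.map { |col| col.map(&:length).max }
  return { indent: margin_indent, widths: max_w } if max_w.sum + boundaries <= usable

  # If even the minimum widths are too wide, start the table in column 1.
  min_w = cols.map { |col| col.map { |cell| min_cell_width(cell) }.max }
  return { indent: 0, widths: min_w } if min_w.sum + boundaries > usable

  # Otherwise give each column its minimum, then share out the spare
  # space in proportion to the number of characters in each column.
  weights = cols.map { |col| col.sum(&:length) }
  spare = usable - (min_w.sum + boundaries)
  shares = weights.map { |w| spare * w / weights.sum }
  shares[weights.index(weights.max)] += spare - shares.sum # rounding leftover
  { indent: margin_indent, widths: min_w.zip(shares).map(&:sum) }
end
```

For a small table such as `[["a", "bb"], ["ccc", "d"]]` the first branch applies and the maximum widths (3 and 2) are used at the normal indent. Two further steps from the outline are still missing here: dividing a spanning cell's weight among its columns, and emitting the result as nroff text. A fuller version might also cap each column at its maximum width so that spare space is not wasted.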
The tool was previously running on HTML files. So when files were edited, Rake's date comparisons told it which source files had changed. Rake then did "minimal builds", only regenerating man pages that had actually changed.
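As a rough illustration of the date comparison involved, the check that a Rake file task makes can be written in plain Ruby as below. This is a sketch rather than Rake's own code, and the helper name `out_of_date?` is made up for the example: a target is rebuilt when it does not exist yet or when any of its sources has a later modification time.

```ruby
# Hypothetical helper mirroring the timestamp test behind "minimal builds".
# target:  path to a generated man page
# sources: paths to the HTML files it is generated from
def out_of_date?(target, sources)
  # A missing output file always needs to be generated.
  return true unless File.exist?(target)

  # Otherwise rebuild only if some source was modified after the target.
  target_time = File.mtime(target)
  sources.any? { |src| File.mtime(src) > target_time }
end
```

The test looks only at timestamps, never at file contents, which is why a tool that rewrites every source file defeats it.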
But when the DITA-OT runs, it regenerates everything. When Rake compares file dates to see what needs to be processed, every file is "newer", so every file gets converted, even if no content has actually changed. For greater efficiency, it would be nice to convert only those files that have actually changed. One way to do that is to generate files to a separate location and use rsync (with the right options) or a script that copies files to the source file hierarchy only if a diff utility shows real changes. Then Rake's date-comparisons will be valid. Another option is to extend Rake, creating a version of the task method that looks for real changes using file differencing (fdiff_task?), rather than using date comparisons.
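The first option (generate to a separate location, then copy across only the files whose content really differs) could be sketched as follows. The function name `sync_changed` and both directory parameters are hypothetical. `FileUtils.compare_file` does a byte-for-byte comparison, and `FileUtils.cp` gives the copy a fresh modification time, so only files that really changed look newer to Rake.

```ruby
require "fileutils"

# Hypothetical helper: copy files from staging_dir into source_dir only
# when they are new or their content differs. Returns the relative paths
# of the files that were copied.
def sync_changed(staging_dir, source_dir)
  copied = []
  Dir.glob("**/*", base: staging_dir).sort.each do |rel|
    from = File.join(staging_dir, rel)
    next unless File.file?(from) # skip directories

    to = File.join(source_dir, rel)
    # Identical content: leave the existing file and its timestamp alone.
    next if File.exist?(to) && FileUtils.compare_file(from, to)

    FileUtils.mkdir_p(File.dirname(to))
    FileUtils.cp(from, to)
    copied << rel
  end
  copied
end
```

Run before the Rake build, a step like this leaves unchanged files with their old timestamps, so the existing date comparisons pick up only the real edits. It does not delete files that have disappeared from the staging directory. That case would need separate handling.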