GutenMark
Usage Page
Attractively formatting Project Gutenberg texts
Contents
GutenMark Tutorial
Features
Wish List
Software Developers
GutenMark Tutorial
GutenMark is a command-line utility, so you have to use it from the Win32
"MS-DOS Prompt" or from the Linux/UNIX/BSD/MacOS-X command shell:
GutenMark [options] [inputfile [outputfile]]
For example,
GutenMark tomco10.txt tomco10.html
Other possibilities are to use the program in "filter" style:
GutenMark tomco10.txt >tomco10.html
or
GutenMark <tomco10.txt >tomco10.html
Because GutenMark is intended to be fully automatic, there are very
few command-line options:
Option | Description
--help | Displays a list of the available options.
--no-justify | Outputs paragraphs in ragged-right format. The default format is right-justified. This option is useful if the htmldoc utility is used to convert HTML to Postscript, because htmldoc is (or has been) buggy in regard to right-justification. Or, I guess, if you just prefer ragged-right text.
--no-mdash | By default, GutenMark replaces constructs like "--" with an mdash character. This looks better when printed, but most browsers do a very poor job of rendering mdashes, so the HTML looks better with the original dashes in place. The "--no-mdash" command-line option turns off the mdash conversion.
--yes-header | By default, GutenMark removes the Project Gutenberg file-header from the HTML output, in order to ensure conformance with PG requirements. The "--yes-header" command-line option causes the PG header to be retained. You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application. (Removal of the header is guaranteed to be legal.)
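As a rough illustration of the synopsis and options above, here is a minimal Python sketch. It is not GutenMark's actual source, which this page does not show. The function names build_parser and convert_dashes are hypothetical, and emitting the HTML entity "&mdash;" is an assumption about what "an mdash character" means in the output.

```python
import argparse


def build_parser():
    """Mirror the documented synopsis:
    GutenMark [options] [inputfile [outputfile]]
    """
    # argparse supplies --help automatically.
    parser = argparse.ArgumentParser(prog="GutenMark")
    parser.add_argument("--no-justify", action="store_true",
                        help="output ragged-right paragraphs")
    parser.add_argument("--no-mdash", action="store_true",
                        help='leave "--" in place')
    parser.add_argument("--yes-header", action="store_true",
                        help="retain the Project Gutenberg header")
    # Both file names are optional; a missing name would mean
    # standard input or standard output ("filter" style).
    parser.add_argument("inputfile", nargs="?", default=None)
    parser.add_argument("outputfile", nargs="?", default=None)
    return parser


def convert_dashes(text, no_mdash=False):
    """Replace "--" with an mdash unless --no-mdash was given.

    Assumption: the mdash is written as the HTML entity &mdash;.
    """
    if no_mdash:
        return text
    return text.replace("--", "&mdash;")
```

With this sketch, build_parser().parse_args(["--no-mdash", "tomco10.txt", "tomco10.html"]) yields no_mdash set and both file names filled in, while an empty argument list leaves both names as None.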
Another thing you might want to do is, of course, to make a hardcopy
of the book. You can do this by printing directly from your browser,
but the typical browser does not do a great job of making the HTML (however
well it has been created) print like a book. Several options are
available, such as loading the HTML into Microsoft Word, and printing it
from there. A better method is to use one of the freely available
HTML-to-Postscript conversion utilities to create a Postscript or PDF version
of the book. This is, perhaps, easier if you are a Linux/BSD user
than if you are a Windows user. To create the PDF
sample text on Linux, I used the free utility html2ps, along
with a custom configuration file named "html2psrc", which is available
for download.
Here is what the complete sequence of steps looked like, in Linux, for
converting the sample etext to PDF format:
GutenMark bldhb10.txt bldhb10.html
html2ps -f html2psrc bldhb10.html >bldhb10.ps
ps2pdf bldhb10.ps
Or, in Linux, we could simply have printed it rather than creating PDF,
by replacing the final command with
lpr bldhb10.ps
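If you convert etexts often, the sequence above can be scripted. The following is only a sketch in Python: the names pipeline_commands and run_pipeline are hypothetical, and it assumes that GutenMark, html2ps, and ps2pdf are all on the PATH and that html2psrc is in the current directory.

```python
import subprocess


def pipeline_commands(stem, config="html2psrc"):
    """Return (argv, stdout_path) pairs for the three documented steps."""
    return [
        (["GutenMark", f"{stem}.txt", f"{stem}.html"], None),
        # html2ps writes Postscript to standard output, so it is redirected.
        (["html2ps", "-f", config, f"{stem}.html"], f"{stem}.ps"),
        (["ps2pdf", f"{stem}.ps"], None),
    ]


def run_pipeline(stem):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in pipeline_commands(stem):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling run_pipeline("bldhb10") would issue the same three commands shown above. To print instead of creating a PDF, the last entry could be swapped for ["lpr", f"{stem}.ps"].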
Features
Here are some of the things GutenMark does:
- Tries to deduce the title and author.
- Identifies the Project Gutenberg "fine print" header and, by default, removes
it. At your option, it can also retain the header, but does not attempt
to reformat it. The header will appear in a fixed-width font, unlike
the remainder of the text.
- Usually, a PG etext will begin with items like title pages, tables of contents,
notes from the person who created the etext, and so forth. These
materials differ in format from etext to etext, and follow no obvious rules.
GutenMark tries to identify this section, which it entitles "Prefatory
Materials", and performs only minor reformatting on it.
- Adds "smart quotes".
- Adds headings to chapters, sections, etc.
- Identifies paragraphs, and joins together the lines of the paragraph, so
that word-wrapping can be used. Paragraphs are right-justified, by
default.
- Distinguishes word-wrapped areas from verse.
- PG etexts are highly inconsistent in their handling of italicized text.
Many etexts simply discard that information. Others mark italicized
text in some way, but that marking differs from etext to etext.
Here are some of the italicizing methods that GutenMark recognizes:
_italicized_, <i>italicized</i>, /italicized/, ~~italicized~~, <italicized>.
- GutenMark also automatically italicizes certain words like
"etc.", "viz.", "i.e.", and so on.
- Locates ends of sentences and colons, so that they can be followed
by two spaces rather than one. Automatically recognizes that honorifics
like "Mr. Smith" aren't ends of sentences, and that sentences may
be in quotations. It recognizes that constructs like "929 N. Durello"
are not the ends of sentences.
- Handles dangling hyphens at the ends of lines, so that they are not followed
by spurious spaces.
- Can usually mark up centered lines. (Though Project Gutenberg frowns
on centered text, a lot of folks use it anyhow.)
- There are no practical limitations in terms of file sizes.
- Only a minuscule subset of HTML is used, so the marked-up files should
have maximum portability.
- Traditionally, PG etexts have used so-called "7-bit" ASCII, but lately
a number of "8-bit" ASCII texts have shown up. These 8-bit files
more accurately represent the diacritical marks found in non-English texts.
For example, 'ü' in an 8-bit etext shows up merely as 'u' in a 7-bit
etext. GutenMark is able to handle both.
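To make a few of these features concrete, here is a small Python sketch of line joining (including dangling hyphens) and sentence-end spacing. It is an illustration only, not GutenMark's actual code. The names join_lines, is_sentence_end, and space_sentences are hypothetical, and the list of honorifics is an assumption, since this page does not say which abbreviations the program knows.

```python
import re

# Assumed abbreviation list; the real program's list is not documented here.
HONORIFICS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "St."}


def join_lines(lines):
    """Join the lines of one paragraph into a single string."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not text:
            text = line
        elif text.endswith("-"):
            text += line            # dangling hyphen: no spurious space
        else:
            text += " " + line
    return text


def is_sentence_end(word):
    """Guess whether a word ends a sentence (or ends with a colon)."""
    core = word.rstrip("\"')")      # sentences may end inside quotations
    if core.endswith(":"):
        return True
    if not core.endswith((".", "!", "?")):
        return False
    if core in HONORIFICS:
        return False
    # A single capital plus a period, as in "929 N. Durello", is an initial.
    if re.fullmatch(r"[A-Z]\.", core):
        return False
    return True


def space_sentences(text):
    """Use two spaces after sentence ends and colons, one space elsewhere."""
    words = text.split()
    pieces = []
    for i, word in enumerate(words):
        pieces.append(word)
        if i < len(words) - 1:
            pieces.append("  " if is_sentence_end(word) else " ")
    return "".join(pieces)
```

A heuristic this simple will misjudge some cases (a sentence that really does end in "I.", for instance), which is part of why this kind of processing is hard to get right.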
Wish List
Some of the items below represent things that are merely hard to accomplish,
whereas others are simply not possible because the information that would
be needed to accomplish them is not present in the PG files. But
I still can wish ...
- Most of the processing in GutenMark is actually devoted just to
determining the location of section headings and verse. Frankly,
in spite of this, it could still be improved a lot!
- The italicization style of all-caps: ITALICIZED. Consider the
examples EVIL, JOHN, and NASA. These should be rendered evil
(or perhaps Evil), John, and NASA. There may or may
not be clues in the text to determine which is which. A minimally
adequate approach would require a large spelling dictionary. Presently,
GutenMark just leaves them as all-caps.
- Restoration of Greek transliterated to Latin, back into Greek. In
some PG etexts, Greek text is simply discarded (and obviously cannot be
recovered). In other cases it has been transliterated to Latin characters,
but there are various schemes for doing so, and these are seldom specified.
Furthermore, the transliterated text is often not marked in any way as
being Greek.
- Removal of soft-hyphens. For example, suppose one line of the etext
ended with "soft-", and the next line began with "hyphen". Should
this be treated as "soft-hyphen" or as "softhyphen"? There's no easy
way to know by contextual clues, so a solution would probably require a
large spelling dictionary.
- Restoration of missing diacritical marks. For example, if we encounter
the word "role" in a 7-bit ASCII text, should it be converted to "rôle"?
- Footnotes/endnotes. Innumerable footnote/endnote styles appear in
PG etexts. Sometimes footnotes are just discarded. Sometimes
they are embedded directly in the text. Sometimes they appear at
the ends of paragraphs. Sometimes at the ends of chapters.
Sometimes at the end of the book. When they do appear, their markings
are highly inconsistent. Sometimes they're enclosed in brackets.
Sometimes they're marked with "*", "**", etc. Sometimes with numbers,
like "[53]" or "[FN#53]" or "{#53}". (I could continue, but you get
the idea.)
- Use of "-" where "--" was actually intended.
- Dealing with things like "right-" when appearing at the end of the line,
as (for example) in the phrase "this happens with both the right- and left-hand
versions." GutenMark would treat this as "this happens with
both the right-and left-hand versions."
- Tabular data. GutenMark actually makes some attempt
to detect tables, and when it does so it renders them in a fixed-width
font that allows the columns to line up. However, it could do a much
better job of detecting tables, and it could render them as actual HTML
tables.
- Double-column verse.
- Attributions. By this, I mean quotes which are set off from the surrounding
text, and which are followed by the author's name (which is supposed to
be at the far right of the quotation).
- Spacing in verse or dramatic scripts. Verse and scripts (like plays)
are depicted in a variable-width font, and this may result in incorrect
alignment among successive lines. Consider the following example,
that might appear in a play, in which several characters respond simultaneously
to another character:
                           ( Nonsense!
                           | You can't be serious!
I've decided to leave you! { What!
                           | You don't have the nerve!
                           ( That's crazy talk!
The intention of the person creating the etext was clearly that a single
large left-hand brace should precede the text at the right. GutenMark,
however, will not only not add a large brace, but will
jumble up the spacing so that it doesn't even look as good as it does here.
- Illustrations. Well, PG etexts don't have illustrations. But
still ...
- Bullets. I haven't seen many bullets in PG etexts, but I'm sure GutenMark
won't handle them.
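Two of the hyphen-related wishes above (soft-hyphens, and "right-" at the end of a line) could be approached along the following lines. This is only a Python sketch of the dictionary idea, not something GutenMark does. The function name resolve_line_end_hyphen, the CONNECTIVES list, and the toy WORDS set (standing in for the large spelling dictionary) are all assumptions made for illustration.

```python
# Words after which a line-end hyphen is probably a suspended hyphen,
# as in "right- and left-hand". This short list is an assumption.
CONNECTIVES = {"and", "or"}


def resolve_line_end_hyphen(left, right, dictionary):
    """Join 'left-' (end of one line) with 'right' (start of the next)."""
    if right.lower() in CONNECTIVES:
        return left + "- " + right      # keep both the hyphen and the space
    hyphenated = left + "-" + right
    joined = left + right
    if hyphenated.lower() in dictionary:
        return hyphenated               # a known hyphenated compound
    if joined.lower() in dictionary:
        return joined                   # the hyphen was only a line break
    return hyphenated                   # unknown: leave the hyphen in place


# Toy word list standing in for a large spelling dictionary.
WORDS = {"wonderful", "soft", "hyphen", "right", "left-hand"}
```

For example, resolve_line_end_hyphen("won", "derful", WORDS) gives "wonderful", while ("soft", "hyphen") falls through to "soft-hyphen" because "softhyphen" is not in the word list. The quality of the result depends entirely on the dictionary, which is exactly the difficulty described above.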
In general, the closer the etext conforms to PG guidelines, the better
GutenMark can handle it.
Software Developers
At the time of writing, I don't know whether anyone will want
to contribute features or bug-fixes to GutenMark, so I haven't really
provided any way to do it. If you want to do so, I'd suggest sending
the changes directly to me.
Oh, and I know that the code isn't very pretty. I was really just
throwing together a proof-of-concept, and it started being useful much
more quickly than I thought it would. Perhaps I'll pretty it up later.
Last updated 11/13/01 by RSB.