GutenMark
Usage Page
Attractively formatting Project Gutenberg texts
Contents
GutenMark Tutorial
Manual Tweaking
Features
Wish List
Software Developers
GutenMark Tutorial
GutenMark is a command-line utility, so you have to use it from the Win32
"MS-DOS Prompt" or from the Linux/UNIX/BSD/MacOS-X command shell:
GutenMark [options] [inputfile [outputfile]]
For example,
GutenMark tomco10.txt tomco10.html
Other possibilities are to use the program in "filter" style:
GutenMark tomco10.txt > tomco10.html
or
GutenMark < tomco10.txt > tomco10.html
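When many etexts need converting, the command line above can be assembled programmatically. The following is only a sketch in Python: the helper name `gutenmark_argv` is hypothetical, and it assumes the executable is called `GutenMark` and that the output should sit next to the input with an `.html` extension. It builds the argument list without running anything.

```python
from pathlib import Path


def gutenmark_argv(input_file, options=()):
    """Build an argument list of the form: GutenMark [options] inputfile outputfile.

    Hypothetical helper: the output name is derived by swapping the
    input file's extension for ".html".
    """
    output_file = Path(input_file).with_suffix(".html")
    return ["GutenMark", *options, str(input_file), str(output_file)]


# Show the commands that would be run for two etexts.
for name in ["tomco10.txt", "bldhb10.txt"]:
    print(" ".join(gutenmark_argv(name)))
```

Each list can be handed to `subprocess.run(argv, check=True)` if GutenMark is installed and on the search path.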
GutenMark is intended to be fully automatic, but there are a few
command-line options:
| Option | Description |
|---|---|
| --debug | Creates a log file, GutenMark.log, from which certain internal operations of GutenMark can be examined. It also causes the files GutenMark.native and GutenMark.foreign to be created; these are wordlists containing only the words that actually appear in the source file. These supplementary output files are useful only for developers. |
| --help | Displays a list of the available options. |
| --latex | As an alternative to creating HTML output, there is an experimental patch (thanks to Joe Cherry!) for creating LaTeX output instead. This is still quite buggy, but may produce some interesting results. |
| --no-diacritical | By default, GutenMark restores diacritical marks in words for which there is no native equivalent without diacritical marks. For example, suppose the word "Fraulein" appears in an English-language etext. This is not an English word. In fact, it is not a word in any language. The correct (German) word is "Fräulein". This is a systematic problem that appears throughout almost all PG etexts. GutenMark will notice this kind of thing and try to restore the word to its proper form. (This is a separate issue from italicizing the word as foreign -- see --no-foreign below.) You can turn this feature off with the '--no-diacritical' command-line switch. |
| --no-foreign | By default, GutenMark attempts to italicize foreign words -- i.e., words not in the native language of the etext. The '--no-foreign' command-line switch turns this feature off. |
| --no-justify | Outputs paragraphs in ragged-right format. The default format is right justified. This option is useful if the htmldoc utility is used to convert HTML to Postscript, because htmldoc is (or has been) buggy in regard to right justification. Or, I guess, if you just prefer ragged-right text. |
| --no-mdash | By default, GutenMark replaces constructs like "--" with an mdash character. This looks better when printed, but most browsers do a very poor job of rendering mdashes, so that HTML looks better with the original dashes in place. The "--no-mdash" command-line option turns off the mdash conversion. |
| --profile=name | GutenMark uses wordlists and namelists to help it perform various tasks (such as identifying which words are in the native language of the etext and which are foreign). A configuration file, GutenMark.cfg, lists the wordlists and defines their search ordering and native/foreign status. The configuration file can contain multiple named profiles, perhaps representing different native languages. The default profile is named 'english', but alternate profiles can be selected using the '--profile' command-line option. If the specified profile is not found in the configuration file, GutenMark uses all wordlists and namelists it can find, in the following order: namelist for the name language, all other namelists, wordlist for the name language, all other wordlists (where name is the value given to --profile). Note that using all wordlists and namelists can be quite time consuming, so defining a custom profile is generally a better idea. The configuration file, as distributed, contains profiles "english" (using a small set of wordlists), "none" (using no wordlists), and "english_all" (using all wordlists). |
| --yes-header | By default, GutenMark removes Project Gutenberg's file header from the HTML output, in order to ensure conformance with PG requirements. The "--yes-header" command-line option causes the PG header to be retained. You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application. (Removal of the header is guaranteed to be legal.) |
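The rule described for --no-diacritical (restore a mark only when the bare word is not a native word) can be illustrated with a short Python sketch. This is not GutenMark's actual code: the function names and the tiny wordlists are assumptions made up for the example, and the real program reads its wordlists from the files named in GutenMark.cfg.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, e.g. turn an accented word into its 7-bit look-alike."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restoration_table(foreign_words):
    """Map each bare (mark-free) lower-case form to its accented lower-case form."""
    table = {}
    for word in foreign_words:
        bare = strip_diacritics(word)
        if bare != word:
            table.setdefault(bare.lower(), word.lower())
    return table


def restore(word, native_words, table):
    """Restore diacritical marks unless the bare word is itself a native word."""
    if word.lower() in native_words:
        return word
    restored = table.get(word.lower())
    if restored is None:
        return word
    # Carry an initial capital over to the restored form.
    if word[0].isupper():
        restored = restored[0].upper() + restored[1:]
    return restored


# Toy wordlists, invented for this illustration only.
native = {"the", "young", "lady"}
table = build_restoration_table(["Fräulein", "über"])

print(restore("Fraulein", native, table))
print(restore("lady", native, table))
```

With a native wordlist in place, ordinary native words pass through untouched and only words with a known accented counterpart are changed.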
Another thing you might want to do, of course, is to make a hardcopy
of the reformatted etext. You can do this by printing directly from
your browser, but the typical browser does not do a great job of making
the HTML (however well it has been created) print like a book. Several
options are available, such as loading the HTML into Microsoft Word, and
printing it from there. A better method is to use one of the freely
available HTML-to-Postscript conversion utilities to create a Postscript
or PDF version of the book. This is, perhaps, easier if you are a
Linux/BSD user than if you are a Windows user. To create the various
PDF samples that appear on this website, I used (on Linux) the free utility
html2ps, along with various custom configuration files that you can get from my
download page.
Here is what the complete sequence of steps looked like, in Linux, for
converting the sample etext to PDF format:
# Create HTML from the PG etext.
GutenMark bldhb10.txt bldhb10.html
# Create 8.5"x5.5" Postscript, hyphenated, from the HTML.
html2ps -H -f half12schoolbook bldhb10.html > bldhb10.ps
# Create PDF from the Postscript.
ps2pdf bldhb10.ps
Or, in Linux, we could simply have printed it rather than creating PDF,
by replacing the final command with
# Print the Postscript file.
lpr bldhb10.ps
Another interesting thing you can do is to print in booklet format -- two
pages on the front and two pages on the back of standard letter-sized paper,
with the pages reordered so the whole mess can be folded or cut into half-letter
sized pages. This can be done with the freely available PSUtils
tools. In Linux, you'd replace the ps2pdf step with this:
# Form the Postscript pages into a "signature":
psbook bldhb10.ps signature.ps
# Combine the pages 2-up.
pstops "2:0L@1.0(8.5in,0)+1L@1.0(8.5in,5.5in)" signature.ps booklet.ps
# Pull off the odd-numbered 2-up pages, in reverse order.
psselect -o -r booklet.ps frontsides.ps
# Pull off the even-numbered 2-up pages, in normal order.
psselect -e booklet.ps backsides.ps
# Print it.
lpr frontsides.ps
# ... feed the paper back into the printer ...
lpr backsides.ps
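To see why the pages have to be reordered before they are combined 2-up, it helps to work out the order for a small book. The Python sketch below computes the page order for a single folded signature; it is an illustration of the idea under the assumption that all pages go into one signature, not a reimplementation of psbook, and the function name is hypothetical.

```python
def signature_order(page_count):
    """Return the 1-based page order for one folded signature.

    The page count is padded up to a multiple of 4 (one folded sheet
    carries four pages); None marks a blank padding page.
    """
    padded = -(-page_count // 4) * 4  # round up to a multiple of 4
    order = []
    low, high = 1, padded
    while low < high:
        # Front of a sheet: last remaining page, then first remaining page.
        order += [high, low]
        # Back of the same sheet: next page, then next-to-last page.
        order += [low + 1, high - 1]
        low += 2
        high -= 2
    return [page if page <= page_count else None for page in order]


print(signature_order(8))  # [8, 1, 2, 7, 6, 3, 4, 5]
```

For an 8-page book the first sheet carries pages 8 and 1 on the front and 2 and 7 on the back, so that folding the stack in half puts the pages in reading order.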
With GutenMark's "--latex" command-line switch, you also have
the possibility of printing or converting etext using LaTeX. I'll
post an explanation of that here when I understand the possibilities better
myself.
Manual Tweaking
GutenMark aims to provide a completely automatic system for formatting
Project Gutenberg etexts. At the same time, GutenMark is a very new
program and is not perfect. Consequently, depending
on your purpose in creating the formatted texts, you may want to improve
the results with some manual tweaking of the HTML. Generally, this
will be a matter of scanning through the HTML in a WYSIWYG editor
(such as Netscape Composer or Microsoft Word) and quickly fixing the things
that seem most objectionable to you.
With that in mind, here's a list of things that I find objectionable
in GutenMark HTML output, roughly in descending order of importance.
I would hazard a guess that only the first two items are truly objectionable
to most people.
-
GutenMark does not produce a title page, copyright notice, etc.
-
GutenMark is not perfect at deducing section headings. The
most common problem is lines that are falsely marked as headings when they
are actually normal text. This does not happen in most documents,
but does happen in some documents.
-
GutenMark is not perfect at distinguishing between prose and verse.
This can result in verse that is falsely formatted as a justified paragraph
or, more commonly, as a ragged-right prose paragraph with shorter-than-average
lines. This commonly happens only a few times within a document,
and is often not noticeable to the average reader.
-
GutenMark is not perfect at distinguishing between native-language
text and foreign text. This commonly manifests itself either as proper
names that are incorrectly identified as foreign words (and hence are italicized),
or else as individual words in foreign phrases that are not identified
as being foreign. The latter problem results in occasional multi-word
italicized foreign phrases having a few words that are not italicized.
Features
Here are some of the things GutenMark does:
-
Tries to deduce the title and author.
-
Identifies the Project Gutenberg "fine print" header and, by default, removes
it. At your option, it can also retain the header, but does not attempt
to reformat it. The header will appear in a fixed-width font, unlike
the remainder of the text.
-
Usually, a PG etext will begin with items like title pages, tables of contents,
notes from the person who created the etext, and so forth. These
materials differ in format from etext to etext, and follow no obvious rules.
GutenMark tries to identify this section, which it entitles "Prefatory Materials",
and performs only minor reformatting on it.
-
Adds "smart quotes".
-
Adds headings to chapters, sections, etc.
-
Identifies paragraphs, and joins together the lines of the paragraph, so
that word wrapping can be used. Paragraphs are right justified, by
default.
-
Distinguishes word-wrapped areas from verse.
-
PG etexts are highly inconsistent in their handling of italicized text.
Many etexts simply discard that information. Others mark italicized
text in some ways, but that marking differs from etext to etext.
Here are some of the italicizing methods that GutenMark recognizes
and handles: _italicized_, <i>italicized</i>, /italicized/,
~~italicized~~, <italicized>, ITALICIZED.
-
GutenMark automatically italicizes certain words like "etc.",
"viz.", "i.e.", and so on. When wordlists
are used, it by default italicizes all words which it can identify as being
in a foreign language -- i.e., a language other than the native language
of the etext.
-
When wordlists with built-in soft-hyphens are used (presently, only the
Norwegian wordlist), text can be automatically hyphenated when (or if)
HTML is converted to Postscript. Or, post-processing software (like
html2ps) may be able to use TeX hyphenation files.
-
Locates ends of sentences and colons, so that they can be followed
by two spaces rather than one. Automatically recognizes that honorifics
like "Mr. Smith" aren't ends of sentences, and that sentences may
be in quotations. It recognizes that constructs like "929 N. Durello"
are not the ends of sentences.
-
Handles dangling hyphens at the ends of lines, so that they are not followed
by spurious spaces.
-
Can usually mark up centered lines. (Though Project Gutenberg frowns
on centered text, a lot of folks use it anyhow.)
-
There are no practical limitations in terms of file sizes.
-
Only a minuscule subset of HTML is used, so the marked-up files should
have maximum portability.
-
Traditionally, PG etexts have used so-called "7-bit" ASCII, but lately
a number of "8-bit" ASCII texts have shown up. These 8-bit files
more accurately represent the diacritical marks found in non-English texts.
For example, 'ü' in an 8-bit etext shows up merely as 'u' in a 7-bit
etext. GutenMark is able to handle both.
-
GutenMark can also, to some extent, restore the diacritical marks
which are not present at all in 7-bit ASCII etexts. For example,
if we encounter the word "role" in a 7-bit English-language ASCII text,
it will be converted to "rôle".
-
Experimental LaTeX support has been added, providing an alternative to
HTML output.
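As an illustration of the italics item in the list above, here is a minimal Python sketch that rewrites three of the listed marker styles (_italicized_, /italicized/ and ~~italicized~~) as HTML italics. The patterns and the function name are assumptions made for this example; GutenMark's own recognition logic is not shown on this page and is surely more careful. The <italicized> and ITALICIZED styles are left out because they need more context to tell apart from ordinary markup and from emphasis in capitals.

```python
import re

# Hypothetical patterns, one per marker style. The slash pattern runs first,
# before any </i> tags exist, and refuses to start inside a word or a tag.
ITALIC_PATTERNS = [
    re.compile(r"(?<![\w/<])/([^/\s][^/\n]*?)/(?![\w/])"),        # /word/
    re.compile(r"(?<![A-Za-z0-9_])_([^_\n]+)_(?![A-Za-z0-9_])"),  # _word_
    re.compile(r"~~([^~\n]+)~~"),                                  # ~~word~~
]


def mark_italics(text):
    """Replace each recognized marker pair with <i>...</i>."""
    for pattern in ITALIC_PATTERNS:
        text = pattern.sub(r"<i>\1</i>", text)
    return text


print(mark_italics("He said _never_, then /jamais/, then ~~nunca~~."))
```

The lookaround checks keep constructs such as "and/or" or "snake_case_name" from being mistaken for italics, which is the kind of false positive any such heuristic has to guard against.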
Wish List
Some of the items below represent things that are merely hard to accomplish,
whereas others are simply not possible because the information that would
be needed to accomplish them is not present in the PG files. But
I still can wish ...
-
Most of the processing in GutenMark is actually devoted just to
determining the location of section headings and verse. Frankly,
in spite of this, it could still be improved a lot!
-
So could the accuracy of determining the author and title of the book.
-
Language-dependent formatting. The rules used by GutenMark are
appropriate for the Project Gutenberg etexts existing currently (2001):
namely, primarily 19th-century English and American fiction. This
affects the characters used for quotation marks, and possibly other formatting
characteristics.
-
Automatic detection of the etext's native language.
-
All wordlists need to be extended by adding more words. The French
wordlist seems pretty good, while the English, German, and Latin wordlists
need improvement. (I have no observations on the other wordlists
at present.)
-
Except for the Norwegian wordlist, which already has them, it might be
worthwhile to add soft-hyphens to the wordlists. These can
be used for automatic hyphenation. [NOTE: Even without
soft-hyphens in the wordlists, post-processing programs may be able to
perform hyphenation. For example, html2ps can use TeX hyphenation
files to obtain the necessary data.]
-
Restoration of missing currency symbols, particularly Pound (£) and
Yen (¥).
-
Restoration of Spanish inverted exclamation points (¡) and question
marks (¿).
-
Restoration of Greek transliterated to Latin, back into Greek. In
some PG etexts, Greek text is simply discarded (and obviously cannot be
recovered). In other cases it has been transliterated to Latin characters,
but there are various schemes for doing so, and these are seldom specified.
Furthermore, the transliterated text is often not marked in any way as
being Greek.
-
Removal of false hard-hyphens. For example, suppose one line of the
etext ended with "soft-", and the next line began with "hyphen".
Should this be treated as "soft-hyphen" or as "softhyphen"?
-
Footnotes/endnotes. Innumerable footnote/endnote styles appear in
PG etexts. Sometimes footnotes are just discarded. Sometimes
they are embedded directly in the text. Sometimes they appear at
the ends of paragraphs. Sometimes at the ends of chapters.
Sometimes at the end of the book. When they do appear, their markings
are highly inconsistent. Sometimes they're enclosed in brackets.
Sometimes they're marked with "*", "**", etc. Sometimes with numbers,
like "[53]" or "[FN#53]" or "{#53}". (I could continue, but you get
the idea.)
-
Use of "-" where "--" was actually intended.
-
Dealing with things like "right-" when appearing at the end of the line,
as (for example) in the phrase "this happens with both the right- and left-hand
versions." GutenMark would treat this as "this happens with
both the right-and left-hand versions."
-
Tabular data. GutenMark actually makes some attempt
to detect tables, and when it does so it renders them in a fixed-width
font that allows the columns to line up. However, it could do a much
better job of detecting tables, and it could render them as actual HTML
tables.
-
Double-column verse.
-
Attributions. By this, I mean quotes which are set off from the surrounding
text, and which are followed by the author's name (which is supposed to
be at the far right of the quotation).
-
Spacing in verse or dramatic scripts. Verse and scripts (like plays)
are depicted in a variable-width font, and this may result in incorrect
alignment among successive lines. Consider the following example,
that might appear in a play, in which several characters respond simultaneously
to another character:
                            ( Nonsense!
                            | You can't be serious!
I've decided to leave you!  { What!
                            | You don't have the nerve!
                            ( That's crazy talk!
The intention of the person creating the etext was clearly that a single
large left-hand brace should precede the text at the right. GutenMark,
however, will not only not add a large brace, but will
jumble up the spacing so that it doesn't even look as good as it does here.
-
Illustrations. Well, PG etexts don't have illustrations. But
still ...
-
Bullets. I haven't seen many bullets in PG etexts, but I'm sure GutenMark
won't handle them.
-
Italicizing titles of works, such as "I looked it up in the Oxford English
Dictionary." This requires the addition of a database of titles
of printed works.
In general, the closer the etext conforms to PG guidelines, the better
GutenMark can handle it.
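The two end-of-line hyphen items in the wish list above come from the same simple rule: when a line ends in a hyphen, the next line is attached with no space. The Python sketch below shows that rule and why it cannot tell the cases apart. It is a simplified illustration with a hypothetical function name, not GutenMark's paragraph-joining code.

```python
def join_lines(lines):
    """Join the lines of one paragraph into a single string.

    A space is inserted between lines unless the text so far ends in a
    hyphen, in which case the next line is attached directly.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text and not text.endswith("-"):
            text += " "
        text += line
    return text


# A hyphen that belongs to the word: the result is what the author intended.
print(join_lines(["a discussion of the soft-", "hyphen"]))

# A suspended hyphen: the space that should follow "right-" is lost.
print(join_lines(["this happens with both the right-", "and left-hand versions."]))
```

Deciding between "soft-hyphen", "softhyphen" and "right- and" needs knowledge of the words themselves, which is why a wordlist lookup is the natural place to improve this.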
Software Developers
I really appreciate those who have contributed features or bug fixes to
GutenMark,
but I still haven't provided any systematic way for you to do so. If
you have any such changes in hand, I'd suggest communicating them directly
to me.
Oh, and I know that the code isn't very pretty. I was really just
throwing together a 'proof of concept', and it started being useful much
more quickly than I thought it would, so it got a little out of hand.
Probably I'll pretty it up later.
Click here if you want to know more about
how GutenMark works.
©2001 Ronald S. Burkey
Last updated 12/01/01 by RSB. Contact me.