GutenMark Usage Page
Attractively formatting Project Gutenberg texts

Contents

GutenMark Tutorial
Manual Tweaking
Features
Wish List
Software Developers

GutenMark Tutorial

GutenMark is a command-line utility, so you have to use it from the Win32 "MS-DOS Prompt" or from the Linux/UNIX/BSD/MacOS-X command shell:
GutenMark [options] [inputfile [outputfile]]
For example,
GutenMark tomco10.txt tomco10.html
Other possibilities are to use the program in "filter" style:
GutenMark tomco10.txt > tomco10.html
     or
GutenMark < tomco10.txt > tomco10.html
GutenMark is intended to be fully automatic, but there are a few command-line options:
 
Option
Description
--debug
Creates a log file, GutenMark.log, from which certain internal operations of GutenMark can be examined.  It also causes the files GutenMark.native and GutenMark.foreign to be created; these are wordlists containing only the words that actually appear in the source file.  These supplementary output files are useful only for developers.
--help
Displays a list of the available options.
--latex
As an alternative to creating HTML output, there is an experimental patch (thanks to Joe Cherry!) for creating LaTeX output instead.  This is still quite buggy, but may produce some interesting results.
--no-diacritical
By default, GutenMark restores diacritical marks in words for which there is no native equivalent without diacritical marks.  For example, suppose the word "Fraulein" appears in an English-language etext.  This is not an English word.  In fact, it is not a word in any language.  The correct (German) word is "Fräulein".  This is a systematic problem that appears throughout almost all PG etexts.  GutenMark will notice this kind of thing, and try to restore the word to its proper form.  (This is a separate issue from italicizing the word as foreign -- see below.)  You can turn this feature off with the '--no-diacritical' command-line switch.
--no-foreign
By default, GutenMark attempts to italicize foreign words -- i.e., words not in the native language of the etext.  The '--no-foreign' command-line switch turns this feature off.
--no-justify
Outputs paragraphs in ragged-right format.  The default format is fully justified (text flush with both margins).  This option is useful if the htmldoc utility is used to convert HTML to Postscript, because htmldoc is (or has been) buggy in regard to justification.  Or, I guess, if you just prefer ragged-right text.
--no-mdash
By default, GutenMark replaces constructs like "--" with an mdash character.  This looks better when printed, but most browsers do a very poor job of rendering mdashes, so that HTML looks better with the original dashes in place.  The "--no-mdash" command-line option turns off the mdash conversion.
--profile=name
GutenMark uses wordlists and namelists to help it perform various tasks (such as identifying which words are in the native language of the etext and which are foreign).  A configuration file, GutenMark.cfg, lists the wordlists and defines their search ordering and native/foreign status.  The configuration file can contain multiple named profiles, perhaps representing different native languages.  The default profile is named 'english', but alternate profiles can be selected using the '--profile' command-line option.  If the specified profile is not found in the configuration file, GutenMark uses all wordlists and namelists it can find, in the following order:  namelist for name language, all other namelists, wordlist for name language, all other wordlists.  Note that using all wordlists and namelists can be quite time consuming, so defining a custom profile is generally a better idea.  The configuration file, as distributed, contains profiles "english" (using a small set of wordlists), "none" (using no wordlists), and "english_all" (using all wordlists).
--yes-header
By default, GutenMark removes Project Gutenberg's file header from the HTML output, in order to ensure conformance with PG requirements.  The "--yes-header" command-line option causes the PG header to be retained.  You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application.  (Removal of the header is guaranteed to be legal.)
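
The options above can be combined on one command line, ahead of the input and output file names.  As a minimal sketch (the function name convert_all and the GUTENMARK variable are hypothetical and not part of GutenMark), a shell function along these lines would convert every .txt etext in a directory with whatever options you choose:

```shell
#!/bin/sh
# Hypothetical helper, not part of GutenMark: convert every *.txt etext in a
# directory to HTML, passing along whatever GutenMark options are given.
# ASSUMPTION: the GutenMark executable is on the PATH (override via GUTENMARK).
GUTENMARK="${GUTENMARK:-GutenMark}"

convert_all() {
    dir="${1:-.}"                  # directory to scan (default: current one)
    [ "$#" -gt 0 ] && shift        # remaining arguments are GutenMark options
    for txt in "$dir"/*.txt; do
        [ -e "$txt" ] || continue  # no .txt files: skip the unexpanded pattern
        html="${txt%.txt}.html"
        # Same form as "GutenMark [options] inputfile outputfile" above.
        "$GUTENMARK" "$@" "$txt" "$html" || echo "failed: $txt" >&2
    done
}
```

For instance, "convert_all . --no-mdash --no-justify" would run GutenMark with those two switches on each etext in the current directory, writing tomco10.html next to tomco10.txt and so on.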

Another thing you might want to do, of course, is to make a hardcopy of the reformatted etext.  You can do this by printing directly from your browser, but the typical browser does not do a great job of making the HTML (however well it has been created) print like a book.  Several options are available, such as loading the HTML into Microsoft Word and printing it from there.  A better method is to use one of the freely available HTML-to-Postscript conversion utilities to create a Postscript or PDF version of the book.  This is, perhaps, easier if you are a Linux/BSD user than if you are a Windows user.  To create the various PDF samples that appear on this website, I used (on Linux) the free utility html2ps, along with various custom configuration files that you can get from my download page.

Here is what the complete sequence of steps looked like, in Linux, for converting the sample etext to PDF format:

# Create HTML from the PG etext.
GutenMark bldhb10.txt bldhb10.html
# Create 8.5"x5.5" Postscript, hyphenated, from the HTML.
html2ps -H -f half12schoolbook bldhb10.html > bldhb10.ps
# Create PDF from the Postscript.
ps2pdf bldhb10.ps
Or, in Linux, we could simply have printed it rather than creating PDF, by replacing the final command with
# Print the Postscript file.
lpr bldhb10.ps
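
If you convert many etexts, the conversion steps above can be wrapped in a single shell function.  This is only a sketch: the name etext_to_pdf is hypothetical, and it assumes that GutenMark, html2ps and ps2pdf are all on the PATH and that the html2ps configuration file (half12schoolbook in the example above) is available in the current directory.  Chaining the steps with && means a later step never runs on the output of a failed earlier one.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of GutenMark: PG etext -> HTML -> Postscript -> PDF.
# ASSUMPTION: GutenMark, html2ps and ps2pdf are installed and on the PATH.
# Usage: etext_to_pdf bldhb10.txt [html2ps-config-file]
etext_to_pdf() {
    base="${1%.txt}"                   # bldhb10.txt -> bldhb10
    style="${2:-half12schoolbook}"     # html2ps configuration file to use
    # Create HTML from the PG etext.
    GutenMark "$base.txt" "$base.html" &&
    # Create hyphenated Postscript from the HTML.
    html2ps -H -f "$style" "$base.html" > "$base.ps" &&
    # Create PDF from the Postscript.
    ps2pdf "$base.ps" "$base.pdf"
}
```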
Another interesting thing you can do is to print in booklet format -- two pages on the front and two pages on the back of standard letter-sized paper, with the pages reordered so the whole mess can be folded or cut into half-letter sized pages.  This can be done with the freely available PSUtils tools.  In Linux, you'd replace the ps2pdf step with this:
# Form the Postscript pages into a "signature":
psbook bldhb10.ps signature.ps
# Combine the pages 2-up.
pstops "2:0L@1.0(8.5in,0)+1L@1.0(8.5in,5.5in)" signature.ps booklet.ps
# Pull off the odd-numbered 2-up pages, in reverse order.
psselect -o -r booklet.ps frontsides.ps
# Pull off the even-numbered 2-up pages, in normal order.
psselect -e booklet.ps backsides.ps
# Print it.
lpr frontsides.ps
# ... feed the paper back into the printer ...
lpr backsides.ps
With GutenMark's "--latex" command-line switch, you also have the possibility of printing or converting etext using LaTeX.  I'll post an explanation of that here when I understand the possibilities better myself.


Manual Tweaking

GutenMark aims to provide a completely automatic system for formatting Project Gutenberg etexts.  At the same time, GutenMark is a very new program and is not perfect.  Consequently, depending on your purpose in creating the formatted texts, you may want to improve the results with some manual tweaking of the HTML.  Generally, this will be a matter of scanning through the HTML in a WYSIWYG editor (such as Netscape Composer or Microsoft Word) and quickly fixing the things that seem most objectionable to you.

With that in mind, here's a list of things that I find objectionable in GutenMark HTML output, roughly in descending order of importance.  I would hazard a guess that only the first two items are truly objectionable to most people.

  1. GutenMark does not produce a title page, copyright notice, etc.
  2. GutenMark is not perfect at deducing section headings.  The most common problem is lines that are falsely marked as headings when they are actually normal text.  Most documents are not affected, but some are.
  3. GutenMark is not perfect at distinguishing between prose and verse.  This can result in verse that is falsely formatted as a justified paragraph or, more commonly, as a ragged-right prose paragraph with shorter-than-average lines.  This commonly happens only a few times within a document, and is often not noticeable to the average reader.
  4. GutenMark is not perfect at distinguishing between native-language text and foreign text.  This commonly manifests itself either as proper names that are incorrectly identified as foreign words (and hence are italicized), or else as individual words in foreign phrases that are not identified as being foreign.  The latter problem results in occasional multi-word italicized foreign phrases having a few words that are not italicized.
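
Before opening a WYSIWYG editor, it can help to list the spots most likely to need attention (items 2 and 4 above).  The following is a rough sketch, not a GutenMark feature: the function names are hypothetical, and it ASSUMES that headings appear in the HTML as ordinary <h1> through <h6> tags and that italicized foreign words appear as <i>...</i>.  Look at your own output file first and adjust the patterns if it is marked up differently.

```shell
#!/bin/sh
# Hypothetical review helpers, not part of GutenMark.
# ASSUMPTION: headings are <h1>..<h6> tags and italics are <i>...</i>;
# adjust the patterns if your HTML output uses different markup.

# Print every line containing a heading tag, with its line number, so
# falsely detected headings are easy to spot.
list_headings() {
    grep -n -i '<h[1-6][ >]' "$1"
}

# Count each distinct italicized word or phrase, most frequent first, so
# proper names wrongly italicized as foreign stand out.
list_italics() {
    grep -o -i '<i>[^<]*</i>' "$1" | sort | uniq -c | sort -rn
}
```

For example, "list_headings tomco10.html" gives a quick outline of what was treated as a heading, and "list_italics tomco10.html" shows which words were italicized and how often.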

Features

Here are some of the things GutenMark does:

Wish List

Some of the items below represent things that are merely hard to accomplish, whereas others are simply not possible because the information that would be needed to accomplish them is not present in the PG files.  But I still can wish ... In general, the closer the etext conforms to PG guidelines, the better GutenMark can handle it.


Software Developers

I really appreciate those who have contributed features or bug fixes to GutenMark, but I still haven't provided any systematic way for you to do so.  If you have any such changes in hand, I'd suggest communicating them directly to me.

Oh, and I know that the code isn't very pretty.  I was really just throwing together a 'proof of concept', and it started being useful much more quickly than I thought it would, so it got a little out of hand.  Probably I'll pretty it up later.

Click here if you want to know more about how GutenMark works.


©2001 Ronald S. Burkey
Last updated 12/01/01 by RSB.  Contact me.