GutenMark Tutorial
Printing or converting to PDF
Manual Tweaking
GutenMark [options] [inputfile [outputfile]]
For example,
GutenMark tomco10.txt tomco10.html
Other possibilities are to use the program in "filter" style:
GutenMark tomco10.txt > tomco10.html
or
GutenMark < tomco10.txt > tomco10.html
GutenMark is intended to be fully automatic, but there are quite a few command-line options that are available anyway.
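Since each run takes one input file and one output file, converting a whole directory of etexts is a matter of looping in the shell. The following is only a sketch: the names html_name, convert_all, and the GUTENMARK variable are made up for illustration and are not part of GutenMark, and it assumes the etexts end in .txt.

```shell
#!/bin/sh
# Hypothetical helper: tomco10.txt -> tomco10.html
html_name() {
    printf '%s\n' "${1%.txt}.html"
}

# Hypothetical helper: convert every .txt file in the directory "$1".
# GUTENMARK is an assumed override for where the program lives.
convert_all() {
    for etext in "$1"/*.txt; do
        [ -f "$etext" ] || continue      # no .txt files; nothing to do
        "${GUTENMARK:-GutenMark}" "$etext" "$(html_name "$etext")" || return 1
    done
}
```

Running "convert_all ." would then convert everything in the current directory, stopping at the first etext that fails.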
|
|
|
|
|
(20011122 and later.) Creates a log file, GutenMark.log, from which certain internal operations of GutenMark can be examined. It also causes the files GutenMark.native.gz and GutenMark.foreign.gz to be created; these are wordlists containing only the words that actually appear in the source file. These supplementary output files are useful only for developers. |
| --first-capital | (20011209 and later.) By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it simply fixes the capitalization of the word. With the '--first-capital' option, it instead allows the first word of the chapter to remain in ALL-CAPS. (However, it does not convert such a word to ALL-CAPS if it is not already.) Cannot be used with the '--first-italics' switch. |
| --first-italics | (20011209 and later.) By default, if GutenMark finds the first word of a chapter in ALL-CAPS, it does not italicize the word as it would with other ALL-CAPS words. With the '--first-italics' option, it treats the first word of the chapter just like all other words and italicizes it if it had been in ALL-CAPS. Cannot be used with the '--first-capital' switch. |
| --force-numeric
--force-symbolic |
(20011210 and later.) You can use this switch if your browser displays special characters (like soft hyphens) in a funky way. Explanation: special characters are encoded in HTML either symbolically (such as "&lsquo;" for a left single-quote) or numerically (such as "&#8216;" for a left single-quote). The symbolic form is easier for people who want to read the raw HTML (rather than just viewing it in a browser) and is perfectly correct -- but browsers are more consistent at supporting the numeric form. By default, the numeric form is used; the '--force-symbolic' switch changes to the symbolic form instead, and is recommended if you intend to add manual markup to the HTML. (The '--force-numeric' switch actually has no use in any released version, and is present only for development purposes, but it doesn't hurt to use it.) |
|
|
Displays a list of the available options. |
| --latex | (20011126 and later.) As an alternative to creating HTML output, there is an experimental patch (thanks to Joe Cherry!) for creating LaTeX output instead. This is still quite buggy, but may produce some interesting results. |
| --no-diacritical | (20011125 and later.) By default, GutenMark restores diacritical marks in words for which there is no native equivalent without diacritical marks. For example, suppose the word "Fraulein" appears in an English-language etext. This is not an English word. In fact, it is not a word in any language. The correct (German) word is "Fräulein". This is a systematic problem that appears throughout almost all PG etexts. GutenMark will notice this kind of thing, and try to restore the word to its proper form. (This is a separate issue from italicizing the word as foreign -- see below.) You can turn this feature off with the '--no-diacritical' command-line switch. |
| --no-foreign | (20011125 and later.) By default, GutenMark attempts to italicize foreign words -- i.e., words not in the native language of the etext. The '--no-foreign' command-line switch turns this feature off. |
|
|
Outputs paragraphs in ragged-right format. The default format is right justified. This option is useful if the htmldoc utility is used to convert HTML to Postscript because htmldoc is (or has been) buggy in regard to right justification. Or, I guess, if you just prefer ragged-right text. |
| --no-mdash | (20011109 and later.) By default, GutenMark replaces constructs like "--" with an mdash character. This looks better when printed, but most browsers do a very poor job of rendering mdashes, so that HTML looks better with the original dashes in place. The "--no-mdash" command-line option turns off the mdash conversion. |
| --profile | (20011122 and later.) GutenMark uses wordlists and namelists to help it perform various tasks (such as identifying which words are in the native language of the etext and which are foreign). A configuration file, GutenMark.cfg, lists the wordlists and defines their search ordering and native/foreign status. The configuration file can contain multiple named profiles, perhaps representing different native languages. The default profile is named 'english', but alternate profiles can be selected using the '--profile' command-line option. If the specified profile is not found in the configuration file, GutenMark uses all wordlists and namelists it can find, in the following order: namelist for native language, all other namelists, wordlist for native language, all other wordlists. Note that using all wordlists and namelists can be quite time consuming, so defining a custom profile is generally a better idea. The configuration file, as distributed, contains profiles "english" (using a small set of wordlists), "none" (using no wordlists), and "english_all" (using all wordlists). |
| --single-space | (20011210 and later.) By default two blank spaces are used between sentences or after colons, which is standard editorial practice (at least, in American English). By user request, the '--single-space' command-line switch has been added to reduce this to a single blank space instead. |
| --yes-header | (20011112 and later.) By default, GutenMark removes Project Gutenberg's file header from the HTML output, in order to ensure conformance with PG requirements. The "--yes-header" command-line option causes the PG header to be retained. You need to read the PG header and evaluate for yourself whether retention of the header is legal or desirable for your application. (Removal of the header is guaranteed to be legal.) |
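To make the two encodings described under '--force-symbolic' concrete, here is a sketch of a filter that rewrites a few symbolic entities in existing HTML as their numeric equivalents. This is not something GutenMark itself provides; the function name to_numeric is invented here, and only four entities are handled.

```shell
# Hypothetical filter: reads HTML on stdin, writes HTML on stdout.
# In a sed replacement, "\&" is a literal ampersand.
to_numeric() {
    sed -e 's/&lsquo;/\&#8216;/g' \
        -e 's/&rsquo;/\&#8217;/g' \
        -e 's/&mdash;/\&#8212;/g' \
        -e 's/&shy;/\&#173;/g'
}
```

It would be used in filter style, for example "to_numeric < tomco10.html > tomco10-numeric.html" (the output filename is just an example).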
Here is what the complete sequence of steps looked like, in Linux, for converting the sample etext to PDF format:
# Create HTML from the PG etext.
GutenMark bldhb10.txt bldhb10.html
# Create 8.5"x5.5" Postscript, hyphenated, from the HTML.
html2ps -H -f half12schoolbook.rc bldhb10.html > bldhb10.ps
# Create PDF from the Postscript.
ps2pdf bldhb10.ps
Or, in Linux, we could simply have printed it rather than creating PDF, by replacing the final command with
# Print the Postscript file.
lpr bldhb10.ps
Another interesting thing you can do is to print in booklet format -- two pages on the front and two pages on the back of standard letter-sized paper, with the pages reordered so the whole mess can be folded or cut into half-letter sized pages. This can be done with the freely available PSUtils tools. In Linux, you'd replace the ps2pdf step with this:
# Form the Postscript pages into a "signature":
psbook bldhb10.ps signature.ps
# Combine the pages 2-up.
pstops "2:0L@1.0(8.5in,0)+1L@1.0(8.5in,5.5in)" signature.ps booklet.ps
# Pull off the odd-numbered 2-up pages, in reverse order.
psselect -o -r booklet.ps frontsides.ps
# Pull off the even-numbered 2-up pages, in normal order.
psselect -e booklet.ps backsides.ps
# Print it.
lpr frontsides.ps
... feed the paper back into the printer ...
lpr backsides.ps
With GutenMark's "--latex" command-line switch, you also have the possibility of printing or converting etext using LaTeX. I'll post an explanation of that here when I understand the possibilities better myself.
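If you convert more than one etext, the three PDF steps can be gathered into a single shell function that stops at the first step that fails. This is only a sketch: the function name etext_to_pdf is invented here, and it assumes the same tools and the same half12schoolbook.rc file used in the commands above.

```shell
# Hypothetical wrapper: "etext_to_pdf bldhb10" expects bldhb10.txt to exist.
# Each step runs only if the previous one succeeded.
etext_to_pdf() {
    base=$1
    GutenMark "$base.txt" "$base.html" &&
    html2ps -H -f half12schoolbook.rc "$base.html" > "$base.ps" &&
    ps2pdf "$base.ps"
}
```

Chaining with && means a failed GutenMark run never leaves a stale or empty Postscript file behind for ps2pdf to pick up.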
With that in mind, here's a list of things that I find objectionable in GutenMark HTML output, roughly in descending order of importance. I would hazard a guess that only the first two items are truly objectionable to most people.
©2001 Ronald S. Burkey
Last updated 12/28/01 by RSB.