GutenMark Bug/Issue List
Attractively formatting Project Gutenberg texts


GutenMark has no formal bug-tracking system (community interest has not yet justified one), so I use the simple table below to record outstanding issues, including any you tell me about, and their resolutions.
# Date posted Status Description of bug or issue
82 07/14/02 Closed Changes in constants used by glob.h cause the program not to compile in some newer versions of FreeBSD.
81 07/14/02 To-do The Win32 and *nix versions do not agree in their treatment of the initial line of the sample etext.  (They do treat the remainder of the sample etext identically.)
80 07/13/02 Look at me! The documentation I've provided about configuring wordlists has been wrong and misleading.  I have stupidly stated that wordlists are supposed to be stored in the directory containing the GutenMark executable and configuration files.  While they can be stored there, the default configuration file assumes rather that they are stored in the current directory, and consequently none of the wordlists will be found (using the default configuration file) if the program is not run from within the directory where it resides.

Workaround:  Please edit the configuration file to show the exact pathnames of the wordlists/namelists.

79 07/10/02 Closed 07/14/02. In Win32, if GutenMark is not run from within the directory where it lives, then its configuration file may not be found.  If the configuration file is found, then wordlists won't be found.   Several problems conspire to produce this effect:
  1. Win32 automatically adds an extension of ".EXE" when it reports the program name to GutenMark (as argv[0]), causing the configuration file to be GutenMark.EXE.cfg rather than GutenMark.cfg.
  2. Win32 automatically butchers the directory names in reporting them to GutenMark (as argv[0]) by shortening them to 8 characters as things like "GUTENM~1".
  3. My Win32 version of glob seems not to work except within the current directory.
  4. Finally -- not really a bug, but nevertheless a related issue of which many users won't be aware -- the wordlists listed within the configuration file must have their full pathnames rather than the relative pathnames found in the default configuration file.
Thanks to John Wells for reporting this problem.
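Problem 1 above can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from GutenMark's source; the function name config_path_for is hypothetical. It only shows how dropping a trailing ".EXE" from argv[0] before appending ".cfg" gives GutenMark.cfg rather than GutenMark.EXE.cfg. It does nothing about the shortened directory names of problem 2.

```python
import ntpath  # Windows path rules, usable on any platform


def config_path_for(argv0: str) -> str:
    """Return the configuration-file path implied by a Win32-style argv[0]."""
    root, ext = ntpath.splitext(argv0)
    # Win32 may report "GutenMark.EXE"; drop that extension (any letter case)
    # so the result is GutenMark.cfg, not GutenMark.EXE.cfg.
    if ext.lower() != ".exe":
        root = argv0
    return root + ".cfg"
```

For example, config_path_for(r"C:\GUTENM~1\GutenMark.EXE") keeps the directory as reported and changes only the file name.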
78 06/16/02 To-do. In LaTeX, verse is rendered poorly (relative to the way it is rendered in HTML).  If paragraphs are not indented (the default), there is an extra blank line in between every line of verse.  If paragraphs are indented (--no-parskip), these blank lines don't appear, but if the verse is the first thing in the chapter the first verse line is not aligned with the others.
77 06/16/02 Closed. In LaTeX output, italicized text inserted by GutenMark -- for example, all-caps converted to upper&lower case italics -- is omitted.
76 06/15/02 Closed 06/16/02. LaTeX page headings show incorrect chapter names.
75 01/24/02 Needs investigation. (Thanks to Curtis Weyant.)   There is apparently a problem (e.g., lkhst10.txt) when the first lines of paragraphs are not indented, but the subsequent lines are; these are treated as verse by GutenMark.  (Yikes!  I never saw such a thing before.)
74 01/24/02 Under consideration (Suggestion thanks to Curtis Weyant.)  Provision might be made for a list of words which are never capitalized, except at the beginnings of sentences.
73 01/24/02 To-do. (Suggestion thanks to Curtis Weyant.)   Conversion of ALL-CAPS headings to upper/lower case (perhaps as a command-line option) would be useful.
72 12/28/01 To do. For non-PG etexts, the same means of deducing title and author cannot be used as for PG etexts.  Currently, non-PG title and author are left blank.
71 12/27/01 To do. For OCR'd text that hasn't been proofread well, it is common to find that the OCR software has inserted a '~' character wherever it does not recognize a character.  If this is the first character in a word, it will toggle italics mode on (see issue #64).   Therefore, for the special case of ~italicizing~, GutenMark needs to look for a trailing ~ before toggling italics on.
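A minimal sketch of the check proposed here, in Python and not taken from GutenMark's source (the names TILDE_ITALIC and italicize_tildes are hypothetical). It handles only the single-word case: a leading ~ is treated as markup only when a closing ~ ends the same word, so a stray OCR ~ is left alone.

```python
import re

# "~" at the start of a word, a run of non-space, non-tilde characters,
# then a closing "~" that is not followed by another word character.
TILDE_ITALIC = re.compile(r"(?<!\S)~([^\s~]+)~(?!\w)")


def italicize_tildes(text: str) -> str:
    """Wrap ~word~ in <i>...</i>; leave unpaired tildes untouched."""
    return TILDE_ITALIC.sub(r"<i>\1</i>", text)
```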
70 12/27/01 Closed GutenMark does not work with otherwise-suitable plain-vanilla ASCII etexts that don't have a PG header/footer.
69 12/20/01 Probably needs AI. In ytagn10.txt, there is a section titled '273'.  Not surprisingly, this isn't recognized as a section heading.
68 12/20/01 Probably needs AI. In ytagn10.txt, for the first time, we see a section that has subsections. GutenMark marks the first as a sub-heading, but cannot distinguish any of the rest from normal text.
67 12/20/01 We'll see ... In ytagn10.txt, we find "o^" and "e^", presumably intended to be 'ô' and 'ê'.  I'll have to find this same construction in other files before applying a fix in GutenMark for it.  For reasons I don't quite grasp at this moment, this etext also encodes 'ç' as character #135, which doesn't correspond to anything in any character encoding I'm familiar with.
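If this construction does turn up in other files, a fix might look something like the following Python sketch. It is an illustration only, not GutenMark code: the function name restore_circumflex is hypothetical, and restricting the rule to vowels is an assumption made here to avoid touching things like "x^2".

```python
import re
import unicodedata


def restore_circumflex(text: str) -> str:
    """Turn vowel + "^" (as in "o^") into the accented letter."""
    def compose(match):
        # U+0302 is COMBINING CIRCUMFLEX ACCENT; NFC merges it with the vowel.
        return unicodedata.normalize("NFC", match.group(1) + "\u0302")

    return re.sub(r"([aeiouAEIOU])\^", compose, text)
```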
66 12/18/01 Probably impossible currently (See also issue #32.)  There are many characters which don't appear in the HTML 4.0 character-entity set at all.  Consider, for example, the 6 different regional encodings used by NIMA, as compared to the HTML 4.0 entities.  While there is a substantial (or complete, in some cases) overlap for characters 'a'-'z', 'A'-'Z', and 192-255, there are also many characters simply missing.  This is probably not an issue for English-language (or at least, American) readers, but still ...

Various issues make this very difficult.  Probably, unicode is necessary.  Even where browsers have fairly good unicode support, equal support is not available in the HTML-to-Postscript conversion (if used).  Then, too, adding unicode support within GutenMark would be a pretty substantial undertaking ...

65 12/18/01 Closed The simple categorization of wordlists as "foreign" or "native" needs to be made more subtle.  This is most easily understood in terms of the French namelist.  In an English text, French names would need to be treated as "native" if encountered by themselves, but as "foreign" in the context of a foreign phrase.  Currently, they could only be treated as one or the other (not both) on the basis of the GutenMark.cfg file.  The same principle, of course, applies to any proper names (people, places, etc.).  Resolution:  The fix applied for problem #27 should fix this as well.
64 12/16/01 Closed The following additional emphasizing markups (beyond those already supported) were mentioned on the gutvol-d newsgroup.  Whether any or all of them are used in PG etexts, I can't say, but I guess they should be supported:
  • *emphasized*
  • ~emphasized~
  • _/emphasized/_
  • _*emphasized*_
  • */emphasized/*
  • _*/emphasized/*_
  • /:emphasized:/
  • |:emphasized:|
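The markups listed above could be recognized with a table of delimiter pairs, as in this Python sketch. It is not GutenMark's implementation: the names DELIMITERS, EDGE and mark_emphasis are hypothetical, and the sketch assumes the emphasized text itself contains none of the delimiter characters.

```python
import re

# (opening, closing) pairs from the list above, longest first so that
# "_*/" is tried before "_*" or "*".
DELIMITERS = [
    ("_*/", "/*_"), ("_/", "/_"), ("_*", "*_"), ("*/", "/*"),
    ("/:", ":/"), ("|:", ":|"), ("*", "*"), ("~", "~"),
]
EDGE = r"[\w*/_~|:]"  # characters that may not touch the outside of a markup


def mark_emphasis(text: str) -> str:
    """Replace each supported markup with <i>...</i>."""
    for opener, closer in DELIMITERS:
        pattern = (
            rf"(?<!{EDGE})" + re.escape(opener)
            + r"(\w[^*/_~|:]*?)"          # emphasized text, starting with a letter
            + re.escape(closer) + rf"(?!{EDGE})"
        )
        text = re.sub(pattern, r"<i>\1</i>", text)
    return text
```

Requiring a word character right after the opening delimiter keeps something like "2 * 3 * 4" from being mistaken for emphasis.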
63 12/16/01 Closed In automatic conversion of 7-bit ASCII to 8-bit ASCII, the HTML may contain 8-bit codes rather than HTML character entities.
62 12/16/01 To do A couple of cases (thdvn10.txt) in which the program is fooled into treating verse as a blockquote:
  1. Typo in which one line of a stanza does not begin with a capital.
  2. A verse beginning with "----".
61 12/16/01 May be impossible Blockquotes in which the volunteer has used abnormally short lines are indistinguishable from verse, and hence are not wrapped.  Numerous examples appear in thdvn10.txt.
60 12/16/01 Closed Found numerous instances (in thdvn10.txt), in which blockquotes with leading or trailing lines that were indented oddly would be treated as centered text rather than as blockquotes.
59 12/15/01 To-do Question:  should mdashes surrounded by whitespace be normalized by removing the whitespace?
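If the answer turns out to be yes, the normalization could be as small as the Python sketch below. It is an illustration rather than GutenMark code: the names MDASH and tighten_mdashes are hypothetical, and it works on the "--" of the source etext, leaving longer runs of dashes alone.

```python
import re

# Exactly two dashes (not part of a longer run), with any spaces or tabs
# on either side.
MDASH = re.compile(r"[ \t]*(?<!-)--(?!-)[ \t]*")


def tighten_mdashes(text: str) -> str:
    """Remove whitespace surrounding "--"."""
    return MDASH.sub("--", text)
```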
58 12/15/01 Closed Found cases in wuthr10.txt in which mdashes at the ends of paragraphs would appear after the paragraph's closing tag.  Apparently introduced when dealing with issue #42.
57 12/15/01 Closed Found instances in benhr10.txt in which centered paragraphs were begun, but had no closing tags.
56 12/13/01 To-do Normally, "I" is not italicized.  However, if part of an all-caps phrase, like "I AM THE LIGHT", it should be.
55 12/13/01 Possible Line drawings may now be recognizable (see issue #50), but they are merely converted to a fixed-width font, and not to an attractive drawing with lines that join up nicely.  NOTE:  Some browsers (like Mozilla) do support unicode line-drawing characters, but html2ps doesn't currently support them.
54 12/11/01 Closed In benhr10.txt, the name "Iras" incorrectly turns into "irás".
53 12/11/01 Closed In benhr10.txt, footnotes are preceded and followed by a short line of dashes.  These are now incorrectly joined together with the footnote.  In other words
--------------
* This is my footnote
--------------
turns into
-------------- * This is my footnote --------------
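One way to avoid this kind of joining is to treat a line consisting only of dashes as a paragraph of its own. The Python sketch below shows the idea; it is not the fix actually applied in GutenMark, and the names RULE and join_paragraph_lines are hypothetical.

```python
import re

RULE = re.compile(r"-{3,}")  # a line made up of nothing but dashes


def join_paragraph_lines(lines):
    """Join consecutive text lines, but keep dash-only lines separate."""
    out, current = [], []

    def flush():
        if current:
            out.append(" ".join(current))
            current.clear()

    for line in lines:
        stripped = line.strip()
        if RULE.fullmatch(stripped):
            flush()               # end the paragraph before the rule
            out.append(stripped)  # the rule stands alone
        elif not stripped:
            flush()               # blank line: paragraph break
        else:
            current.append(stripped)
    flush()
    return out
```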
52 12/11/01 Closed Another strange artifact in benhr10.txt:  a messed-up price list near the phrase beginning "From separate sheets he then read".
51 12/10/01 Closed In benhr10.txt, there are 3-4 instances in which you get things like this:  VALERIUS turns into <i><i>Valerius</i></i>.
50 12/10/01 Closed Line drawings with dashes and vertical bars appear in benhr10.txt.  (Search for "Gesius".)  They are completely bogus after conversion.
49 12/10/01 Closed An empty paragraph can be opened but not closed under some circumstances at the end of a file.  Actually, this seems to happen in almost every file.
48 12/10/01 Closed When the PG header is discarded, there can be a closing tag </pre> without an opening tag <pre>.
47 12/09/01 Possible Consider alternate output formats:  DocBook, XML, or RTX.  (Thanks to Craig Morehouse.)
46 12/09/01 May be impossible When "dialect" is used -- i.e., when the author has simply made up a lot of new words to express how something sounds -- there is a rather high probability that the made-up words match some words in a foreign language, and hence are rendered as italicized.  A similar problem occurs if the author has simply made up names.
45 12/08/01 Possible Consider the use of Cascading Style Sheets for the HTML.  (Thanks to Terence Tan.)
44 12/08/01 Closed Add a command-line switch to allow single spaces between sentences and after colons.  (Thanks to Terence Tan.)
43 12/08/01 To-do Investigate the feasibility of using the HTML tags <q> and </q> rather than opening/closing quotes.  (Thanks to Terence Tan.)
42 12/08/01 Closed The HTML created by GutenMark is ugly, resulting in less readable source HTML:
  1. Newlines may appear before closing tags rather than after them.
  2. Upper/lower case of tags and entities is inconsistent.
  3. Things like <p align="justify"> would be preferable to <p align=justify>.
(Thanks to Terence Tan for these comments.)
41 12/08/01 Closed The very first word in a section may not be correctly treated as being at the beginning of a sentence, and therefore if in ALL-CAPS will not be capitalized properly.  An example from wuthr10.txt in which YESTERDAY is the first word of a chapter converts to "yesterday" rather than "Yesterday".  (Thanks to Terence Tan.)
40 12/08/01 To-do Need to check that texts in which single-quotes are used systematically in place of double-quotes (such as wuthr10.txt) are handled correctly.
39 12/05/01 Closed GutenMark embeds the compilation date in the disclaimer it adds to the "prefatory area".  This causes the regression test built into the makefile to fail if built on a later date.  Resolution: removed.
38 12/05/01 To-do ALL-CAPS Roman numerals may or may not be handled correctly.
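A first step toward checking this is a test for whether an ALL-CAPS word is a well-formed Roman numeral. The Python sketch below uses the standard pattern for numerals up to 3999; it is not GutenMark code, and the names ROMAN and is_roman_numeral are hypothetical.

```python
import re

# Thousands, hundreds, tens, units, each in its usual subtractive form.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def is_roman_numeral(word: str) -> bool:
    """True if word is a non-empty, well-formed upper-case Roman numeral."""
    return bool(word) and ROMAN.fullmatch(word) is not None
```

The test cannot settle everything by itself: "MIX" and "I" are well-formed numerals and also ordinary English words, so context would still be needed.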
37 12/04/01 To-do For people who actually want to view HTML output in their browser, most HTML files currently output will be too large. There needs to be a command-line option to break the file into smaller files, perhaps at chapter headings.
36 12/03/01 Closed Require more-flexible means of locating the GutenMark.cfg file. Resolution:  If the GutenMark.cfg file isn't found, the program now looks for a configuration file with the same name (and in the same directory) as the executable, but with ".cfg" suffixed to the filename.  This allows a sensible global installation procedure in which the configuration file and the wordlists can be in the same directory as the executable, but also allows the user to override the default configuration file with one of his own.
35 12/03/01 To-do Require a more-sensible installation procedure, with fewer manual steps.
34 12/03/01 To-do There is an appearance of "--" not converted to emdash in bldhb10.html.  It may involve a sequence such as "- -".
33 12/01/01 Closed Require a means of adding a title page and copyright page.  Resolution:  html2ps can add a title page, even if I don't know how.  Let this be done in post-processing, where it belongs.
32 12/01/01 Possible Addition of diacriticals and ligatures (such as the oe ligature), which don't fit into the 8-bit subset of the HTML 4.0 character set, to the wordlists.
31 12/01/01 Closed Add TeX hyphenation data directly to wordlists via soft-hyphens.  The Norwegian wordlist already has this feature.  Resolution:  Since html2ps can already use the TeX hyphenation data, this is just reinventing the wheel.  Let html2ps or some other post-processor handle it.
30 12/01/01 To-do Lists of proper names should be provided for more languages, particularly Latin.
29 12/01/01 Partially handled, for single-word placenames.  Full treatment to-do.  Geographical references should not be italicized unless in ALL-CAPS, and should be capitalized properly in this case.  Since many placenames are multi-word, this cannot be completely handled by the wordlist mechanism.
28 12/01/01 Ongoing All existing wordlists, particularly Latin and German, require improvement.
27 12/01/01 Closed Foreign-phrase detection should be expanded to provide language-consistency within an individual phrase.  For example, once it has been determined that a multi-word phrase is in Latin, Spanish words should no longer be preferentially regarded as foreign within that phrase.
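The idea of language-consistency within a phrase can be sketched as a vote among wordlists. The Python below is only an illustration of the principle and not the fix actually applied; the function name phrase_language and the shape of the wordlists argument (a mapping from language name to a set of lower-case words) are assumptions.

```python
def phrase_language(words, wordlists):
    """Return the language whose wordlist matches the most words, or None."""
    counts = {
        lang: sum(word.lower() in vocab for word in words)
        for lang, vocab in wordlists.items()
    }
    if not counts:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Once a phrase has been assigned to one language, a word such as "si" that also appears in another wordlist no longer pulls the phrase toward that other language.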
26 12/01/01 To-do Automatic detection of text native language, rather than relying on command-line parameter.
25 12/01/01 To-do Language-profile should be used to modify the type of quotation marks.
24 11/26/01 Fixed 06/15/02 LaTeX issue:  GutenMark supports all of the HTML 4.0 alphabetic characters in the numerical range 192-255, and not just the normal ASCII alphabetic characters ('a'-'z' and 'A'-'Z').  Mostly, these are like the normal alphabetic characters, but with added diacritical marks like umlauts, accents, and so on.  These additional characters are simply missing in the LaTeX output.   Later:  There are also badly-displayed 7-bit characters, including |, <, >, ^, and maybe others.
23 11/26/01 Closed LaTeX issue:  Sometimes the table of contents appears (tmotb10.txt), and sometimes it does not (bldhb10.txt).  Resolution:  Apparently, this has to do with the way the latex program is run, rather than any problem in the converted etext.
22 11/26/01 Closed LaTeX issue:  When running the latex program on the latex output, there's the occasional message about hboxes being too wide. Resolution:   This is apparently normal behavior for LaTeX.
21 11/26/01 Fixed 06/15/02 LaTeX issue:  A command-line option that allowed EITHER a blank line between paragraphs with no indenting, OR ELSE indented paragraphs with no blank lines would be useful.
20 11/26/01 Fixed 06/15/02. LaTeX problem:  For the chapter headings, it adds the words "Chapter 1", "Chapter 2", and so on, so that if the actual names of the chapters are (say) "CHAPTER 1", "CHAPTER 2", you'd see headings that looked like
Chapter 1
CHAPTER 1
19 11/26/01 Fixed 06/16/02. LaTeX problem:  In some texts (bldhb10.txt, for example), paragraph indentation seems relatively normal.  In others (tmotb10.txt), the paragraph indentation seems all screwed up.  The text ALL seems to be indented, with just the occasional line (at random, seemingly) not indented.   There's no way to distinguish one paragraph from the next.
18 Antiquity Possible. Bullets.  I haven't seen many bullets in PG etexts, but I'm sure GutenMark won't handle them.
17 Antiquity May be impossible Illustrations.  Well, PG etexts don't have illustrations.  But still ...
16 Antiquity May be impossible within HTML.  May need A.I. Spacing in verse or dramatic scripts.  Verse and scripts (like plays) are depicted in a variable-width font, and this may result in incorrect alignment among successive lines.  Consider the following example, that might appear in a play, in which several characters respond simultaneously to another character:
              ( Nonsense!
              | You're not serious!
I'm leaving!  { What! 
              | Not a chance!
              ( That's crazy talk!
The intention of the person creating the etext was clearly that a single large left-hand brace should precede the text at the right.  GutenMark, however, will not only not add a large brace, but will jumble up the spacing so that it doesn't even look as good as it does here.
15 Antiquity May need A.I. Attributions.  By this, I mean quotes which are set off from the surrounding text, and which are followed by the author's name (which is supposed to be at the far right of the quotation).  Actually, GutenMark's treatment of this case seems to be not unreasonable, but it needs improvement to be professional.
14 Antiquity May need A.I. Detection and treatment of double-column verse.  I'm not sure this appears in any actual Gutenberg text, but I know that it does appear in certain books that have been partially converted to PG, such as Burton's Arabian Nights.
13 Antiquity Ongoing Improvement of table-detection and treatment, as in FLYMC10.TXT.
12 Antiquity To-do Dealing with things like "right-" when appearing at the end of the line, as (for example) in the phrase "this happens with both the right- and left-hand versions."  GutenMark would treat this as "this happens with both the right-and left-hand versions."
11 Antiquity To-do Handling of systematic misuse of "-" where "--" was actually intended.
10 Antiquity To-do Removal of false hard-hyphens.  For example, suppose one line of the etext ended with "soft-", and the next line began with "hyphen".  Should this be treated as "soft-hyphen" or as "softhyphen"?
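Issues #12 and #10 both concern a hyphen at the end of a line, and one possible decision rule is sketched below in Python. It is an assumption-laden illustration, not GutenMark's behavior: the function name join_at_hyphen is hypothetical, the wordlist is taken to be a set of lower-case words, and only "and" and "or" are treated as signs of a suspended hyphen.

```python
def join_at_hyphen(fragment, next_word, wordlist):
    """Join a line-ending fragment such as "soft-" with the next line's first word."""
    # Suspended hyphen, as in "right- and left-hand": keep hyphen and space.
    if next_word.lower() in ("and", "or"):
        return fragment + " " + next_word
    # If the unhyphenated form is a known word, the hyphen was only a line break.
    unhyphenated = fragment[:-1] + next_word
    if unhyphenated.lower() in wordlist:
        return unhyphenated
    # Otherwise assume a genuine (hard) hyphen.
    return fragment + next_word
```

With a wordlist that contains "softhyphen" the result is "softhyphen"; with one that does not, the hyphen is kept and the result is "soft-hyphen".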
9 Antiquity May need A.I. Footnotes/endnotes.  Innumerable footnote/endnote styles appear in PG etexts.  Here are some cases I've found:
  • Footnotes/endnotes discarded.
  • Footnotes embedded directly in text, in brackets like [text of footnote].
  • Endnotes marked with numbers in braces in the text {10} and collected at the end of the file: "{10} text of endnote."
  • Endnotes marked in the text like [FN#10] and collected at the end of the file like "[FN#10] text of endnote."
  • Endnotes marked with "*", "**", etc., in the body of the text, and collected at the ends of the chapters like "* text of endnote."
  • Footnotes marked with bracketed numbers [10] in the text body, and then collected at the ends of paragraphs like "[10] text of footnote."
  • Endnotes marked in the body with "(a)", "(b)", "(c)", etc.
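Several of the numbered and lettered styles above can at least be located with simple patterns, as in this Python sketch. It is not GutenMark code and the names MARKER_STYLES and find_markers are hypothetical. It deliberately leaves out the asterisk style and footnotes embedded as bracketed text, which are much harder to tell apart from ordinary punctuation.

```python
import re

MARKER_STYLES = {
    "braces": re.compile(r"\{(\d+)\}"),        # {10}
    "fn_hash": re.compile(r"\[FN#(\d+)\]"),    # [FN#10]
    "brackets": re.compile(r"\[(\d+)\]"),      # [10]
    "letters": re.compile(r"\(([a-z])\)"),     # (a), (b), (c)
}


def find_markers(text):
    """Return (position, style, label) for each marker found, in text order."""
    found = []
    for style, pattern in MARKER_STYLES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), style, match.group(1)))
    return sorted(found)
```

Counting which style dominates in a file would be one way to guess the convention a given volunteer used.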
8 Antiquity May need A.I. Restoration of Greek transliterated to Latin, back into Greek.  In some PG etexts, Greek text is simply discarded (and obviously cannot be recovered).  In other cases it has been transliterated to Latin characters, but there are various schemes for doing so, and these are seldom specified.  Furthermore, the transliterated text is often not marked in any way as being Greek.
7 Antiquity May need A.I. Restoration of missing currency symbols, particularly Pound (£) and Yen (¥).
6 Antiquity To-do Restoration of Spanish inverted exclamation points (¡) and question marks (¿).
5 Antiquity To-do The ability to recognize and italicize book titles should be added, along with a database of book titles in various languages.
4 Antiquity To-do Determination of Title/Author should be improved by using PG header data rather than just the first line of the file.
3 Antiquity Ongoing Recognition of verse vs. normal paragraph text needs improvement.
2 Antiquity Ongoing Identification of "prefatory" section needs improvement.
1 Antiquity Ongoing Identification of section headings needs improvement.


©2001-2002 Ronald S. Burkey.  Contact me.