GutenMark
Wordlists Page
Attractively formatting Project Gutenberg texts
Contents
What are wordlists and namelists?
What are they good for?
What wordlists are available?
Massaging the wordlists
Configuring
What are wordlists and namelists?
They are simply lists of words or names in a given language, prepared in
a format required by GutenMark.
What are they good for?
In reformatting Project Gutenberg etexts, there are many features of the
text that GutenMark has a relatively easy time interpreting, because
interpreting them is simply a matter of transforming data present within
the etext into another form. In many other cases, however, the data
needed for attractive formatting has simply been discarded -- or at least,
reduced -- during creation of the etext, and hence is no longer present
in the etext. In these cases, GutenMark has to work a lot harder.
It tries its best to recreate this data from whatever general knowledge
(not specific to the text) is available. Some knowledge
of this kind can be obtained from wordlists and namelists, and can be applied
to the following problems:
-
The ALL-CAPS style of italicizing. [NOTE: GutenMark
can support this feature without wordlists, but using wordlists makes the
support better.] Many PG etexts represent italicized words
in all-capital letters, as in "I can't believe SHE did that!" In
print, this should be rendered as "I can't believe she did that!", with
"she" in italics. Unfortunately, it is not obvious how such a word should
be capitalized, or for that matter, whether it should be italicized at
all. Consider the following examples: "I can't believe JOHN did that!"
and "I can't believe NASA did that!" These should be rendered "I can't
believe John did that!" (with "John" in italics) and "I can't believe NASA
did that!" (unchanged). GutenMark handles this by means of
wordlists/namelists. Since the wordlists contain the word "she", but not
"She" or "SHE", GutenMark knows that SHE must be converted to she. Since
the wordlists contain "John" but not "JOHN", JOHN is converted to John.
Finally, since the wordlists actually contain the word "NASA", GutenMark
understands that it should be left unchanged.
-
Italicizing foreign words. In English, it is correct (though
it seems to me to be increasingly rare) to italicize foreign words and
phrases, such as "junta" in "the military's junta was unsuccessful."
Wordlists can be used to locate these non-English words.
-
Restoration of diacritical marks. Most PG etexts discard diacritical
marks. For example, the name "Schrödinger" would normally appear
in a PG etext as "Schrodinger." By consulting the wordlists, GutenMark
can find the correct form and restore it.
-
Hyphenation. [NOTE: This feature is not yet implemented.]
PG etexts contain so-called "hard carriage-returns" at the ends of the
text lines, which GutenMark is forced to remove in order to re-justify
the paragraphs. If a word in the PG etext was broken across two lines
by hyphenation (a practice that PG does not recommend), then a false hyphen
might appear in the re-justified etext. For example, if one text
line ended with "in-" and the next began with "visible", then the marked-up
HTML would contain "in-visible." GutenMark can examine the
wordlists to determine that "invisible" is a correct form, and that the
hyphen can therefore be removed.
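The lookups described in the list above can be sketched in a few lines of
Python. This is only an illustration of the idea, not GutenMark's actual
code: the function names, the order in which capitalizations are tried, and
the tiny sample wordlist are all assumptions, and the hyphenation step in
particular is hypothetical, since that feature is not yet implemented.

```python
import unicodedata


def restore_case(token, wordlist):
    """Pick the capitalization for an ALL-CAPS token using a wordlist."""
    if token in wordlist:               # "NASA" is listed in all-caps: keep it
        return token
    if token.capitalize() in wordlist:  # "John" is listed capitalized
        return token.capitalize()
    if token.lower() in wordlist:       # "she" is listed in lower-case
        return token.lower()
    return token                        # unknown word: leave it alone


def strip_marks(word):
    """Remove combining accents, e.g. "Schrödinger" -> "Schrodinger".

    Letters with no decomposition (such as ø or ß) are left as they are.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def restore_diacritics(token, wordlist):
    """Return the accented wordlist entry that matches an unaccented token."""
    if token in wordlist:
        return token
    for entry in sorted(wordlist):      # sorted only to make ties deterministic
        if strip_marks(entry) == token:
            return entry
    return token


def join_hyphenated(left, right, wordlist):
    """Drop a line-break hyphen when the joined word is in the wordlist."""
    joined = left + right
    if joined in wordlist or joined.lower() in wordlist:
        return joined
    return left + "-" + right


# A made-up sample wordlist, capitalized according to the wordlist rules.
words = {"she", "John", "NASA", "Schrödinger", "invisible"}

print(restore_case("SHE", words))                 # she
print(restore_case("JOHN", words))                # John
print(restore_case("NASA", words))                # NASA
print(restore_diacritics("Schrodinger", words))   # Schrödinger
print(join_hyphenated("in", "visible", words))    # invisible
```

Note that a simple capitalize() cannot recover a name like "MacMurray" from
"MACMURRAY"; a real implementation would need a case-insensitive index of
the wordlist for that.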
What wordlists are available?
I have not created any wordlist data myself (except as indicated),
have no copyrights on the wordlists, and am not in a position to grant
licenses for them. I believe that all of the wordlists available
from the GutenMark website should be freely usable (to the extent
described below), but if you have information to the contrary, please inform
me. In each case, the
data for creating the wordlist was available for free download from the
Internet, and was then massaged by software utilities (available in the
GutenMark
distribution) to transform the wordlist into a GutenMark-compatible
format. This being the case, you could easily download the data from
the original source, and process it yourself into the required format.
Description | Original Source | Transformation Utilities (see below) | Apparent Status of Source Data
My own special English words. (Things that annoyed me by not being in english.words.gz.) | Right here! | n/a | GPL.
My own special non-English words. (Things that annoyed me by not being in the non-English wordlists.) | Right here! | n/a | GPL.
U.S. namelist | dist.all.last, dist.female.first, dist.male.first | names_english | This data is from the U.S. Census Bureau, and seemingly available under the Freedom of Information Act.
U.S. placenames | Numerous files from U.S. Geological Survey | USGS, sort, NoDups | Public domain.
Non-U.S. placenames | Numerous files from the National Imagery and Mapping Agency | NIMA, sort, NoDups | No copyright or licensing restrictions
French namelist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL
English wordlist | ispell-enwl-3.1.20.tar.gz | n/a | Free, but refer to the documentation for restrictions.
French wordlist | Francais-GUTenberg-v1.0.tar.gz | ispell -e, string2line | GPL
Older, smaller, German wordlist, old spelling rules (german.words.gz) | hk2-deutsch.tar.gz | ispell -e, hk2_deutsch | Seemingly free, but I can't be 100% sure from the docs. This was bundled with my SuSE Linux distribution.
Newer, bigger, German wordlist, new spelling rules (german2.words.gz) | igerman | ispell -e, hk2_deutsch | GPL.
Latin wordlist | dictpage.txt | words197 | Free, though it's hard to infer this with certainty from the docs. Here is the assurance I received when inquiring directly of the author.
Italian wordlist | ispell-it2000.tgz | ispell -e, string2line | GPL
Spanish wordlist | espa~nol.tar.gz | ispell -e, espa~nol_filter | GPL
Norwegian wordlist | ispell-norsk-2.0.tar.gz | make, norsk | GPL
Gaelic wordlist | ispell-gaeilge-1.0.tar.gz | ispell -e, string2line | GPL
Danish wordlist | ispell-da-1.4.21.tar.gz | ispell -e, string2line | GPL
Swedish wordlist | iswedish-1.2.1.tar.gz | ispell -e, string2line | GPL
Finnish wordlist | finnish.dict.bz2, finnish.large.aff.bz2 | ispell -e, string2line | GPL
Massaging the wordlists
As mentioned above, you don't need to use the wordlists provided on the
GutenMark download page. They are provided simply as a convenience:
alternatively, you could download the original datasets from their creators
and massage them with GutenMark-provided utilities to get the necessary
wordlists. Or you could even produce completely new GutenMark
wordlists for unsupported languages or other purposes.
The format of a GutenMark wordlist is simple:
-
It is an ASCII text file (extended with the 8-bit character codes listed
below), which has been compressed with the GNU gzip program.
-
It contains a line for each word. The lines can't contain any whitespace,
or anything other than the word itself.
-
The words should be capitalized as follows: If a word must
be in all-caps, like "NASA", then put it in all-caps. If the word
requires some special capitalization, such as "John" or "MacMurray", then
capitalize it accordingly. For normal words that are usually in lower-case,
but are capitalized at the beginnings of sentences, use all lower-case.
-
The words can contain any character in the following table, but not leading
or trailing apostrophes. The table includes both numerical
codes (for non-ASCII characters) and the characters themselves, but the
characters may or may not appear correctly, depending on your browser and
its settings:
' (apostrophe) | | 173: (soft hyphen) | | |
A | a | 192: À | 217: Ù | 224: à | 249: ù
B | b | 193: Á | 218: Ú | 225: á | 250: ú
C | c | 194: Â | 219: Û | 226: â | 251: û
D | d | 195: Ã | 220: Ü | 227: ã | 252: ü
E | e | 196: Ä | 221: Ý | 228: ä | 253: ý
F | f | 197: Å | 222: Þ | 229: å | 254: þ
G | g | 198: Æ | 223: ß | 230: æ | 255: ÿ
H | h | 199: Ç | | 231: ç |
I | i | 200: È | | 232: è |
J | j | 201: É | | 233: é |
K | k | 202: Ê | | 234: ê |
L | l | 203: Ë | | 235: ë |
M | m | 204: Ì | | 236: ì |
N | n | 205: Í | | 237: í |
O | o | 206: Î | | 238: î |
P | p | 207: Ï | | 239: ï |
Q | q | 208: Ð | | 240: ð |
R | r | 209: Ñ | | 241: ñ |
S | s | 210: Ò | | 242: ò |
T | t | 211: Ó | | 243: ó |
U | u | 212: Ô | | 244: ô |
V | v | 213: Õ | | 245: õ |
W | w | 214: Ö | | 246: ö |
X | x | | | |
Y | y | 216: Ø | | 248: ø |
Z | z | | | |
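The format rules above can be checked mechanically. The following Python
sketch validates candidate words against the character table and writes the
survivors to a gzip-compressed file. It is a sketch under stated
assumptions rather than a GutenMark tool: the function names are made up,
and it assumes that the codes above 127 are stored as single ISO 8859-1
(Latin-1) bytes, which is what the numerical codes in the table correspond
to.

```python
import gzip
import string

# Characters permitted by the table above: ASCII letters, the apostrophe,
# the soft hyphen (173), and codes 192-255 except 215 and 247, which the
# table leaves blank.
ALLOWED = set(string.ascii_letters) | {"'", chr(173)}
ALLOWED |= {chr(code) for code in range(192, 256) if code not in (215, 247)}


def is_valid_word(word):
    """Check one candidate line against the wordlist format rules."""
    if not word:
        return False                      # every line must hold a word
    if word.startswith("'") or word.endswith("'"):
        return False                      # no leading/trailing apostrophes
    return all(ch in ALLOWED for ch in word)  # also rejects whitespace


def write_wordlist(path, words):
    """Write the valid words, one per line, to a gzip-compressed file."""
    kept = [w for w in words if is_valid_word(w)]
    # Assumption: non-ASCII characters are written as Latin-1 bytes.
    with gzip.open(path, "wt", encoding="latin-1", newline="\n") as out:
        for word in kept:
            out.write(word + "\n")
    return kept
```

For example, write_wordlist("my.words.gz", ["NASA", "John", "she", "'tis"])
would keep the first three words and drop "'tis" because of its leading
apostrophe. The file name here is arbitrary.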
Unfortunately, the process of creating a wordlist will not be easy for
most people, and since it varies from case to case it cannot be described
in detail here. It will be easiest for those with programming experience,
and such knowledge is assumed in the next couple of paragraphs.
Most of the existing wordlists were created from language databases
for the *nix spell-checker program called "ispell". Ispell databases
don't contain wordlists as such, but do contain word data and so-called
"affix" files. By combining these two, with the ispell '-e' command-line
switch, a wordlist can be produced. Some existing ispell databases don't
incorporate diacritical marks directly, but expect them to be encoded by
some funky sequence of characters. For example, 'Schrödinger' might appear
as 'Schro"dinger'. GutenMark wordlists must contain the former rather than
the latter.
For this reason and others, ispell wordlists always need some additional
post-processing to be acceptable to GutenMark. The post-processing
for the existing GutenMark wordlists is performed by various little
utility programs (listed in a table above) provided by GutenMark.
All of the utilities are simple command-line filters. For more info,
I fear you must look at the actual source code for the utilities.
Fortunately, this source code is quite simple.
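To give a feel for what such a filter does, here is a minimal Python sketch
of the diacritic re-encoding step described above. It is not the source of
any of the GutenMark utilities. The mapping is an assumption extrapolated
from the single 'Schro"dinger' example (a vowel followed by a double quote
stands for that vowel with an umlaut); other ispell databases use other
conventions, so a real filter has to be written against the database at
hand.

```python
# Assumed convention: vowel + double quote means the umlauted vowel.
UMLAUTS = {
    'a"': "ä", 'o"': "ö", 'u"': "ü",
    'A"': "Ä", 'O"': "Ö", 'U"': "Ü",
}


def decode_umlauts(word):
    """Replace each two-character sequence with the accented letter."""
    for sequence, letter in UMLAUTS.items():
        word = word.replace(sequence, letter)
    return word


def filter_lines(lines):
    """Yield one cleaned-up word per non-empty input line."""
    for line in lines:
        word = line.strip()
        if word:
            yield decode_umlauts(word)


sample = ['Schro"dinger\n', "\n", 'u"ber\n', "Haus\n"]
print(list(filter_lines(sample)))   # ['Schrödinger', 'über', 'Haus']
```

To use it as a command-line filter, pass sys.stdin to filter_lines() and
print each result.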
It's unlikely that anyone will actually want to create a wordlist.
But if you do, you might want to tell me about it, so that I can post
the wordlist here for download.
Configuring
The appropriateness and search-order of the various wordlists depends somewhat
on the etexts being formatted. In general, you want to search them
in the following order:
-
Namelists for the language the etext is in.
-
Namelists for other languages.
-
Wordlists for the language the etext is in.
-
Wordlists for other languages, in order of decreasing probability of finding
words from that language within the etext.
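The effect of the search order can be illustrated with a short Python
sketch. The data layout, the list names, and the sample words are all
hypothetical; in particular, this does not reflect the syntax of
GutenMark.cfg. The point is only that the first list containing a word
decides how the word is classified.

```python
def lookup(token, search_order):
    """Return (list_name, kind) for the first list that contains token.

    search_order is a sequence of (list_name, kind, words) tuples arranged
    in the order described above; kind is "native" or "foreign".
    """
    for list_name, kind, words in search_order:
        if token in words:
            return list_name, kind
    return None


# Hypothetical setup for an English etext.
search_order = [
    ("English names", "native", {"John"}),
    ("French names", "foreign", {"Jean"}),
    ("English words", "native", {"she", "pain"}),
    ("French words", "foreign", {"pain", "raison"}),
    ("Spanish words", "foreign", {"junta"}),
]

print(lookup("pain", search_order))    # ('English words', 'native')
print(lookup("junta", search_order))   # ('Spanish words', 'foreign')
```

Because "pain" is found in the English list before the French one, it is
treated as a native word; swapping those two lists would make it foreign.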
Obviously, the default search order may not be appropriate for the kinds of
etexts you are converting. For example, your etexts may not be in
English, or they may be more likely to contain Latin than French.
You can change the search order by modifying the file GutenMark.cfg.
You can do this in any text editor, and the way in which you have
to change the file will be obvious to you upon inspection. The configuration
file can contain various named 'profiles', and each profile can incorporate
different language wordlists or different search orders for the wordlists,
and can designate each directory as being "native" or "foreign."
The desired profile can be chosen with GutenMark command-line switches.
GutenMark.cfg must be located in the directory from which you run GutenMark.
The wordlists are also usually in the same directory, but need not be if the
configuration file is edited appropriately.
©2001-2002 Ronald S. Burkey. Last updated
03/11/02 by RSB. Contact me.