Ron's Indexing Program (RIP)
Frequently Asked Questions
Console-based text indexing, retrieval, and browsing


"Reflections" by Lynn Rothan

What's with the whimsical graphic?

This is an image of a painting called "Reflections", courtesy of artist Lynn Rothan.  It depicts the frustration and confusion I felt when first trying to release this software to the public.  (It looks a little like Cary Grant in Arsenic and Old Lace, don't you think?)  My original idea was simply to release the MS-DOS version -- there being no *NIX version at the time -- by announcing my project on freshmeat.net.  I thought that RIP could be ported to *NIX later if -- and it's a big if -- anybody was interested.  But after cleaning up the source code, adding the GPL, writing docs, setting up the web-page, etc., Freshmeat politely informed me that they didn't "do" MS-DOS.  In other words, I needed to port the program to UNIX or else I had no channel to bring the program to public attention.  No wonder the poor birdman in the painting is throwing up his wings in confusion while contemplating MS-DOS (on his right) and UNIX (on his left)!

Check out the artist's website if you think the painting looks cool.

Is the file format portable?

I've verified that identical files are produced on MS-DOS (which is 16-bit, 'x86, little-endian) and on iMac Linux (which is 32-bit, PowerPC, big-endian).  I don't guarantee that the files are identical on all other systems, but this would seem to be a pretty good indication that they are.

Where's the Win32 GUI version?

As I mentioned, there is such a program.  It's used only for browsing and searching, but you have to use the command-line version for compressing/indexing the etexts.  I haven't cleaned up the source enough to GPL it and release it yet.  When I've done that, it will appear here.

Is it just me, or is the source-code confusing and inefficient?

The original MS-DOS source code may be confusing because of the many tricks employed to overcome RAM limitations.  Operations that could be accomplished efficiently in an unlimited amount of RAM are split into small pieces to accommodate a 640K total-memory limitation and a 64K limitation on the size of individual objects.  These now-obsolete tricks have not been removed from the UNIX source code.  The UNIX code could undoubtedly be cleaned up a lot, and made to run a lot faster, by removing these artificial barriers.

Also, the source code was originally written for a 16-bit compiler, Borland's Turbo C 2.x, but has been ported to a 32-bit compiler, GNU gcc.  There are numerous differences between these systems that are difficult to overcome in a program of any complexity.  The primary difficulties are that the integer datatypes are different (int and unsigned are 16 bits in Turbo C but 32 bits in GNU gcc) and the "console i/o" functionality of Turbo C is completely missing in gcc and has been mimicked with the ncurses library.  Rather than extensively rewriting RIP to overcome these limitations, I chose instead to write a general-purpose library (TurboC) that could be used to port any Turbo C program (not merely RIP) with minimal rewriting.  But an unfortunate side effect is that the code has thereby become more confusing, in particular because the int and unsigned datatypes appear to be 32-bit, but have been made 16-bit by macro substitution.

How does RIP compare to other text-indexing programs?

Well, I've made no attempt to do a systematic comparison, but ...

Before creating RIP in 1996, I tried as hard as I could to find an existing (free) system that had all of the characteristics I wanted.  A couple of years later, I did find (and purchase) a commercial system that worked quite well (www.dtsearch.com), but I wouldn't characterize it as free -- nor even as "affordable" for individuals other than enthusiasts.  It's a little ironic that exactly two days after reviving this project and creating the RIP website, in 2002, I came across a notice of a pretty acceptable GPL'd indexing/retrieval system having most of the characteristics I want, called Namazu.

Anyhow, I became curious and decided to compare the two systems.  (If I come across any other alternatives, I'll post a comparison with them also.)  For a test, I've indexed the Project Gutenberg year 2000 etexts (i.e., the etexts added just in the year 2000, and not the complete set of etexts as of 2000), from which I've removed the Human Genome Project files (which aren't really text files).  This leaves a set of 498 etexts totalling 199 megabytes uncompressed.  Considering RIP's age, and the fact that it's a 16-bit application, I'm pretty pleased with the results of the comparison.

By the way, don't treat this as a full feature-by-feature comparison of the systems being examined.  The test involves just the specific application that RIP was designed for, whereas general-purpose indexing systems (such as Namazu) have many features that RIP lacks.
 
Indexing system: RIP, UNIX
  Test conditions: 450 MHz iMac (PowerPC) with 320M RAM, running Linux.
  Resulting database size: 166M.
  Time taken to index: 14 minutes, including compression.
  Comments: (None.)

Indexing system: RIP, MS-DOS
  Test conditions: 500 MHz Pentium 3 with 128M RAM, emulating Windows 98 by means of VMware running under Linux.
  Resulting database size: 166M.
  Time taken to index: 31 minutes, including compression.
  Comments: Since emulated file operations under VMware are very slow, one would suppose that the indexing process would have run much faster on a native Win32 machine.

Indexing system: Namazu
  Test conditions: 450 MHz PowerPC with 320M RAM, running Linux.
  Resulting database size: 76M (projected 133M if all files had been indexed).
  Time taken to index: 78 minutes (projected 137 minutes if all files had been indexed).  Also, the text files were pre-compressed with gzip before indexing, and this processing time is not included in the 78 (137) minutes.
  Comments: Unfortunately, Namazu rejected 78 files, comprising about 43% of the database, as being "too big".  In other words, about 43% of the database was not indexed; that's why various "projected" numbers appear in the test results.


©2002 Ronald S. Burkey.  Last updated 04/20/02 by RSB.  Contact me.