Liam Quin's text retrieval package (lq-text)
Sat Nov 27 22:50:31 EST 1993

src/h/Revision.h defines this as Revision 1.13.

NOTE: this is not the "README" file that you put into a database
directory; use Sample/README for that (and then edit it).

lq-text is copyright 1989, 1990, 1991, 1992, 1993 Liam R. E. Quin;
see src/COPYRIGHT for details.  Parts of the source may also be
copyrighted by the University of California at Berkeley - see
src/qsort.c and src/db-1.xx.

Lqtext is a text retrieval package.  That means you can tell it about
lots of files, and later you can ask it questions about them.
The questions have to be
    which files contain this word?
    which files contain this phrase?
but this information turns out to be rather useful.

Lqtext has been designed to be reasonably fast.  It uses an inverted
index, which is simply a kind of database.  The index tends to be smaller
than the data it describes, but more than half as large.  You still need
to keep the original data.

Commands are
    lqword    -- information about words
    lqphrase  -- look up phrases
    lqrank    -- combine phrase searches, and sort the results
    lqaddfile -- add files to the database (at any time)
    lqshow    -- show the matches on the screen (uses curses)
    lqtext    -- curses-based front end
    lq        -- shell-script front end
    lqkwic    -- creates keyword-in-context indexes (this is fun!)

There are about 11,000 lines of C in total, of which 8,000 are the text
database and 3,000 are the curses front end (lqtext).
Well, last time I counted, anyway.

Here are some examples, based mostly on the (King James) New Testament,
simply because that is what I have lying around.

    mieza!lee> time lqphrase 'wept bitterly'
    2 35 10 955 KingJames/NT/Matthew/matt26.kjv
    2 26 47 995 KingJames/NT/Luke/luke22.kjv
        0.6 real    0.0 user    0.2 sys
    // The first number is the number of words in the
    // phrase -- 2 for "wept bitterly"

    mieza!lee> time lqword -l jesus > XXX
        1.0 real    0.4 user    0.4 sys
    mieza!lee> wc XXX
        983 4915 68604 XXX
    mieza!lee> sed 12q XXX
    1 0 8 930 KingJames/NT/Matthew/matt01.kjv
    1 5 21 930 KingJames/NT/Matthew/matt01.kjv
    1 6 24 930 KingJames/NT/Matthew/matt01.kjv
    1 8 48 930 KingJames/NT/Matthew/matt01.kjv
    1 10 49 930 KingJames/NT/Matthew/matt01.kjv
    1 0 4 931 KingJames/NT/Matthew/matt02.kjv
    1 6 4 932 KingJames/NT/Matthew/matt03.kjv
    (and so on for 983 lines)

So there are nine hundred and eighty-three matches.  After the leading
word count, the line for each match gives the block in the file, the word
within the block, the file number, and the filename.

The above timings were on a 16 MHz SPARC 4/110.

More useful things to do include:

    // see some of the matching text:
    mieza!lee> lqphrase 'wept bitterly' | lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/NT/Matthew/matt26.kjv ====
    1: thrice. And he went out, and wept bitterly.
    ==== Document 2: /home/mieza/lee/text/bible/KingJames/NT/Luke/luke22.kjv ====
    2:22:62 And Peter went out, and wept bitterly. 22:63 And the men that held Je
    mieza!lee>

    // which words contain "foot" or "feet"?
    mieza!lee> lqwordlist -g "f[oe][oe]t"
    afoot
    barefoot
    brokenfooted
    clovenfooted
    feet
    foot
    footmen
    footstep
    footstool
    fourfooted

    // documents containing "shoe" and "barefoot"
    mieza!lee> lqrank "barefoot" "shoe" | lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Isaiah/isa20.kjv ====
    1:ff thy loins, and put off thy shoe from thy foot. And he did so, walking na
    2: he did so, walking naked and barefoot. 20:3 And the LORD said, Like as my
    3: Isaiah hath walked naked and barefoot three years [for] a sign and wonder
    4:ves, young and old, naked and barefoot, even with [their] buttocks uncovere

    // save a query...
    // docs containing any of the following:
    mieza!lee> lqrank -r or serpent witch snake stick rod > skinny-things
    // documents containing abraham said, or god of abraham:
    mieza!lee> lqrank -r or "abraham said" "God of Abraham" > abe
    // documents appearing in both sets of results (intersect), if any:
    mieza!lee> lqrank -r and -f skinny-things -f abe |lqkwic
    ==== Document 1: /home/mieza/lee/text/bible/KingJames/OT/Exodus/exod04.kjv ====
    1:in thine hand? And he said, A rod. 4:3 And he said, Cast it on the ground.
    2:n the ground, and it became a serpent; and Moses fled from before it. 4:4 A
    3:nd caught it, and it became a rod in his hand: 4:5 That they may believe th
    4:ORD God of their fathers, the God of Abraham, the God of Isaac, and the God
    5:4:17 And thou shalt take this rod in thine hand, wherewith thou shalt do si
    6: of Egypt: and Moses took the rod of God in his hand. 4:21 And the LORD sai
    mieza!lee> // Ah, it was Moses I was thinking of...

The "lq" shell script is much more convenient for simple queries.
It's interactive -- give it a try.


How to Install lq-text

    unpack this tar
    cd lq-text/src
    edit Makefile
    edit h/globals.h (following the instructions in there)
    edit Makefile again after reading globals.h :-)
    make -i depend  # If you have mkdep.  If you don't, and you can't
                    # get it, don't worry.
    make            # this will put things in src/bin and src/lib
    make install    # This will put things in $BINDIR and $LIBDIR.


How to Use It (see doc/*)

Make a directory $HOME/LQTEXTDIR (or set $LQTEXTDIR to point to the
(currently empty) directory you want to contain the new database).

Include lq-text/src/bin and lq-text/src/lib in your search path if you
haven't done a "make install" yet.

Put a README file in $LQTEXTDIR:
    docpath /my/login/directory:/or/somewhere/else
    common Common
and make an empty file called Common (or include words like "uucp" that
you don't want indexed) in the same directory.
You can copy lq-text/Sample/README if you want, and then edit it.
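
The database directory setup described above can be sketched as a few
shell commands.  This is only a sketch, not part of the distribution: the
docpath entries are the placeholder paths from the example, and the
fallback directory /tmp/lqtext-demo is an assumption made for
illustration, so that trying it out does not disturb a real database.

```shell
# Sketch only: create an empty lq-text database directory.
# /tmp/lqtext-demo is a hypothetical scratch default; set LQTEXTDIR
# yourself (for example to $HOME/LQTEXTDIR) to use a real location.
LQTEXTDIR=${LQTEXTDIR:-/tmp/lqtext-demo}
export LQTEXTDIR
mkdir -p "$LQTEXTDIR"

# README names the document search path and the common-word file.
# The docpath entries are placeholders; edit them for your system.
cat > "$LQTEXTDIR/README" <<'EOF'
docpath /my/login/directory:/or/somewhere/else
common Common
EOF

# An empty Common file means no words are excluded from the index.
: > "$LQTEXTDIR/Common"

echo "database directory: $LQTEXTDIR"
```

After this, edit the docpath line and add any words you don't want
indexed to Common, one at a time, before running lqaddfile.
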
The common word list is searched linearly, so it is worth keeping it
fairly short.  Usually about a dozen words is plenty.  Don't bother
including words of less than three letters unless you have edited
src/wordrules.h, or have changed minwordlength in Sample/README, as short
words aren't normally included in the index.

Find some files (e.g. your mailbox) and say
    lqaddfile -t2 file [...]
You should see some diagnostic output... (this is what -t2 does).
lqaddfile may take several minutes to write out its data, depending on
the system.  Try a small file first -- you can add more later!

Another fun thing to try is setting DOCPATH to /usr/man and running
    cd /usr/man
    find man* -type f -print | lqaddfile -t2 -f -
to make an index of the manual pages (use cat* instead of man* if you
prefer).

If you have less than 10 meg or so of RAM, give lqaddfile the -w100000
option -- this is the number of words to keep in memory before writing to
the database.  The idea is that the number should be small enough to
prevent frantic paging activity!
I find that on my Sun 4/110, -w100000 makes lqaddfile grow to maybe 2
megabytes; -w300000 takes it up to 8 or 10 megabytes, but makes it run a
*lot* faster.

It's best to add lots of files at once, as in the example above using
find(1), rather than adding a file at a time - it can make a very large
difference in indexing speed, although probably no difference in retrieval
times in most cases.

Now try
    lqword  ---> an unsorted list of all known words
    lq      ---> type phrases and browse through them
    lqtext  ---> curses-based browser, if it compiled.
    lqrank  ---> a sorted list of matches
    lqkwic `lqphrase "floppy disk"` ---> this is the most fun.
    lqshow `lqphrase "floppy disk"` ---> lq does this for you

If the files you are indexing have pathnames with leading bits in common
(e.g. indexing a directory such as /usr/spool/news, or
/home/zx81/lee/text/humour), make use of DOCPATH.
DOCPATH is searched linearly, so a dozen or so entries is the practical
limit at the moment.

For example, if your README file contained the line
    docpath /usr/spool/news:/shared-text/books:.
and you ran the command
    lqaddfile simon/chapter3
lqaddfile would look for
    /usr/spool/news/simon/chapter3
    /shared-text/books/simon/chapter3
    ./simon/chapter3
in that order.  But it would only need to store "simon/chapter3" in the
index, and this can save a lot of space if you index large numbers of
files.  Of course, it's up to you to ensure that all of the filenames you
pass to lqaddfile are unique!

Every indexed pathname must fit into a dbm page, which is 4KBytes with
sdbm but probably much less (e.g. 512) with dbm.  With bsdhash this
problem has gone away.


Known Problems

lqaddfile may run slowly if the database directory is mounted over a
network with NFS.  Run lqaddfile on the NFS server -- there's no problem
with having the data files on a remote system, as long as all of the
systems accessing (and indexing) the data have the same CPU architecture.
The speed difference is approximately a factor of two or three, depending
on the speed of the NFS server and the amount of memory on the client.

With this distribution I am including both Ozan Yigit's sdbm package and
the BSD hash package (db) written by Ozan Yigit and Margo Seltzer.  I'm
including db 1.71; there is at least one more recent version, 1.72, but
the differences concern backward compatibility with older versions of db,
which doesn't matter for lq-text; both 1.71 and 1.72 are pretty recent.
Try using db first, and if that doesn't work use sdbm.  Sdbm has been
ported extensively, but is slower.  Db is part of 4.4 BSD; it works fine
on SunOS 4.x, but I haven't tried it on other systems, notably System
V-based systems.

If you end up with one or more empty .dir or .pag files in the LQTEXTDIR
directory, you probably have a broken sdbm/ndbm/dbm.  Try recompiling with
a different dbm package if possible.
In particular, early versions of sdbm had this problem.

There are some tests, but it is not always clear how to run them.
I intend to make a little test suite...

If you get strange error messages, try
    testbin/dbmtry 5000
(this will make and leave behind either one or two files in /tmp).
Then try
    testbin/dbmtry 10000
If that gives errors, the most likely problem is that you have a faulty
bcopy.  I have included a version of bcopy() that is linked in by
default -- perhaps you aren't using it?  Do _not_ use memcpy(), as it is
not guaranteed to handle overlapping regions correctly.

If -lmalloc fails, simply remove it in src/Makefile.  If you don't have
<malloc.h>, you can make an empty file called h/malloc.h (ugh).
I ship a Makefile with -lmalloc because it's such a big win when it is
available, and I wouldn't want anyone to forget it!

On a Sun, gcc might have some strange problems with libraries.  If so,
use cc.  Sorry.

You can use -O on all systems I've tried, and -O4 seems OK on the Sun --
at any rate I have done this on my Sun 4/110 under SunOS 4.0.3 here.
You can even compile with
    -O4 -Qoption iropt -l4
to do loop unrolling, if you want.  This makes the binaries bigger and may
give a speed improvement.  I have not tried using Sun's acc.

In ancient history, I used gcc -Wall under 386/ix.  I no longer have
access to Interactive Unix.

Versions of Unix predating the Norman Conquest may cause problems too.

For serious debugging, see the notes in src/Makefile.  If you are
debugging C programs without Saber-C, the first thing to do is to buy it.
It's worth it...  Otherwise, for debugging, compile with -DASCIITRACE.
You could also use -DMALLOCTRACE, which makes the malloc() routines print
messages to stderr, which can be processed with awk -- see
test/malloctrace.

If you use -DWIDINBLOCK everything will be much slower, but more errors
are reported.  WIDINBLOCK makes lqaddfile store in each data block the
Word Number (WID) of the owner of that data block.
This uses 4 bytes out of every 64 bytes of index, so you don't want to
leave this on by mistake!

See also PORTING and GuidedTour.

Lee

Liam R. E. Quin
lee@sq.com
{uunet,utzoo,cs.toronto.edu}!sq!lee