The Gatherer program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a gatherd daemon (if one isn't already running), and then exits. If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX cron command to run the Gatherer program at some regular interval.
To set up periodic gathering via cron, use the RunGatherer command that RunHarvest will create. An example RunGatherer script follows:
    #!/bin/sh
    #
    #  RunGatherer - Runs the ATT 800 Gatherer (from cron)
    #
    HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
    PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
    export PATH
    cd ${HARVEST_HOME}/gatherers/att800
    exec Gatherer att800.cf
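To invoke this script from cron, you might add an entry like the following to your crontab. The weekly schedule and the path to RunGatherer are only examples; adjust both for your installation:

    # Run the att800 Gatherer every Sunday at 2:00 a.m.
    0 2 * * 0 /usr/local/harvest/gatherers/att800/RunGatherer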
You should run the RunGatherd command from your /etc/rc.local file, so the Gatherer's database is exported each time the machine reboots. An example RunGatherd script follows:
    #!/bin/sh
    #
    #  RunGatherd - starts up the gatherd process (from /etc/rc.local)
    #
    HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
    PATH=${HARVEST_HOME}/lib/gatherer:$PATH; export PATH
    gatherd -dir ${HARVEST_HOME}/gatherers/att800/data 8001
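For example, assuming RunGatherd is installed in the att800 Gatherer directory, a line such as the following in /etc/rc.local would export the Gatherer's database each time the machine boots:

    /usr/local/harvest/gatherers/att800/RunGatherd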
The Gatherer maintains a local disk cache of the files it gathers, to reduce network traffic from periodic gathering or from restarting aborted gathering attempts. However, since the remote server must still be contacted whenever the Gatherer runs, please do not set your cron job to run the Gatherer frequently. A typical interval might be weekly or monthly, depending on how congested the network is and how important it is to have the most current data.
If you want your Broker's index to reflect new data, you must run the Gatherer and then have the Broker perform a collection. By default, a Broker performs collections once a day. If you want the Broker to collect data as soon as it is gathered, you will need to coordinate the timing so that the Gatherer finishes before the Broker's collection begins (see the crontab sketch below).
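For instance, assuming your Broker is configured to perform its daily collection at 4:00 a.m. (the actual time depends on your Broker's configuration, and the path to RunGatherer is an assumption for this example), a cron entry like the following would make fresh data available shortly before each collection:

    # Re-gather at 1:00 a.m., a few hours before the Broker's assumed
    # 4:00 a.m. daily collection.
    0 1 * * * /usr/local/harvest/gatherers/att800/RunGatherer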
If you run your Gatherer frequently, the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days. You can expire objects more quickly by setting the GATHERER_CACHE_TTL environment variable to the desired Time-To-Live (TTL) in seconds before you run the Gatherer, or you can change RunGatherer to remove the Gatherer's tmp directory after each run. For example, to expire objects in the local disk cache after one day:
    % setenv GATHERER_CACHE_TTL 86400     # one day
    % ./RunGatherer
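The setenv command above is C-shell syntax. If you set the variable from a Bourne-style shell (as cron jobs typically use /bin/sh), the equivalent would be:

    GATHERER_CACHE_TTL=86400; export GATHERER_CACHE_TTL     # one day
    ./RunGatherer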
One final note: the Gatherer's local disk cache size defaults to 32 MB, but you can change this by setting the HARVEST_MAX_LOCAL_CACHE environment variable to the desired maximum size in bytes before you run the Gatherer. For example, to limit the cache to 10 MB:
    % setenv HARVEST_MAX_LOCAL_CACHE 10485760     # 10 MB
    % ./RunGatherer
If you have access to the software that creates the files you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify that software to trigger real-time Gatherer updates whenever a file is created or changed. For example, if all users update the indexed files through a particular program, that program could be modified to run the Gatherer after the user's update completes, as in the sketch below.
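A minimal sketch of such a wrapper follows. The update-docs program name and the att800 Gatherer directory are assumptions for illustration; substitute your own update command and Gatherer directory:

    #!/bin/sh
    #
    #  gather-after-update - hypothetical wrapper that re-runs the Gatherer
    #  as soon as a site-specific update program finishes successfully.
    #
    HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME

    /usr/local/bin/update-docs "$@"      # the site's normal update program (assumed)
    status=$?

    if [ $status -eq 0 ]; then
        # Re-gather only when the update succeeded.
        cd ${HARVEST_HOME}/gatherers/att800 && ./RunGatherer
    fi

    exit $status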
Note that, when used in conjunction with cron, the Gatherer provides a more powerful data ``mirroring'' facility than the often-used mirror package. In particular, you can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes.