Cache mode storage

Introduction

Beginning with version 3.1.5, mnoGoSearch supports a new word storage mode, "cache", which can index and search quickly through several million documents.

Cache mode word indexes structure

The main idea of cache storage mode is that the word index is stored on disk rather than in the SQL database. URL information (table "url"), however, is kept in the SQL database. The word index is divided into 8192 files using a 32-bit word_id built from the CRC32 of the word. The index is located in files under the /var/tree directory of the mnoGoSearch installation.

Cache mode tools

There are three additional programs, cachelogd, splitter and mkind, used in "cache mode" indexing.

cachelogd is a TCP daemon which collects word information from indexers and stores it on your hard disk.

splitter is a program that creates fast word indexes from the data collected by cachelogd. Those indexes are used later in the search process.

mkind is a tool to create search limits by tags, category, etc.

Starting cache mode

To start "cache mode" follow these steps:

  1. Start cachelogd server:

    cd /usr/local/mnogosearch/sbin
    ./cachelogd 2>cachelogd.out &

    The first line changes to the /sbin directory of the base mnoGoSearch installation. cachelogd will write some debug information into the cachelogd.out file. It also creates a pid file in the /var directory of the base mnoGoSearch installation.

    cachelogd listens for TCP connections and can accept several indexers from different machines. The theoretical limit is about 128 indexers. cachelogd stores the information sent by indexers in the /var/raw/ directory of the mnoGoSearch installation.

    You can specify the port for cachelogd to use without recompiling. To do that, run

    ./cachelogd -p8000

    where 8000 is the port number you choose.

    You can also specify the directory in which to store data (the /var directory by default) with this command:

    ./cachelogd -w /path/to/var/dir

  2. Configure your indexer.conf as usual and add these two lines:

    DBMode cache
    LogdAddr localhost:7000

    The LogdAddr command specifies the cachelogd location. Each indexer connects to cachelogd at the given address at startup.

  3. Run indexers. Several indexers can be executed simultaneously. Note that you may install indexers on different machines and run them against the same cachelogd server. Distributing the work this way makes indexing faster.

  4. Create the word index. Once indexers have gathered some information and cachelogd has collected it in the /var/raw/ directory, it is possible to create fast word indexes. The "splitter" program is responsible for this. It is installed in the /sbin directory. Note that indexes can be created at any time without interrupting the current indexing process.

    Indexes are created in the following two steps:

    1. Send the HUP signal to cachelogd. cachelogd will close its current working logs and open new ones. You can use the cachelogd pid file to do this:

      kill -HUP `cat /usr/local/mnogosearch/var/cachelogd.pid`

    2. Build the word index. Run splitter without any arguments:

      /usr/local/mnogosearch/sbin/splitter

      It will sequentially take all 4096 prepared files in the /var/splitter/ directory and use them to build the fast word index. Processed logs in the /var/splitter/ directory are removed after this operation.
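
The two index-building steps above can be wrapped in a small script. The following is only a sketch, not part of mnoGoSearch: the function names and the PREFIX variable are made up for illustration, and PREFIX defaults to the installation path used in the examples of this section.

```shell
#!/bin/sh
# Sketch only: PREFIX and the function names below are illustrative
# assumptions, not something shipped with mnoGoSearch.
PREFIX=${PREFIX:-/usr/local/mnogosearch}

# Step 1: ask cachelogd to close its current logs and open new ones.
rotate_logs() {
    pidfile="$PREFIX/var/cachelogd.pid"
    if [ ! -r "$pidfile" ]; then
        echo "cannot read $pidfile; is cachelogd running?" >&2
        return 1
    fi
    kill -HUP "$(cat "$pidfile")"
}

# Step 2: build the word index from the prepared files.
build_index() {
    "$PREFIX/sbin/splitter"
}

# Run both steps in order; splitter is not started if the signal
# could not be sent.
rebuild() {
    rotate_logs && build_index
}
```

Calling rebuild at the end of the script (or from the shell after sourcing it) performs both steps. Checking the pid file first gives a readable error when cachelogd is not running, instead of a bare failure from kill.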

Optional usage of several splitters

splitter has two command line arguments, -f [first file] and -t [second file], which limit the range of files used. If no parameters are specified, splitter processes all 4096 prepared files. Both keys take their values in HEX notation. For example, splitter -f 000 -t A00 will create word indexes using files in the range from 000 to A00. These keys allow several splitters to run at the same time, which usually builds the indexes faster. For example, this shell script starts four splitters in the background:


#!/bin/sh
splitter -f 000 -t 3f0 &
splitter -f 400 -t 7f0 &
splitter -f 800 -t bf0 &
splitter -f c00 -t ff0 &
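
The script above returns as soon as the four splitters have been started. If a later step has to wait until all of them have finished, a variant along the following lines may help. It is a sketch under assumptions: the run_splitters function and the SPLITTER variable are illustrative names, and the "first:last" range notation is a convention of this sketch only. The only splitter options used are the -f and -t keys described above.

```shell
#!/bin/sh
# Sketch only: SPLITTER and run_splitters are illustrative names.
# The default path follows the examples in this document.
SPLITTER=${SPLITTER:-/usr/local/mnogosearch/sbin/splitter}

# Start one splitter per "first:last" range, then wait for all of them.
# Returns non-zero if any splitter exited with an error.
run_splitters() {
    pids=""
    for range in "$@"; do
        first=${range%%:*}      # text before the colon
        last=${range##*:}       # text after the colon
        "$SPLITTER" -f "$first" -t "$last" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}

# Example call with the same four ranges as the script above:
# run_splitters 000:3f0 400:7f0 800:bf0 c00:ff0
```

Collecting the process IDs and waiting on each one lets the caller see whether every splitter succeeded before going on.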

Using run-splitter script

There is a run-splitter script in the /sbin directory of the mnoGoSearch installation. It runs both index-building steps in sequence.

"run-splitter" has these two command line parameters:

run-splitter --hup --split

or a short version:

run-splitter -k -s

Each parameter activates the corresponding index-building step. run-splitter executes both steps of index building in the proper order:

  1. Sending the HUP signal to cachelogd. The --hup (or -k) argument is responsible for this.

  2. Running splitter. The --split (or -s) argument is responsible for this.

In most cases, just run the run-splitter script with both the -k and -s arguments. Using the two flags separately, one per index-building step, is rarely required.

Doing search

To start using search.cgi in "cache mode", edit your search.htm template as usual and add this line: DBMode cache

Using search limits

After splitter has updated the cache, you should create search limits if you plan to use them as search constraints. To create these limits, use mkind with the following options:


Usage: mkind [OPTIONS] [config file]

Options are:
  -c            create CATEGORY index
  -t            create TAG index
  -h            create TIME (hour) index
  -m            create TIME (min) index
  -u            create HOST (URL) index
  -l            create LANGUAGE index
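
As an illustration of the option list above, the following sketch builds several limit indexes one after another. The limit names (category, tag, hour, min, host, language), the function names and the MKIND variable are assumptions of this sketch. Because this section does not say whether mkind accepts several options in one call, the sketch runs it once per option, and it assumes mkind can be found on the PATH.

```shell
#!/bin/sh
# Sketch only: MKIND, mkind_flag and build_limits are illustrative names.
# mkind is assumed to be on the PATH.
MKIND=${MKIND:-mkind}

# Translate a readable limit name into the mkind option listed above.
mkind_flag() {
    case "$1" in
        category) printf '%s\n' "-c" ;;
        tag)      printf '%s\n' "-t" ;;
        hour)     printf '%s\n' "-h" ;;
        min)      printf '%s\n' "-m" ;;
        host)     printf '%s\n' "-u" ;;
        language) printf '%s\n' "-l" ;;
        *)        echo "unknown limit: $1" >&2; return 1 ;;
    esac
}

# Build each requested limit index in turn, stopping at the first error.
# Usage: build_limits CONFIG NAME...
build_limits() {
    config=$1
    shift
    for name in "$@"; do
        flag=$(mkind_flag "$name") || return 1
        "$MKIND" "$flag" "$config" || return 1
    done
}

# Example call:
# build_limits indexer.conf tag category
```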

To use, for example, a search limit by tag, add the following line to search.htm, or to searchd.conf if searchd is used:


Limit t:tag:lim_tag
where t is the name of the CGI parameter (&t=) for this constraint, tag is the type of constraint, and lim_tag names the file for this limit (lim_tag.dat).
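
To make the three fields of the Limit value concrete, the sketch below splits a value such as t:tag:lim_tag into its parts. The describe_limit function is a hypothetical helper for illustration and is not part of mnoGoSearch. It only restates the field meanings given above.

```shell
#!/bin/sh
# Sketch only: describe_limit is a hypothetical helper.
# Split a Limit value "param:type:name" into its three fields.
describe_limit() {
    spec=$1
    param=${spec%%:*}    # CGI parameter name, e.g. t for &t=
    rest=${spec#*:}
    kind=${rest%%:*}     # type of constraint, e.g. tag
    name=${rest#*:}      # limit name; its data file is name.dat
    echo "param=$param type=$kind file=$name.dat"
}

# Example call:
# describe_limit t:tag:lim_tag
```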