Extended indexing features

News extensions

By Heiko Stoermer

mnoGoSearch comes with an integrated extension to archive news servers (currently MySQL only! See the Section called Restrictions). This means that you can now download all messages from a news server and save them completely in a database.

Benefits

  • you can expire the messages on the news server to keep it slim and fast

  • you can search the complete message base with all the features that regular mnoGoSearch offers

  • you can still browse discussion threads over the complete archive

Restrictions

  • currently MySQL only (I would have really liked to do this for PostgreSQL, but some really annoying restrictions concerning query size and field size in PostgreSQL finally made me switch to MySQL.)

  • perl front-end only

  • single dict only (because the mysql-perl front-end does not support multi-dict)

To be implemented

No new features are planned for this extension. It works the way it is (at least as far as I can see) and does everything I wanted it to do. What I will do is make the code a bit more portable to other databases and fix the few very tiny bugs in the front-end. Of course, newly discovered bugs will be fixed. I'm maintaining it as well as I can.

Performance

Of course, important questions always are: how fast.../how big.../how long....

  • Our local intranet installation of mnoGoSearch says the following:

    
    
              mnoGoSearch statistics 
      
        Status    Expired      Total 
       ----------------------------- 
           200      76132      76132 OK 
           404        119        119 Not found 
           503         17         17 Service Unavailable 
           504        802        802 Gateway Timeout 
       ----------------------------- 
         Total      77070      77070 
    
    

    which means that roughly 77,000 messages are archived in the database

  • Current database size is 423 megabytes

  • The dict table has 6,076,462 entries

  • It's run on an AMD K6 400 with 64 MB of RAM (a very tiny machine)

  • Typical queries take between 2 and 10 seconds.

Installation

  1. Compile:

    Unpack the mnoGoSearch distribution archive. Run the configure script with the option --with-mysql, then make and make install as described in the regular install instructions. (A consolidated shell sketch of the whole installation follows this list.)

  2. Create Database:

    The news extension uses a slightly different database layout. The create files can be found in frontends/mysql-perl-news/create/. (Of course, you have to run mysqladmin create mnoGoSearch first and grant permissions to the account that the web front-end and indexer run as.)

  3. Install indexer.conf:

    An indexer.conf for incremental news archiving (messages hardly ever change...) can be found in frontends/mysql-perl-news/etc/, together with a sample cron shell script that can be run once a day or so. Please see indexer.conf for a detailed description of the indexing process.

  4. Install perl front-end:

    Copy frontends/mysql-perl-news/*.pl and frontends/mysql-perl-news/*.htm* to your cgi-bin directory.

    Copy frontends/mysql-perl-news/*.pm to your site's Perl library directory (site_perl or so), where the modules can be found by the Perl scripts.

    Edit search.htm and change the included database login information. The Perl front-end has additional features that allow you to browse message threads. You will see.

  5. Now you are set and can run indexer for the first time according to the instructions you can find in indexer.conf.
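
Putting steps 1, 2, and 4 together, the whole installation might look like the following shell sketch (the create file name and the target paths are assumptions; adjust them to your system):

# Step 1: compile with MySQL support
./configure --with-mysql
make
make install

# Step 2: create the database and load the news extension layout
# (the exact file name under create/ may differ)
mysqladmin create mnoGoSearch
mysql mnoGoSearch < frontends/mysql-perl-news/create/create.sql

# Step 4: install the Perl front-end (cgi-bin and site_perl paths are hypothetical)
cp frontends/mysql-perl-news/*.pl frontends/mysql-perl-news/*.htm* /usr/local/apache/cgi-bin/
cp frontends/mysql-perl-news/*.pm /usr/lib/perl5/site_perl/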

I hope this is a nice feature for you. If anyone is interested in porting this to other databases/multidict mode/the PHP front-end, PLEASE DO SO! I would be pleased and will assist you.

Indexing MP3 files

MP3 search works only on servers supporting the HTTP/1.1 protocol; if you wish to index FTP sites, use a proxy instead.
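
A minimal sketch using the indexer.conf Proxy command (the proxy host name and port are assumptions):

# route downloads through an HTTP/1.1 proxy (host and port are hypothetical)
Proxy proxy.example.com:3128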

MP3 indexer.conf commands

To activate MP3 tag detection, set up your indexer.conf with the following commands:

CheckMp3Tag yes

When this option is enabled, the spider downloads only 128 bytes of a file to determine whether it is an MP3 file.

IndexMP3TagOnly yes

When this option is enabled, only MP3 tags are indexed, and HTML documents are searched for links only. Otherwise, HTML and text documents are indexed in the usual manner and will be searchable.

URLFileWeight 1

You may find it useful to activate this option to index MP3 file names.
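
Combined, a minimal indexer.conf fragment for an MP3-only index might look like this sketch (the Server URL is an assumption):

# detect MP3 tags, index only the tags, and give file names weight 1
CheckMp3Tag yes
IndexMP3TagOnly yes
URLFileWeight 1
Server http://mp3.example.com/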

How indexer processes MP3 tags

If a file is recognized as MP3, indexer creates a document in the following format:


<html>
<title>$SongName</title>
<meta name=description content="$Artist">
<meta name=keywords content="$Album $Year">
<body>$Artist $Album $Year $SongName</body>
</html>

So the title is filled with the song name, the description with the artist, and the keywords with a combination of album and year. The body is a combination of artist, album, year, and song name.

Search through author, album, song name

If you want restricted search by author, album, or song name, use the standard mechanisms described in the section "Changing different document parts weights at search time" of the "Using search front-ends" section. For example, if you want to restrict search by song name, use the standard "by title" restriction.

With the default weights configuration given in indexer.conf-dist, you may find it useful to add this to search.htm to restrict the search area:


<SELECT NAME="wf">
<OPTION VALUE="11110" SELECTED="$(wf)">All sections
<OPTION VALUE="01000" SELECTED="$(wf)">Artist
<OPTION VALUE="00100" SELECTED="$(wf)">Album
<OPTION VALUE="00010" SELECTED="$(wf)">Song name
<OPTION VALUE="10000" SELECTED="$(wf)">File name
</SELECT>

Indexing SQL database tables (htdb: virtual URL scheme)

mnoGoSearch can index SQL database text fields via the so-called htdb: virtual URL scheme.

Using the htdb:/ virtual scheme, you can build a full-text index of your SQL tables as well as index your database-driven WWW server.

Note: currently mnoGoSearch can index only those tables that are in the same database as the mnoGoSearch tables (MySQL users may specify a database in the query, though). Also, the table you want to index must have a PRIMARY key.
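
For MySQL, specifying the database directly in the query looks like this sketch ('otherdb' and 'messages' are hypothetical names):

HTDBList SELECT id FROM otherdb.messages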

HTDB indexer.conf commands

Two indexer.conf commands provide HTDB: HTDBList and HTDBDoc.

HTDBList is an SQL query that generates the list of all URLs corresponding to records in the table, using the PRIMARY key field. You may use either absolute or relative URLs in the HTDBList command:

For example:


HTDBList SELECT concat('htdb:/',id) FROM messages
    or
HTDBList SELECT id FROM messages

HTDBDoc is an SQL query that fetches a single record from the database using the PRIMARY key value.

The HTDBList SQL query is used for all URLs that end with a '/' sign. For other URLs, the SQL query given in HTDBDoc is used.

Note: the HTDBDoc query must return a FULL HTTP response, including headers. So you can build a very flexible indexing system by returning different HTTP statuses from the query. Take a look at the "HTTP response codes" section of the documentation to understand indexer behavior when it gets different HTTP statuses.

If the HTDBDoc query returns no result, or returns several records, the HTDB retrieval system generates an "HTTP 404 Not Found" response. This may happen at reindex time if a record was deleted from your table since the last reindexing. You may use "DeleteBad yes" to delete such records from the mnoGoSearch tables as well.

You may use several HTDBDoc/List commands in one indexer.conf with corresponding Server commands.
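
A sketch of such a layout, assuming two hypothetical tables 'news' and 'articles', each with 'id' and 'body' fields (with Server htdb:/news/, the record ID becomes the second PATH part, hence $2; see the HTDB variables section below):

HTDBList SELECT id FROM news
HTDBDoc SELECT concat('HTTP/1.0 200 OK\r\n','Content-type: text/plain\r\n','\r\n',body) FROM news WHERE id='$2'
Server htdb:/news/

HTDBList SELECT id FROM articles
HTDBDoc SELECT concat('HTTP/1.0 200 OK\r\n','Content-type: text/plain\r\n','\r\n',body) FROM articles WHERE id='$2'
Server htdb:/articles/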

HTDB variables

You may use the PATH parts of the URL as parameters of both HTDBList and HTDBDoc SQL queries. All parts are available as $1, $2, ... $n, where the number is the number of the PATH part:


htdb:/part1/part2/part3/part4/part5
         $1    $2    $3    $4    $5

For example, you have this indexer.conf command:

HTDBList SELECT id FROM catalog WHERE category='$1'

When the htdb:/cars/ URL is indexed, $1 will be replaced with 'cars':

SELECT id FROM catalog WHERE category='cars'

You may use long URLs to provide several parameters to both HTDBList and HTDBDoc queries. For example, htdb:/path1/path2/path3/path4/id with query:

HTDBList SELECT id FROM table WHERE field1='$1' AND field2='$2' AND field3='$3'

This query will generate the following URLs:


htdb:/path1/path2/path3/path4/id1
...
htdb:/path1/path2/path3/path4/idN

for all values of the field "id" that appear in the HTDBList output.
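
A matching HTDBDoc query for this URL layout might look like the following sketch (the 'body' field is a hypothetical name; the record ID is the fifth PATH part, hence $5):

HTDBDoc SELECT concat('HTTP/1.0 200 OK\r\n','Content-type: text/plain\r\n','\r\n',body) FROM table WHERE field1='$1' AND field2='$2' AND field3='$3' AND id='$5'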

Creating full text index

Using the htdb:/ scheme you can create a full-text index and use it further in your application. Let's imagine you have a big SQL table which stores, for example, web board messages in plain text format, and you want to build an application with a message search facility. Let's say messages are stored in the "messages" table with two fields, "id" and "msg": "id" is an integer primary key and "msg" is a big text field containing the messages themselves. Using a usual SQL LIKE search may take a long time to answer:

SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

Using the mnoGoSearch htdb: scheme, you can create a full-text index on the "messages" table. Install mnoGoSearch in the usual order, then edit your indexer.conf:

DBAddr mysql://foo:bar@localhost/database/
DBMode single

HTDBList SELECT id FROM messages

HTDBDoc SELECT concat(\
'HTTP/1.0 200 OK\r\n',\
'Content-type: text/plain\r\n',\
'\r\n',\
msg) \
FROM messages WHERE id='$1'

Server htdb:/

After start, indexer will insert the 'htdb:/' URL into the database and run the SQL query given in HTDBList. It will produce the values 1, 2, 3, ..., N in its result. Those values are considered links relative to the 'htdb:/' URL, so a list of new URLs in the form htdb:/1, htdb:/2, ..., htdb:/N will be added into the database. Then the HTDBDoc SQL query will be executed for each new URL. HTDBDoc will produce an HTTP document for each record in the form:

HTTP/1.0 200 OK
Content-Type: text/plain

<some text from the 'msg' field here>

This document will be used to create the full-text index using words from the 'msg' field. The words will be stored in the 'dict' table, assuming that we are using the 'single' storage mode.

After indexing you can use mnoGoSearch tables to perform search:

SELECT url.url FROM url,dict WHERE dict.url_id=url.rec_id AND dict.word='someword';

Since the mnoGoSearch 'dict' table has an index on the 'word' field, this query will be executed much faster than queries which use an SQL LIKE search on the 'messages' table.

You can also use several words in search:

SELECT url.url, count(*) as c FROM url,dict WHERE dict.url_id=url.rec_id AND dict.word IN ('some','word') GROUP BY url.url ORDER BY c DESC;

Both queries will return 'htdb:/XXX' values in the url.url field. Your application then has to cut the leading 'htdb:/' from those values to get the PRIMARY key values of your 'messages' table.
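
The prefix can also be stripped directly in SQL; a sketch for MySQL ('htdb:/' is 6 characters long, so the key starts at position 7):

-- strip the leading 'htdb:/' to recover the 'messages' primary key
SELECT SUBSTRING(url.url, 7) AS id
FROM url, dict
WHERE dict.url_id=url.rec_id AND dict.word='someword';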

Indexing SQL database driven web server

You can also use the htdb:/ scheme to index your database-driven WWW server. It allows you to create indexes without having to invoke your web server during indexing, so it is much faster and requires fewer CPU resources than indexing directly from the WWW server.

The main idea of indexing a database-driven web server is to build the full-text index in the usual order. The only difference is that search must produce real URLs instead of URLs in the 'htdb:/...' form. This can be achieved using the mnoGoSearch aliasing tools.

Take a look at the sample indexer.conf in doc/samples/htdb.conf. It is the indexer.conf used to index our web board.

The HTDBList command generates URLs in the form:

http://search.mnogo.ru/board/message.php?id=XXX

where XXX is a "messages" table primary key value.

For each primary key value, the HTDBDoc command generates a text/html document with HTTP headers and content like this:


<HTML>
<HEAD>
<TITLE> ... subject field here .... </TITLE>
<META NAME="Description" Content=" ... author here ...">
</HEAD>
<BODY> ... message text here ... </BODY>
</HTML>

At the end of doc/samples/htdb.conf we wrote three commands:

Server htdb:/
Realm http://search.mnogo.ru/board/message.php?id=*
Alias http://search.mnogo.ru/board/message.php?id= htdb:/

The first command tells indexer to execute the HTDBList query, which will generate a list of messages in the form:

http://search.mnogo.ru/board/message.php?id=XXX

The second command allows indexer to accept such message URLs, using a string match with the '*' wildcard at the end.

The third command replaces the "http://search.mnogo.ru/board/message.php?id=" substring in the URL with "htdb:/" when indexer retrieves documents with messages. This means that "http://search.mnogo.ru/board/message.php?id=xxx" URLs will be shown in search results, but the "htdb:/xxx" URL will be indexed instead, where xxx is the PRIMARY key value, the ID of the record in the "messages" table.

Indexing binaries output (exec: and cgi: virtual URL schemes)

mnoGoSearch supports the exec: and cgi: virtual URL schemes. They allow running an external program, which must write its result to stdout. The result must be a standard HTTP response, i.e. HTTP response headers followed by the document's content.

For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, indexer will execute the /usr/local/bin/myprog program.

Passing parameters to cgi: virtual scheme

When executing a program given in the cgi: virtual scheme, indexer emulates that the program is running under an HTTP server: it creates a REQUEST_METHOD environment variable with the value "GET" and a QUERY_STRING variable according to the HTTP standards. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, indexer creates QUERY_STRING with the value a=b&d=e.

The cgi: virtual URL scheme allows indexing your site without having to invoke the web server, even if you want to index CGI scripts. For example, if you have a web site with static documents under /usr/local/apache/htdocs/ and CGI scripts under /usr/local/apache/cgi-bin/, use the following configuration:

Server http://localhost/
Alias http://localhost/cgi-bin/ cgi:/usr/local/apache/cgi-bin/
Alias http://localhost/ file:/usr/local/apache/htdocs/

Passing parameters to exec: virtual scheme

For the exec: scheme, indexer does not create a QUERY_STRING variable as it does for the cgi: scheme. Instead, it builds a command line with the argument given in the URL after the '?' sign. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:

/usr/local/bin/myprog "a=b&d=e"
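
A minimal sketch of such a program, written as a shell script (the output is illustrative; all that matters is that stdout is a standard HTTP response, as described above):

#!/bin/sh
# print a minimal HTTP response to stdout; the exec: argument arrives as $1
echo "HTTP/1.0 200 OK"
echo "Content-Type: text/plain"
echo ""
echo "Query was: $1"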

Using exec: virtual scheme as an external retrieval system

The exec: virtual scheme can be used as an external retrieval system: it allows using protocols which are not supported natively by mnoGoSearch. For example, you can use the curl program, which is available from http://curl.haxx.se/, to index HTTPS sites.

Put this short script into /usr/local/mnogosearch/etc/ under the name curl.sh:


#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes the URL given as its command line argument and executes curl to download it. The -i argument tells curl to output the result together with the HTTP headers.

Now use these commands in your indexer.conf:

Server https://some.https.site/
Alias https:// exec:/usr/local/mnogosearch/etc/curl.sh?https://

When indexing https://some.https.site/path/to/page.html, indexer will translate this URL to

exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html

execute the curl.sh script:

/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"

and take its output.

Mirroring

You may specify a path to a root directory to enable site mirroring:

MirrorRoot /path/to/mirror

You may also specify a root directory for mirrored documents' headers; indexer will then store HTTP headers to local disk too:

MirrorHeadersRoot /path/to/headers

You may specify a period during which previously mirrored files will be used while indexing, instead of real downloading:

MirrorPeriod <time>

This is very useful when you experiment with mnoGoSearch, indexing the same hosts repeatedly, and do not want to generate much traffic from/to the Internet. If MirrorHeadersRoot is not specified and headers are not stored on local disk, the default Content-Types given in AddType commands will be used. The default value of MirrorPeriod is -1, which means: do not use mirrored files.

<time> is given in the form xxxA[yyyB[zzzC]] (spaces are allowed between xxx and A, yyy and B, and so on), where xxx, yyy, zzz are numbers (they can be negative!). A, B, C can be one of the following:


s - second
M - minute
h - hour
d - day
m - month
y - year

(these letters are the same as in strptime/strftime functions)

Examples:


15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and 6 months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, it is assumed that the time is given in seconds (this behavior is kept for compatibility with versions prior to 3.1.7).

The following command will force using local copies for one day:

MirrorPeriod 1d
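
Putting the mirroring commands together, a minimal indexer.conf sketch (the paths are assumptions):

# store mirrored documents and their HTTP headers, reuse local copies for one day
MirrorRoot /usr/local/mnogosearch/mirror
MirrorHeadersRoot /usr/local/mnogosearch/headers
MirrorPeriod 1d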

If your pages are already indexed and you re-index with -a, indexer will check the headers and download only those files that have been modified since the last indexing. Thus, pages that were not modified will not be downloaded and therefore not mirrored either. To create the mirror you need to either (a) start again with a clean database or (b) use the -m switch.
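
For example, to force a full re-download into the mirror (a sketch, assuming the -a and -m switches described above):

indexer -a -m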

You can actually use the created files as a full-featured mirror of your site. However, be careful: indexer will not download a document that is larger than MaxDocSize; such a document will only be partially downloaded. If your site has no large documents, everything will be fine.