httpd-docs mailing list archives

From Joshua Slive <jos...@slive.ca>
Subject search.apache.org (fwd)
Date Wed, 30 Apr 2003 21:47:31 GMT
I haven't tried this out in detail, so I don't have any specific opinion.
But Bill has spent some time sorting out searching on apache.org, so we
may want to consider retiring our local search engine.

Joshua.


---------- Forwarded message ----------
Date: Wed, 30 Apr 2003 14:45:32 -0700
From: Bill Moseley <moseley@hank.org>
To: infrastructure@apache.org
Subject: search.apache.org

http://search.apache.org has been updated with the new index as was described here a week
or so ago.

The new index is a result of spidering instead of indexing the file system -- the index
files are about 11MB instead of 200MB.  Frees a little space on /x2, which is up at 95% full
again.

There may be things that don't need to be indexed (cvs update archives?), so let me know if
anything else should be excluded.  Or make use of robots.txt.

Site specific searches can be done by setting the "what" CGI parameter.  For example:

  http://search.apache.org/index.cgi?what=httpd&keyword=installation
or
  http://search.apache.org/index.cgi?what=docs2&keyword=installation

limit the search to just httpd.apache.org or just the 2.0 docs, respectively.
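
For anyone linking to these from a script, here is a minimal Python sketch of building such
URLs. The base URL and the "what" and "keyword" parameter names come from the examples above;
the search_url helper is a made-up name for illustration and is not part of the CGI script.

```python
from urllib.parse import urlencode

# Base URL as given in this message.
BASE_URL = "http://search.apache.org/index.cgi"

def search_url(keyword, what=None):
    """Build a search URL; `what` (e.g. "httpd" or "docs2") limits the site searched.

    Hypothetical helper for illustration only.
    """
    params = {}
    if what is not None:
        params["what"] = what
    params["keyword"] = keyword
    # urlencode escapes the values, so multi-word keywords are safe to pass in.
    return BASE_URL + "?" + urlencode(params)

print(search_url("installation", what="httpd"))
print(search_url("installation", what="docs2"))
```

Leaving out `what` gives a plain keyword query with no site restriction.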

The "advanced" form is just:

  http://search.apache.org/index.cgi

which allows searching over the entire site, searching by field, and setting the sort order.

There are two other features that are not shown by default.  One is to select "fuzzy"
searching, and the other is to limit searches by a date range.  I'm not sure they
need to be enabled at all.

Those features can be tested by:

  http://search.apache.org/index.cgi?full=1



And to ramble a bit...

This is all in Perl CGI, which is slow.  Plus, the highlighting code is set in the most
aggressive mode, and that's where most of the time is spent.  It's brute-force highlighting.

The CGI script runs the swish-e binary for searches, but only if there are not more than 4
swish-e binaries running as found by grepping the output from /bin/ps -Unobody -ocommand.
Still, hitting the CGI script hard will load the server, no doubt.
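
To make the throttle concrete, here is a rough sketch of the same idea in Python. The real
script is Perl and is not shown here, so the function names and the way the ps output is
parsed are my assumptions; only the ps command line and the limit of 4 come from the
description above.

```python
import subprocess

MAX_SWISH = 4  # limit described above

def current_ps_output(user="nobody"):
    """Run the ps command mentioned above and return its output as text."""
    result = subprocess.run(
        ["/bin/ps", "-U" + user, "-ocommand"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

def count_swish(ps_output, name="swish-e"):
    """Count output lines whose command (first field) is the swish-e binary."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split()
        # Compare only the basename so /usr/local/bin/swish-e also matches.
        if fields and fields[0].rsplit("/", 1)[-1] == name:
            count += 1
    return count

def may_run_search(ps_output, limit=MAX_SWISH):
    """Allow a new search only if no more than `limit` are already running."""
    return count_swish(ps_output) <= limit

# Demo on canned output (hypothetical), so nothing is executed here.
sample = "COMMAND\n/usr/local/bin/swish-e -w foo\nhttpd\nswish-e -w bar\n"
print(count_swish(sample), may_run_search(sample))
```

In a live setup one would call may_run_search(current_ps_output()) before starting the binary.
Note the check is racy: several requests arriving at once can all pass it before any of them
starts swish-e, so it softens load rather than enforcing a hard cap.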

Running under mod_perl would help, especially with highlighting turned off or down, and
using the swish-e C library (via SWISH::API module) instead of the swish-e binary.

Here are some rough requests-per-second figures, measured with ab on an Athlon XP 1800+
with 1/2GB RAM, Linux 2.4.20 and Apache/1.3.26 mod_perl/1.26.

                             Requests per Second

                                Highlighting Mode
                           Off      Phrase    Default     Simple
   ----------------------------------------------------------------------------
   Using SWISH::API        45        1.5        2          12
   Using swish-e binary    12        1.3       1.8         7.5

As you can see, the highlighting code is the limiting factor.  I have search.apache.org set up
for the swish-e binary and "Phrase" highlighting.  The worst combination. ;)
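
For a quick read of the table, the ratios can be worked out with a few lines of Python. The
numbers are copied from the table above; the dictionary layout and the speedup helper are
just for illustration.

```python
# Requests per second, copied from the table above.
rates = {
    "SWISH::API": {"Off": 45, "Phrase": 1.5, "Default": 2, "Simple": 12},
    "swish-e binary": {"Off": 12, "Phrase": 1.3, "Default": 1.8, "Simple": 7.5},
}

def speedup(mode):
    """How many times faster SWISH::API is than the binary for one mode."""
    return rates["SWISH::API"][mode] / rates["swish-e binary"][mode]

def highlight_cost(backend, mode):
    """How many times slower a highlighting mode is than 'Off' for one backend."""
    return rates[backend]["Off"] / rates[backend][mode]

for mode in ("Off", "Phrase", "Default", "Simple"):
    print(f"{mode:8s} API vs binary: {speedup(mode):.2f}x")

print(f"Phrase vs Off (SWISH::API): {highlight_cost('SWISH::API', 'Phrase'):.1f}x slower")
```

With highlighting off, the library is 45/12 = 3.75 times faster than the binary, but with
"Phrase" highlighting the two are nearly level, since the highlighting dominates the time.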


-- 
Bill Moseley
moseley@hank.org



---------------------------------------------------------------------
To unsubscribe, e-mail: docs-unsubscribe@httpd.apache.org
For additional commands, e-mail: docs-help@httpd.apache.org

