incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Lucy questions wrt production, ranking, etc
Date Thu, 08 Sep 2011 20:05:09 GMT
On Thu, Sep 08, 2011 at 03:53:58PM +0200, goran kent wrote:
> Early-adopter here.

Thanks for considering Lucy for your project.  It's true that Lucy's first
Apache release happened only a few months ago -- however, since the code base
that is now Lucy has been in development since 2005 or so, you probably don't
have to worry so much about archetypal early adopter concerns.

> I'm considering Lucy for a new project (and I must say, the docs are
> nice and it's Perl/C which is always welcome in this day and age).

:)
 
> So,... I gather from the mailing list that it's production ready, but
> officially API-unstable.  Does API-unstable mean the index format may
> change any time soon, eg, before the first stable release?

The possibility exists.   At this time, there are no incompatible changes
being considered, and given the cost to our users, it's not something we ever
do lightly.  However, no one is actively working to prepare a stable release,
either.  Development is presently focusing on making Lucy available from
languages other than Perl.

> I see from the docs that distributed search is supported, else it
> would be a non-starter.

If you like, take a look at LucyX::Remote::SearchServer and
LucyX::Remote::SearchClient.  They're implemented in pure Perl, so Perl
programmers generally find them easier to grok than other parts of the
library.

> Ranking
> -------
> I need to sort results based on a floating point value (actually
> several).  I see Lucy supports this.

Lucy supports sorting on the lexical value of text fields.  You can use
sprintf() or the like to stringify floats so that field values sort lexically
in the same order as they would numerically.
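To illustrate the stringification idea: a minimal sketch in Python rather than Perl, assuming non-negative values and a fixed width chosen up front.  The helper name `sortable_float` and its parameters are made up for this example and are not part of Lucy; the same printf-style format string should carry over to Perl's sprintf().  Negative numbers would need extra handling that is not shown.

```python
def sortable_float(value, int_digits=10, frac_digits=6):
    """Stringify a non-negative float so string order matches numeric order.

    Hypothetical helper for illustration only; not part of any library.
    """
    if value < 0:
        raise ValueError("this sketch handles only non-negative values")
    # Total width: integer digits + decimal point + fraction digits.
    width = int_digits + 1 + frac_digits
    # Zero-pad on the left so every string has the same length.
    text = "%0*.*f" % (width, frac_digits, value)
    if len(text) > width:
        raise ValueError("value too large for the chosen width")
    return text


prices = [19.99, 5.0, 100.25, 0.5]
encoded = [sortable_float(p) for p in prices]

# Sorting the strings gives the same order as sorting the numbers.
for text in sorted(encoded):
    print(text)
```

Because every string has the same length and is left-padded with zeros, comparing the strings character by character agrees with comparing the numbers.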

(I'm not sure how deep your investigations have gone, but if you are referring
to numeric FieldTypes: the underlying machinery is there, but it has not been
made public, and should be considered experimental and subject to change.)

> By how much does custom sorting impact search performance?

Search-time performance of sorting by field value is typically *faster* than
sorting by relevance.

We build optimized structures at index-time (arrays of ordinals mapping doc
ids to rank order), then mmap() those data structures at search-time.  Because
mmap is near instantaneous, opening a new Searcher is very quick, and new data
can be available for searching almost immediately after a commit completes.
Enabling sorting has negligible impact on opening a Searcher, even for large
indexes with many sortable fields.

Then, during the actual search, looking up an ordinal in an array is usually
faster than calculating a score.
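The ordinal idea can be sketched roughly as follows.  This is a simplified, in-memory illustration in Python, not Lucy's actual code or file format: the function names are invented for the example, doc ids are assumed to be list positions, and the mmap() step is left out entirely.

```python
from array import array


def build_ordinals(field_values):
    """Index time: map each doc id (list position) to the rank of its value.

    Hypothetical helper; documents with equal values share an ordinal.
    """
    unique_sorted = sorted(set(field_values))
    rank_of = {value: rank for rank, value in enumerate(unique_sorted)}
    return array("I", (rank_of[v] for v in field_values))


def sort_hits(hit_doc_ids, ordinals):
    """Search time: order matching doc ids with one array lookup per hit."""
    return sorted(hit_doc_ids, key=lambda doc_id: ordinals[doc_id])


titles = ["pear", "apple", "zebra", "mango", "apple"]
ordinals = build_ordinals(titles)   # one ordinal per doc id
hits = [2, 0, 4, 3]                 # doc ids matched by some query
ordered = sort_hits(hits, ordinals)
print(list(ordinals), ordered)
```

The expensive comparison of field values happens once, when the ordinals are built; at search time each hit costs only an integer lookup and integer comparisons.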

> What about term proximity in documents?  Will a matching document rank
> higher than another if two (or whatever being searched for) terms are
> physically located closer together?  Or is ranking based only on a
> term count ignoring positional info?

Ranking does not consider positional info.  This was discussed on lucy-dev a
couple weeks ago:

    http://markmail.org/message/3lmsixphxyrjveta

> Does Lucy consider the relative importance of the search terms
> themselves?  For example, searching for [a b c d] would imply that
> those terms' importance declines from left to right, with 'a' being
> the most important, etc.  I think there was a Page/Brin paper on this
> somewhere on the 'tubes.
 
It would be possible to write a query parser which does this, but it is not
built into the default parser.
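As a rough idea of what such a parser's first step might look like, here is a small Python sketch that assigns each term a weight that declines from left to right.  The function name, the whitespace tokenization, and the geometric decay factor are all assumptions made for illustration; turning the resulting pairs into boosted Lucy queries is not shown.

```python
def weight_terms(query, decay=0.5):
    """Pair each whitespace-separated term with a left-to-right declining weight.

    Hypothetical helper: the first term gets 1.0, and each later term
    gets the previous weight multiplied by `decay`.
    """
    terms = query.split()
    return [(term, decay ** position) for position, term in enumerate(terms)]


weighted = weight_terms("a b c d")
for term, weight in weighted:
    print(term, weight)
```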
 
> Phrase searches
> ---------------
> I see this is supported.  Hard to quantify, I know, but by what factor
> is phrase-searching slower than an equivalent term search?

Ironically, not that much.  That's actually because positions are iterated
while performing an ordinary term search (they are inlined into the posting
format).  Once that is changed, term queries will be sped up and the gap will
widen.

> Spelling suggestions
> --------------------
> I may have missed this one in the docs:  does Lucy support suggested
> spelling (a-la Google).  One could always use a dictionary, but it
> would be nice if Lucy built up a dictionary based on the terms
> encountered during indexing.

No, Lucy does not provide spelling suggestion facilities.

> Merging/optimization
> --------------------
> Merging multiple indexes into larger ones is supported.  I see there
> is also an 'optimize' for faster searching; can one update an index
> with newer pages after such an optimization, or is it a one-way
> street?

The word "optimize" is largely vestigial.  It doesn't do much, if anything, to
improve search-time performance.  It certainly doesn't freeze an index and
prevent you from making subsequent modifications.

> Index checking/verification
> ---------------------------
> In a cluster environment all kinds of things go wrong on a weekly
> basis - when this happens during indexing or merging indexes can be
> left in a broken state leading to problems in batch processing.  Does
> Lucy have an index-verifier (a-la fsck) to scan an index and report
> errors (not fix, just check and report)?
 
No such tool is bundled.  I agree that it would be nice to have, and at $work
we have our own internal version of such a tool.  However, FWIW, we've been
running hundreds of indexes spread out across tens of machines for a couple
years now, and random index corruption has not been something we've had to
deal with.  (Our tool checks for logical data consistency, not file
integrity.)

> Which version?
> --------------
> With index format stability being important, which version should I
> consider using?  0.2.x incubating, or trunk?

As far as index compatibility goes, there's no difference.  Nothing has
changed.

While we haven't put this to the test, Lucy will theoretically read old
KinoSearch 0.3x indexes, despite the namespace change.  These are the only
backwards breaks we've made in the last couple years:

  * Upgrade the Snowball stemming library (which affected German indexes).
  * Disallow "\p" constructs in Tokenizer regexes because of a Perl security
    problem affecting untrusted indexes.
  * Change the default value for "stored" in BlobType.

We've often made *forwards-incompatible* changes so that you can't roll back
to an older library version, but backwards-incompatible breaks are rare.

> Language/binding
> ----------------
> I see Perl can be used during indexing/searching, how about PHP on the
> search side?  Presumably PHP bindings (for search-related bindings at
> least) are on the horizon/done?  Not that important, just wondering.

Since we're working on bindings in general, this is getting easier.  However,
I don't know of any volunteer with both time and the itch to work on PHP
specifically.

> Scale
> -----
> Anyone using Lucy on a sizeable index split across nodes in a cluster?
> By sizeable I mean > 1-2TB.

I don't know of anyone who is using Lucy with a corpus of that size.

> If so, how's your search times (yes, I know, it depends on
> caching/memory/IO/CPUs/#nodes)?

Lucy's searching is fast, but could be considerably faster.  There are a lot
of known approaches which have simply not been applied because no developer
has made search speed a priority recently.  I could work on that, but frankly,
it's high-reward work that I would rather someone else got the credit for.

We worked hard on near-real-time search with sorting, so Lucy shines there.
If we get volunteers for whom search speed above and beyond what we have today
is important, we'll shine in that area as well.

Cheers,

Marvin Humphrey

