lucy-user mailing list archives

From goran kent <gorank...@gmail.com>
Subject [lucy-user] Lucy questions wrt production, ranking, etc
Date Thu, 08 Sep 2011 13:53:58 GMT
Hi,

Early-adopter here.

I'm considering Lucy for a new project (and I must say, the docs are
nice and it's Perl/C which is always welcome in this day and age).

So, I gather from the mailing list that it's production-ready, but
officially API-unstable.  Does API-unstable mean the index format may
change at any time, e.g., before the first stable release?

The environment is distributed search across a cluster, with the intent
of keeping search time sub-second, or 3s at the very most (folks are
spoilt by the elephant in the industry, so they lose interest if the
page does not return in that time).

I see from the docs that distributed search is supported, else it
would be a non-starter.


Ranking
-------
I need to sort results based on a floating point value (actually
several).  I see Lucy supports this.  By how much does custom sorting
impact search performance?

What about term proximity in documents?  Will a matching document rank
higher than another if the search terms (two, or however many there
are) are physically located closer together?  Or is ranking based only
on a term count, ignoring positional info?

What if the matching terms are physically closer to the TOP of the document?
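
To make the proximity question concrete, this is roughly the signal I
have in mind, as a plain Python sketch.  It is not Lucy code and says
nothing about how Lucy scores; the function name smallest_window is made
up for illustration.  It finds the shortest run of tokens that contains
every search term, so a smaller number means the terms sit closer
together.

```python
def smallest_window(tokens, terms):
    """Length of the shortest token span containing every term, or None.

    Illustrative only: a hypothetical helper, not part of any Lucy API.
    """
    wanted = set(terms)
    counts = {}   # occurrences of each wanted term inside the current window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok not in wanted:
            continue
        counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers every term.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            ltok = tokens[left]
            if ltok in counts:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    del counts[ltok]
            left += 1
    return best


near = "apache lucy is a search library".split()
far = "lucy a b c d search".split()
print(smallest_window(near, ["lucy", "search"]))
print(smallest_window(far, ["lucy", "search"]))
```

A ranker that used this would prefer the first document over the second
for the query [lucy search], even though both contain each term once.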

Does Lucy consider the relative importance of the search terms
themselves?  For example, searching for [a b c d] would imply that
those terms' importance declines from left to right, with 'a' being
the most important, etc.  I think there was a Page/Brin paper on this
somewhere on the 'tubes.


Phrase searches
---------------
I see this is supported.  Hard to quantify, I know, but by what factor
is phrase-searching slower than an equivalent term search?


Spelling suggestions
--------------------
I may have missed this one in the docs:  does Lucy support suggested
spelling (à la Google)?  One could always use a dictionary, but it
would be nice if Lucy built up a dictionary based on the terms
encountered during indexing.
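
By "a dictionary based on the terms encountered during indexing" I mean
something like the following Python sketch.  Again, this is not Lucy
code: build_vocabulary and suggest are hypothetical names, and the
matching uses difflib from the Python standard library purely as a
stand-in for whatever similarity measure a real implementation would
use.

```python
from collections import Counter
from difflib import get_close_matches


def build_vocabulary(docs):
    """Count every whitespace-separated term seen while 'indexing' docs."""
    vocab = Counter()
    for text in docs:
        vocab.update(text.lower().split())
    return vocab


def suggest(word, vocab, n=3):
    """Return up to n indexed terms that look like the (misspelt) word."""
    return get_close_matches(word.lower(), list(vocab), n=n, cutoff=0.6)


vocab = build_vocabulary([
    "apache lucy search library",
    "lucy index search",
])
print(suggest("serach", vocab))
```

The point is only that the suggestions come from terms that actually
occur in the index rather than from an external word list.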


Merging/optimization
--------------------
Merging multiple indexes into larger ones is supported.  I see there
is also an 'optimize' for faster searching; can one update an index
with newer pages after such an optimization, or is it a one-way
street?


Index checking/verification
---------------------------
In a cluster environment all kinds of things go wrong on a weekly
basis.  When this happens during indexing or merging, indexes can be
left in a broken state, leading to problems in batch processing.  Does
Lucy have an index verifier (à la fsck) to scan an index and report
errors (not fix, just check and report)?


Which version?
--------------
With index format stability being important, which version should I
consider using?  0.2.x incubating, or trunk?


Language/binding
----------------
I see Perl can be used for indexing/searching.  How about PHP on the
search side?  Presumably PHP bindings (for the search side at least)
are on the horizon/done?  Not that important, just wondering.


Scale
-----
Anyone using Lucy on a sizeable index split across nodes in a cluster?
By sizeable I mean > 1-2TB.  If so, how are your search times (yes, I
know, it depends on caching/memory/IO/CPUs/#nodes)?


Any comments would be appreciated.

g
