lucy-user mailing list archives

From "Andrew S. Townley" <...@atownley.org>
Subject Re: [lucy-user] Feature question about Lucy vs. Ferret
Date Sat, 26 Feb 2011 13:17:15 GMT
Hi Marvin,

On 25 Feb 2011, at 4:12 PM, Marvin Humphrey wrote:

> On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:
>> Well, actually, I want it for more than that.  For my particular needs, I
>> need to get the field name where the match occurred in the document, and
>> then I'd ideally like to have the start offset into that field and the
>> length of the match.
> 
>> The Ferret::Search::Hit gives me the document number and the score, but
>> that's it.  In whatever list format the results are actually in, I'd also
>> like to have the information I mentioned.  If you weren't storing the offset
>> information, then it would make sense for it not to be available, but if you
>> were, then I'd expect to have the whole thing right there.  I can't see how
>> there'd be a performance issue in providing this information.
> 
> You have to generate that information after the fact, by post-processing the
> Hits that come back.  Lucy, Lucene, and Ferret all have the same behavior in
> this regard.
> 
> Matching and scoring are highly abstracted for speed.  The matching engine
> does not scan raw document content, a la an RDBMS full table scan -- instead,
> it iterates over heavily optimized data structures devoid of introspection
> overhead.  At the end of a search, you will only have documents and scores --
> not sophisticated metadata about what part of the subquery matched and what
> parts didn't and how much each matching part contributed to the score.
> Keeping track of such metadata during the matching phase would be
> prohibitively expensive.

I can understand the need to abstract a lot of things for speed.  As I've said before, I'm no
search expert, but I don't understand why, at the very least, the field information (e.g. the
name) can't be encoded in this data structure in such a way that you can determine it at match
time.  Highlighting and offsets are a different matter, and I never thought it was doing a
full-text scan or a table scan like an RDBMS.  If I wanted that, I'd just use regex searches
(which I do in some cases for small datasets).

Obviously, I'm missing something here, but I don't see why it matters to keep track of fields
at all if, for an "all fields" or "multiple field" search query, you can't tell which term and
which field matched when you get the results back.  Finding the offsets is clearly a much more
expensive operation, and I'm OK with having to do that after the search is completed, even if
I have to do my own matching without API support for highlighting.  However, this is only
practical if I know which term and which field matched, so that I don't have to effectively
perform the search again on the document (which is what Ferret seems to require).

> In Lucy, our highlighting capabilities are powered by the Highlight_Spans()
> method, which is invoked on a derivative of the Query object:
> 
>    /** Return an array of Span objects, indicating where in the given
>     * field the text that matches the parent query occurs.  In this case,
>     * the span's offset and length are measured in Unicode code points.
>     * The default implementation returns an empty array.    
>     *   
>     * @param searcher A Searcher.
>     * @param doc_vec A DocVector.
>     * @param field The name of the field.
>     */  
>    public incremented VArray*
>    Highlight_Spans(Compiler *self, Searcher *searcher, 
>                    DocVector *doc_vec, const CharBuf *field);
> 
> Perhaps that might be of use for you.

This API has the same problem as Ferret: if I don't know which field matched, then I've got to
try all the fields on the document (maybe more than 20 in some cases).  If you need this
information to display to users, then it doesn't matter how fast the search is if you're going
to slow down the whole interaction by checking anywhere from 2 to x fields for every match in
the results chunk you're processing.
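
As a rough illustration of that cost, here is a minimal Python sketch.  It assumes a toy
in-memory result set where each hit is a (doc_id, fields) pair; the function name and the data
are hypothetical and are not part of the Ferret or Lucy APIs.  It only shows that the number of
probes grows as hits times fields when the matching field is unknown.

```python
# Hypothetical toy data and names; this is NOT the Ferret or Lucy API.
# Each hit is (doc_id, {field_name: stored_text}).

def locate_matches(hits, fields, term):
    """Probe every field of every hit; return (matches, probe_count)."""
    matches = []
    probes = 0
    needle = term.lower()
    for doc_id, doc in hits:
        for field in fields:
            probes += 1  # one probe per (hit, field) pair, match or not
            text = doc.get(field, "")
            offset = text.lower().find(needle)
            if offset >= 0:
                # (document, field, start offset, match length)
                matches.append((doc_id, field, offset, len(term)))
    return matches, probes


hits = [
    (1, {"title": "Lucy vs. Ferret", "body": "A feature question"}),
    (2, {"title": "Highlighting", "body": "Ferret stores term offsets"}),
]
fields = ["title", "body", "author"]
matches, probes = locate_matches(hits, fields, "ferret")
```

With 2 hits and 3 fields this already makes 6 probes to find 2 matches; with 20 or more fields
per document the probing would dominate the time spent on each page of results.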

The advantage of the fulltext search capabilities exposed via a query language like FQL, or
whatever Lucy uses, is that you can effectively defer all of the introspection/heavy lifting
of the searching and results matching to the underlying fulltext system (or, at least, that's
the way I see it).  If you then don't have enough information available to fully describe
the matches in an efficient way, the only other option you have is to pre-process the query
to see if any explicit fields are present and then, if not, try all of the indexed fields to
see if they happen to match (effectively performing the search again over the result set).
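
The pre-processing step described above could be sketched roughly as follows in Python.  This
assumes a simplified "name:term" query syntax; real query languages such as FQL are richer than
this handles, and `candidate_fields` is a hypothetical helper, not part of any library.

```python
import re

# Simplified assumption: an explicit field appears as "name:" at the start of
# a whitespace-separated token. Grouping, wildcards, etc. are not handled.
FIELD_PREFIX = re.compile(r"(?<!\S)(\w+):")


def candidate_fields(query, all_fields):
    """Return the fields named in the query, or every indexed field if none."""
    named = [f for f in FIELD_PREFIX.findall(query) if f in all_fields]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    unique = list(dict.fromkeys(named))
    return unique if unique else list(all_fields)
```

For a query like "title:lucy body:ferret" only the two named fields need to be checked
afterwards; a bare "ferret" still falls back to every indexed field, which is exactly the
expensive case.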

Maybe I'm using it wrong, or maybe I just don't get it, but these are the kinds of things
I need to do.

[snip]

>> I tried to dig through the lucy SVN repository via the web UI, but I
>> couldn't really figure out what's there.  The code generator framework
>> you're using is something I haven't seen before, but at least it explains
>> why I couldn't find the Ruby bindings! :)
> 
> There's a short high-level introduction to the Lucy codebase here:
> 
>  http://svn.apache.org/repos/asf/incubator/lucy/trunk/core/Lucy/Docs/DevGuide.cfh

You weren't kidding about the "short" part! :)  Still, thanks for the pointer.  I'd seen it
earlier.

>> Presently tinkering with the Ferret internals since it seems like there
>> ought to be a way to expose what I want (it's in the explain output)
> 
> That might work.  Most people use the Explanation API for tuning and
> troubleshooting, though; it might prove a little expensive or unwieldy for
> what you're doing.

After spending about 12-14 hours trying to get my head around the code and the way the searching
worked, I gave up.  There wasn't a good, consistent API abstraction that allowed you to access,
from the internals of the search code, the same information that the explain code uses.  Explain
is also overloaded for each subclass, but not in a universal way, so exposing it would've
required more surgery than I was prepared to do at the C level given the time I have.
Jens took an alternative approach and implemented some changes at the Ruby level.  These
helped, but they still required some tweaking to be usable from both the Searcher API and the
Index API since, again, some of the information available for the index isn't available for
the Searcher API.

For now, thanks to Jens' patch, I have the capability to do what I need to do with Ferret, even
if it isn't as fast as it could be.  However, unless the same type of information is exposed
at an API level in Lucy, the same kinds of workarounds would be required to use Lucy instead
of Ferret for my application.

At least my Wednesday turned out not to be a totally wasted day after all! :)

Cheers for all the information,

ast
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org

