incubator-lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Feature question about Lucy vs. Ferret
Date Fri, 25 Feb 2011 16:12:28 GMT
On Wed, Feb 23, 2011 at 05:14:49PM +0000, Andrew S. Townley wrote:
> Well, actually, I want it for more than that.  For my particular needs, I
> need to get the field name where the match occurred in the document, and
> then I'd ideally like to have the start offset into that field and the
> length of the match.
 
> The Ferret::Search::Hit gives me the document number and the score, but
> that's it.  In whatever list format the results are actually in, I'd also
> like to have the information I mentioned.  If you weren't storing the offset
> information, then it would make sense for it not to be available, but if you
> were, then I'd expect to have the whole thing right there.  I can't see how
> there'd be a performance issue in providing this information.

You have to generate that information after the fact, by post-processing the
Hits that come back.  Lucy, Lucene, and Ferret all have the same behavior in
this regard.
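
As a rough illustration of such post-processing, the sketch below scans the stored text of one field of a hit for a term and records the field name, start offset, and length of each occurrence, measured in code points. Everything here is an assumption made for illustration: `MatchSpan`, `count_code_points`, and `find_term_spans` are hypothetical names rather than Lucy, Lucene, or Ferret API, and a plain substring search stands in for real analysis (tokenizing, case folding, stemming).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one match; not a Lucy or Ferret type. */
typedef struct {
    const char *field;   /* name of the field that was scanned */
    size_t      offset;  /* start of the match, in code points */
    size_t      length;  /* length of the match, in code points */
} MatchSpan;

/* Count code points in the first n_bytes of valid UTF-8: every byte
 * that is not a continuation byte (10xxxxxx) starts a code point. */
size_t
count_code_points(const char *utf8, size_t n_bytes) {
    size_t count = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        if (((unsigned char)utf8[i] & 0xC0) != 0x80) { count++; }
    }
    return count;
}

/* Scan one stored field of a hit for exact occurrences of term.
 * Writes at most max_spans entries and returns how many were written. */
size_t
find_term_spans(const char *field, const char *text, const char *term,
                MatchSpan *spans, size_t max_spans) {
    assert(field != NULL && text != NULL && term != NULL && spans != NULL);
    size_t term_bytes = strlen(term);
    size_t found = 0;
    if (term_bytes == 0) { return 0; }
    const char *cursor = text;
    while (found < max_spans && (cursor = strstr(cursor, term)) != NULL) {
        spans[found].field  = field;
        spans[found].offset = count_code_points(text, (size_t)(cursor - text));
        spans[found].length = count_code_points(term, term_bytes);
        found++;
        cursor += term_bytes;   /* resume after this match */
    }
    return found;
}
```

A caller would run this once per field of each document in the result list, which is why the cost is paid only for the hits that are actually displayed rather than for every document considered during matching.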

Matching and scoring are highly abstracted for speed.  The matching engine
does not scan raw document content, a la an RDBMS full table scan -- instead,
it iterates over heavily optimized data structures devoid of introspection
overhead.  At the end of a search, you will only have documents and scores --
not sophisticated metadata about what part of the subquery matched and what
parts didn't and how much each matching part contributed to the score.
Keeping track of such metadata during the matching phase would be
prohibitively expensive.
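
To make the point concrete, here is a toy AND-matcher over two sorted posting lists. It is only a sketch under simplifying assumptions, not Lucy's actual matcher: `Posting`, `Hit`, and `match_both` are hypothetical names, and each term's contribution to the score is assumed to be precomputed. Note that the inner loop touches only document ids and weights, so nothing about fields or positions survives into the output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical posting: one entry per document containing a term. */
typedef struct {
    int   doc_id;
    float weight;   /* assumed precomputed score contribution */
} Posting;

/* What the matcher hands back: a document and a score, nothing more. */
typedef struct {
    int   doc_id;
    float score;
} Hit;

/* AND-match two posting lists sorted by ascending doc_id.
 * Writes at most max_hits entries and returns how many were written. */
size_t
match_both(const Posting *a, size_t a_len, const Posting *b, size_t b_len,
           Hit *hits, size_t max_hits) {
    assert(a != NULL && b != NULL && hits != NULL);
    size_t i = 0, j = 0, n = 0;
    while (i < a_len && j < b_len && n < max_hits) {
        if (a[i].doc_id < b[j].doc_id) {
            i++;                      /* a is behind: skip ahead */
        } else if (a[i].doc_id > b[j].doc_id) {
            j++;                      /* b is behind: skip ahead */
        } else {
            hits[n].doc_id = a[i].doc_id;
            hits[n].score  = a[i].weight + b[j].weight;
            n++; i++; j++;
        }
    }
    return n;
}
```

Recording which term matched where would mean carrying extra state through every iteration of that loop, including for documents that end up being discarded.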

In Lucy, our highlighting capabilities are powered by the Highlight_Spans()
method, which is invoked on a derivative of the Query object:

    /** Return an array of Span objects, indicating where in the given
     * field the text that matches the parent query occurs.  In this case,
     * the span's offset and length are measured in Unicode code points.
     * The default implementation returns an empty array.    
     *   
     * @param searcher A Searcher.
     * @param doc_vec A DocVector.
     * @param field The name of the field.
     */  
    public incremented VArray*
    Highlight_Spans(Compiler *self, Searcher *searcher, 
                    DocVector *doc_vec, const CharBuf *field);

Perhaps that might be of use for you.
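
Because the spans are measured in Unicode code points, C or C++ code that holds the field text as UTF-8 has to translate them into byte ranges before slicing the buffer. The following is a minimal sketch of that translation, assuming valid, NUL-terminated UTF-8; `utf8_advance` and `span_to_bytes` are hypothetical helpers, not part of Lucy.

```c
#include <assert.h>
#include <stddef.h>

/* Advance over n_points code points of valid, NUL-terminated UTF-8 and
 * return the number of bytes consumed.  Stops early at the terminator. */
size_t
utf8_advance(const char *utf8, size_t n_points) {
    assert(utf8 != NULL);
    size_t bytes = 0;
    while (n_points > 0 && utf8[bytes] != '\0') {
        bytes++;   /* lead byte */
        /* continuation bytes (10xxxxxx) belong to the same code point */
        while (((unsigned char)utf8[bytes] & 0xC0) == 0x80) { bytes++; }
        n_points--;
    }
    return bytes;
}

/* Translate a span measured in code points into a byte range. */
void
span_to_bytes(const char *utf8, size_t offset, size_t length,
              size_t *byte_start, size_t *byte_len) {
    assert(byte_start != NULL && byte_len != NULL);
    *byte_start = utf8_advance(utf8, offset);
    *byte_len   = utf8_advance(utf8 + *byte_start, length);
}
```

For example, in the text "caf\xc3\xa9 lucy" a span with offset 5 and length 4 covers the word "lucy", which begins at byte 6 because the accented letter occupies two bytes.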

> Part of the reason I ask has to do with the future of my own project.  Much
> of what I have now will eventually be rewritten piecemeal in C++ and then
> wrapped via SWIG so I can have Ruby and Java bindings as well as use it in
> other environments natively supporting C/C++.  Whatever route I end up going
> for fulltext, this is something that would need to support the same kind of
> thing as I'd actually be leveraging it more from the C++ code than the Ruby
> code.

I concur with Nate that this is exactly the kind of project that we would like
to support with Lucy.

> With the way the statement above is phrased, it seems like this wouldn't
> really be possible.  It also seems like there might be an awful lot of
> duplication of effort involved in actually creating each language binding.
> Why was this approach chosen rather than put all the muscle in the C code
> and provide thin wrappers--even via SWIG or something more hand-tailored
> where necessary/appropriate?

From the very start we've been determined to make Lucy's bindings feel like
native code in the host language, so that users would feel as at home as
possible.  However, we've changed our approach over the years.  Now nearly
everything's in C, but we've modified our object model to make e.g. native
subclassing transparent and easy.  This approach has proven highly successful;
most KinoSearch power users do some degree of subclassing, and a number of
projects have been published on CPAN.

> I tried to dig through the lucy SVN repository via the web UI, but I
> couldn't really figure out what's there.  The code generator framework
> you're using is something I haven't seen before, but at least it explains
> why I couldn't find the Ruby bindings! :)

There's a short high-level introduction to the Lucy codebase here:

  http://svn.apache.org/repos/asf/incubator/lucy/trunk/core/Lucy/Docs/DevGuide.cfh

> Presently tinkering with the Ferret internals since it seems like there
> ought to be a way to expose what I want (it's in the explain output)

That might work.  Most people use the Explanation API for tuning and
troubleshooting, though; it might prove a little expensive or unwieldy for
what you're doing.

Marvin Humphrey

