Date: Wed, 17 Nov 2010 13:13:11 -0800
To: "KinoSearch discussion list."
Cc: lucy-dev@incubator.apache.org
Message-ID: <20101117211311.GA29884@rectangular.com>
References: <4CE2AE2E.8030204@aevum.de> <20101116235239.GA26025@rectangular.com> <4CE3CA0B.3010909@aevum.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4CE3CA0B.3010909@aevum.de>
User-Agent: Mutt/1.5.13 (2006-08-11)
From: Marvin Humphrey
Subject: [lucy-dev] Re: [KinoSearch] Highlighter Bug

(cc'ing lucy-dev)

On Wed, Nov 17, 2010 at 01:26:51PM +0100, Nick Wellnhofer wrote:
> On 17/11/2010 00:52, Marvin Humphrey wrote:
> >On Tue, Nov 16, 2010 at 05:15:42PM +0100, Nick Wellnhofer wrote:
> >>
> >>I found a bug in the highlighter in KinoSearch 0.30_121 and 0.31. For
> >>some searches the excerpt doesn't contain any of the search terms, but
> >>only a sentence following the one that looks like it should have been
> >>picked.
> >>
> >>I had a quick look at the code in
> >>core/KinoSearch/Highlight/Highlighter.c and the test "candidate>= top"
> >>in Highlighter_raw_excerpt looks fishy to me. AFAICS this might cause
> >>sentences to be skipped.
> >
> >You're probably right that that's where we're skipping ahead, but it's not
> >clear what aspect of the input is causing Raw_Excerpt() to malfunction.
> >The check inside that block is supposed to verify that the chosen excerpt
> >contains at least some relevant material -- that's what the S_has_heat()
> >call is there for.
>
> I had a closer look and the error case is that you have three sentences
> where only the first and the last contain keywords but the middle one is
> chosen for the excerpt.
>
> >This will be hard to debug without sample material. Once we have a
> >document which triggers the bad behavior, we can start throwing in
> >debugging printf's. If you can supply me with sample code which has the
> >problem, I'll have a good hard look.
>
> See the attached test case.
>
> Nick

Thank you for the test case! It illustrates the problem well and I have
started investigating.
Since Lucy has inherited KinoSearch's highlighter code, the problem also exists
in Lucy and I have opened an issue there.

    https://issues.apache.org/jira/browse/LUCY-126

Can you please create a JIRA account for yourself and attach your test case to
that issue as a contribution to the Apache Software Foundation, so that we can
use it for Lucy as well? Just as-is is fine; no need to adapt it for Lucy
specifically.

(Nick is already familiar with the intellectual property issues involved, but
for the rest of the KS list, here's an explanation: When a contribution
arrives on the KinoSearch mailing list, we may assume that it is intended for
use in the KinoSearch project under KinoSearch's GPL/Artistic license. This
license is more restrictive than Lucy's Apache License 2.0; we can't use such
contributions for Lucy without an additional step granting an additional
license for the code to the ASF.)

Marvin Humphrey

> #! perl
> use strict;
>
> use KinoSearch;
>
> my $schema = KinoSearch::Schema->new;
>
> my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
>     language => 'de',
> );
>
> $schema->spec_field(
>     name => 'url',
>     type => KinoSearch::FieldType::StringType->new(),
> );
> $schema->spec_field(
>     name => 'title',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer => $polyanalyzer,
>         boost => 3,
>         highlightable => 1,
>     ),
> );
> $schema->spec_field(
>     name => 'bodytext',
>     type => KinoSearch::FieldType::FullTextType->new(
>         analyzer => $polyanalyzer,
>         highlightable => 1,
>     ),
> );
>
> my $indexer = KinoSearch::Indexer->new(
>     schema => $schema,
>     index => 'debug_index',
>     create => 1,
>     truncate => 1,
> );
>
> $indexer->add_doc(
>     doc => {
>         url => 'urn:test',
>         title => 'Test',
>         bodytext => <<'EOF',
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla NNN bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla MMM bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla.
> EOF
>     },
> );
>
> $indexer->commit();
>
> my $searcher = KinoSearch::Searcher->new(
>     index => 'debug_index'
> );
> my $query = 'nnn mmm';
> my $bodytext_hl = KinoSearch::Highlight::Highlighter->new(
>     searcher => $searcher,
>     query => $query,
>     field => 'bodytext'
> );
> my $hits = $searcher->hits(
>     query => $query,
>     offset => 0,
>     num_wanted => 10,
> );
>
> while(my $hit = $hits->next()) {
>     print($bodytext_hl->create_excerpt($hit), "\n");
> }