From: "Mark Miller (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 23 Mar 2009 14:09:50 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-1522) another highlighter
Message-ID: <427272464.1237842590805.JavaMail.jira@brutus>
In-Reply-To: <1168939453.1232376359569.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419 ]

Mark Miller commented on LUCENE-1522:
-------------------------------------

I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly. That was/is a common occurrence and you can find tons of examples. There are one or two JIRA highlighters that address it, and then there is the Span highlighter (more interestingly, there is a link to the birth of the Span highlighter idea on that page - thanks M. Harwood).

Once users see PhraseQuery highlighting look right, I haven't really seen any other repeated complaints. While it would be nice to match boolean logic fully, I almost don't think it's worth the effort. You likely have an interest in those terms anyway - it's not a given that only the terms that caused the (non-positional) match matter. I have not seen a complaint on that one - mostly just positional type stuff. And I think we have positional solved fairly well with the current API - it's just too darn slow.

Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool). When I wrote the SpanScorer, it was discussed many times how things should *really* be done. Most methods involved working with core - but what has been there for a couple of years now is the SpanScorer that plugs into the current highlighter API, and nothing else has made any progress.

Not really an argument, just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of the day, but it's a tall order considering how little attention the Highlighter has managed to receive in the past. It's large on ideas and low on sweat.

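To make the PhraseQuery complaint concrete, here is a rough sketch of the difference between term-level and position-aware highlighting. It is plain Java with no Lucene dependency; the class and method names (PhraseHighlightSketch, highlightTerms, highlightPhrase) are made up for illustration, it only handles exact phrases (no slop), and it is not how the contrib Highlighter or the SpanScorer is actually implemented.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names come from Lucene. */
public class PhraseHighlightSketch {

    /** Term-level highlighting: mark every token that equals any query term. */
    static String highlightTerms(String[] tokens, List<String> queryTerms) {
        boolean[] hit = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            hit[i] = queryTerms.contains(tokens[i]);
        }
        return render(tokens, hit);
    }

    /** Position-aware highlighting: mark tokens only where the phrase terms
     *  appear in order at consecutive positions (exact phrase, slop 0). */
    static String highlightPhrase(String[] tokens, List<String> phrase) {
        boolean[] hit = new boolean[tokens.length];
        for (int start = 0; start + phrase.size() <= tokens.length; start++) {
            boolean match = true;
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens[start + j].equals(phrase.get(j))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int j = 0; j < phrase.size(); j++) {
                    hit[start + j] = true;
                }
            }
        }
        return render(tokens, hit);
    }

    /** Joins tokens with spaces, wrapping each marked token in <b>...</b>. */
    private static String render(String[] tokens, boolean[] hit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(' ');
            if (hit[i]) sb.append("<b>").append(tokens[i]).append("</b>");
            else sb.append(tokens[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "w1 w3 w2 w3 w1 w2".split(" ");
        List<String> phrase = Arrays.asList("w1", "w2");
        // Term-level: every w1 and w2 is marked, even outside the phrase.
        System.out.println(highlightTerms(tokens, phrase));
        // Position-aware: only the adjacent "w1 w2" at the end is marked.
        System.out.println(highlightPhrase(tokens, phrase));
    }
}
```

The first call marks the stray w1 and w2 at the start of the text, which is the behavior users keep complaining about for the query "w1 w2"; the second marks only the real phrase occurrence.
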
> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> w1 w2
> ---------------
> q="w1 w2"~1
> w1 w3 w2 w3 w1 w2
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.