From: Mark Miller
Date: Wed, 27 Jun 2007 09:21:25 -0400
To: java-user@lucene.apache.org
Subject: Re: Highlighter that works with phrase and span queries
Message-ID: <46826455.5000105@gmail.com>
In-Reply-To: <280820.27409.qm@web50310.mail.re2.yahoo.com>

Depending on what these guys are doing, here is another possibility if TermOffsets and Ronnie's highlighter are not an option.

If you are highlighting whole documents (NullFragmenter) or are not very concerned about the fragments you get back, you can change the line in the Highlighter at about line 255:

    tokenGroup.addToken(token, fragmentScorer.getTokenScore(token));

to:

    float score = fragmentScorer.getTokenScore(token);
    if (score > 0) {
        tokenGroup.addToken(token, score);
    }

This is not a full solution yet, more of a hack. Fragmenters won't be given the opportunity to start a new fragment at every token position, which is no problem if you are highlighting the whole document. Essentially, instead of the document being rebuilt from the source text using each individual token, it is rebuilt from the highlighted tokens and the differences in offsets between them. It is not so fragment-friendly without some changes to Fragmenter handling.
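
To make the "rebuilt from the highlighted tokens and the offset differences" idea concrete, here is a minimal, self-contained sketch. It does not use the Lucene classes: Tok, tokenize, score, and highlight are hypothetical stand-ins (a simple letter/digit tokenizer and a term-set lookup in place of fragmentScorer.getTokenScore), and the <B> markup is just an illustrative choice. Treat it as a model of the approach, not the contrib Highlighter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OffsetHighlightSketch {

    /** Hypothetical stand-in for a token: lowercased term plus character offsets into the source text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits text into runs of letters/digits, recording where each run starts and ends. */
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Tok(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    /** Assumed scoring rule: 1 if the term is a (lowercase) query term, otherwise 0. */
    static float score(Tok tok, Set<String> queryTerms) {
        return queryTerms.contains(tok.term) ? 1.0f : 0.0f;
    }

    /** Rebuilds the text by touching only scoring tokens; the gaps are copied in bulk by offset. */
    static String highlight(String text, List<Tok> tokens, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int lastEnd = 0;
        for (Tok tok : tokens) {
            float s = score(tok, queryTerms);
            if (s > 0) {
                // Copy everything since the previous highlighted token in one step.
                out.append(text, lastEnd, tok.start);
                out.append("<B>").append(text, tok.start, tok.end).append("</B>");
                lastEnd = tok.end;
            }
        }
        // Copy whatever follows the last highlighted token.
        out.append(text, lastEnd, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy fox.";
        Set<String> terms = new HashSet<>(Arrays.asList("fox"));
        System.out.println(highlight(text, tokenize(text), terms));
    }
}
```

Non-scoring tokens never enter the output loop body, which is where the saving comes from; it is also why a per-token fragment decision is no longer possible in this form.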

On a collection of 5,000 documents of 300-900 tokens each (weighted toward 300), this gave an improvement of 37-40% in highlighting speed. I imagine the gains grow as the document grows. I am looking into making this a more general solution, but it's a great quick hack for speed. It will also work with my SpanScorer, which correctly highlights Spans and PhraseQuerys.

- Mark

Otis Gospodnetic wrote:
> Hi Mark,
>
> I know one large user (meaning: high query/highlight rates) of the current Highlighter, and this user wasn't too happy with its performance. I don't know the details, other than that it was inefficient. So now I'm wondering if you've benchmarked your Highlighter against the current Highlighter to see not only which one is more accurate, but also which one is faster, and by how much?
>
> Thanks,
> Otis
>
> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>
> ----- Original Message ----
> From: Mark Miller
> To: java-user@lucene.apache.org
> Sent: Wednesday, June 20, 2007 12:39:27 AM
> Subject: Highlighter that works with phrase and span queries
>
> I have been working on extending the Highlighter with a new Scorer that
> correctly scores phrase and span queries. The highlighter is working
> great for me, but could really use some more banging on.
>
> If you have a need or an interest in a more accurate Highlighter, please
> give it a whirl and let me know how it went. Unlike most of the other
> alternate Lucene Highlighters, this one builds off the original contrib
> Highlighter so as to retain all of its goodness.
>
> http://myhardshadow.com/qsolreleases/lucene-highlighter-2.2.jar
>
> Example Usage
>
> IndexSearcher searcher = new IndexSearcher(ramDir);
> Query query = QueryParser.parse("Kenne*", FIELD_NAME, analyzer);
> query = query.rewrite(reader); //required to expand search terms
> Hits hits = searcher.search(query);
>
> for (int i = 0; i < hits.length(); i++)
> {
>     String text = hits.doc(i).get(FIELD_NAME);
>     CachingTokenFilter tokenStream = new CachingTokenFilter(
>         analyzer.tokenStream(FIELD_NAME, new StringReader(text)));
>     Highlighter highlighter = new Highlighter(
>         new SpanScorer(query, FIELD_NAME, tokenStream));
>     tokenStream.reset();
>
>     // Get 3 best fragments and separate with a "..."
>     String result = highlighter.getBestFragments(tokenStream, text, 3, "...");
>     System.out.println(result);
> }
>
> If you make a call to any of the getBestFragments() methods more than
> once, you must call reset() on the SpanScorer between each call.
>
> Pass null as the FIELD_NAME to ignore fields.
>
> If you want to Highlight the whole document, use a NullFragmenter.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org