Message-ID: <4679D058.9040403@gmail.com>
Date: Wed, 20 Jun 2007 21:11:52 -0400
From: Mark Miller
To: java-user@lucene.apache.org
Subject: Re: Highlighter that works with phrase and span queries
In-Reply-To: <280820.27409.qm@web50310.mail.re2.yahoo.com>

I will work up some performance numbers over the next day or two to share with you. I have spent the last day or two with a profiler trying to find the biggest performance drains. Unfortunately, I will probably not be able to squeeze out much more performance than the current Highlighter.

When I started working on this project I considered starting from scratch to create a better, more accurate Highlighter. After some initial work I quickly came to the realization that Mark Harwood (with some additions by others) had already solved too many corner cases and interesting needs. The few alternate Highlighters in JIRA did not meet the standards set by Mark's highlighter. Trying to replicate all that work in a different manner didn't seem like a fruitful approach -- Harwood is more clever than I.

Taking that into account, I decided to extend the Highlighter using the great framework that is already in place.
I implemented a new Scorer that acts much like the default Scorer, but when it finds a Query clause that is position sensitive (PhraseQuery, SpanQuery), it creates a MemoryIndex that is used to extract the correct Spans for the Query (credit to Paul Elschot and Mark Harwood for the approach). Non-position-sensitive Query clauses are handled much as they were in the original Highlighter's Scorer. This means that non-position-sensitive queries are likely the same speed as before, while position-sensitive queries are likely a bit slower.

For my uses, the thing is damned fast -- of course, my uses involve small documents (newspaper articles). I am very interested in making this thing as fast as possible though, so I will build some benchmark tests and try to squeeze as much performance out of the Highlighter as I can. I will also see if my Scorer is any faster than the original.

All that said, my guess is that one of the slowest parts of highlighting is re-tokenizing the text. There is always the option of turning on TermVectors and using org.apache.lucene.search.highlight.TokenSources to get the TokenStream. Based on Mark H's comments, it may be twice as fast as re-tokenizing. This method can also be used with my new Highlighter code (which is more a plug-in to the old Highlighter than a replacement).

Considering that both of your comments immediately went to performance, I will certainly be spending some time working on this.

- Mark

> Hi Mark,
>
> I know one large user (meaning: high query/highlight rates) of the current Highlighter and this user wasn't too happy with its performance. I don't know the details, other than it was inefficient. So now I'm wondering if you've benchmarked your Highlighter against that/current Highlighter to see not only which one is more accurate, but also which one is faster, and by how much?
>
> Thanks,
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>
> > This is really great, Mark. I'll look into integrating it with Solr,
> > as better phrase highlighting is a definite sore point for some of our
> > users.
> >
> > Any indication on performance differences?
> >
> > cheers,
> > -mike
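
As an aside on the approach described in the reply above: the difference between term-level and position-sensitive highlighting can be sketched in plain Java. This is a toy illustration under stated assumptions, not the Highlighter's or the new Scorer's actual code. It uses no Lucene classes, the names PhraseHighlightSketch, highlightTerms, phraseStarts and highlightPhrase are hypothetical, and exact consecutive-token matching stands in for the Spans that the real Scorer extracts through a MemoryIndex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: plain Java, no Lucene classes involved.
class PhraseHighlightSketch {

    // Term-level highlighting: marks every occurrence of any query term,
    // regardless of its position in the text.
    static String highlightTerms(String[] tokens, Set<String> terms) {
        boolean[] marked = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            marked[i] = terms.contains(tokens[i]);
        }
        return render(tokens, marked);
    }

    // Position-sensitive step: returns every token position at which the
    // phrase terms occur consecutively and in order.
    static List<Integer> phraseStarts(String[] tokens, String[] phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.length == 0) {
            return starts; // an empty phrase matches nothing
        }
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                starts.add(i);
            }
        }
        return starts;
    }

    // Phrase-aware highlighting: marks only tokens inside a matching span.
    static String highlightPhrase(String[] tokens, String[] phrase) {
        boolean[] marked = new boolean[tokens.length];
        for (int start : phraseStarts(tokens, phrase)) {
            for (int j = 0; j < phrase.length; j++) {
                marked[start + j] = true;
            }
        }
        return render(tokens, marked);
    }

    // Joins the tokens with single spaces, wrapping marked ones in <B> tags.
    private static String render(String[] tokens, boolean[] marked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(marked[i] ? "<B>" + tokens[i] + "</B>" : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "new york is new and york is old".split(" ");
        String[] phrase = {"new", "york"};
        System.out.println(highlightTerms(tokens, new HashSet<>(Arrays.asList(phrase))));
        System.out.println(highlightPhrase(tokens, phrase));
    }
}
```

With the sample text in main, the term-level pass marks both occurrences of "new" and both of "york", while the phrase-aware pass marks only the two adjacent tokens at the start. The extra pass over positions is also a rough picture of why position-sensitive clauses are expected to cost a bit more than plain term clauses.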