Message-ID: <17458776.1154520674812.JavaMail.jira@brutus>
Date: Wed, 2 Aug 2006 05:11:14 -0700 (PDT)
From: "Ronnie Kolehmainen (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-644) Contrib: another highlighter approach
In-Reply-To: <8600211.1154505253958.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/LUCENE-644?page=comments#action_12425208 ]

Ronnie Kolehmainen commented on LUCENE-644:
-------------------------------------------

Mark, I believe the performance gain is
mostly because TokenSources.getTokenStream iterates *all* terms in the field, no?

I've tested against a real index we have with theses and dissertations in full text. It's a 1.1 GB compound file. I think I will send you the code used for the test and the output instead of posting it here.

> Contrib: another highlighter approach
> -------------------------------------
>
>                 Key: LUCENE-644
>                 URL: http://issues.apache.org/jira/browse/LUCENE-644
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Ronnie Kolehmainen
>            Priority: Minor
>         Attachments: FulltextHighlighter.java, FulltextHighlighterTest.java, svn-diff.patch
>
>
> Mark Harwood's highlighter package is a great contribution to Lucene, I've used it a lot! However, when you have *large* documents (fields), highlighting can be quite time consuming if you increase the number of bytes to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is often too low for indexed PDFs etc., which results in empty highlight strings.
> This is an alternative approach using term position vectors only to build fragment info objects. Then a StringReader can read the relevant fragments and skip() between them. This is a lot faster. Also, this method uses the *entire* field for finding the best fragments, so you're always guaranteed to get a highlight snippet.
> Because this method only works with fields which have term positions stored, one can check whether it works for a particular field using the following code (taken from TokenSources.java):
>         TermFreqVector tfv = (TermFreqVector) reader.getTermFreqVector(docId, field);
>         if (tfv != null && tfv instanceof TermPositionVector)
>         {
>             // use FulltextHighlighter
>         }
>         else
>         {
>             // use standard Highlighter
>         }
> Someone else might find this useful so I'm posting the code here.

--
This message is automatically generated by JIRA.
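
For readers who want to see the "read the relevant fragments and skip() between them" idea in isolation, here is a minimal sketch using only the JDK. It is not the attached FulltextHighlighter code and does not touch Lucene at all: the Fragment class, the readFragments method and the hard-coded offsets are hypothetical names and values made up for illustration. In the approach described in the issue, the offsets would be derived from the term position vector rather than written by hand.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class FragmentSkipSketch {

    /** Hypothetical fragment descriptor: character offsets [start, end) into the field text. */
    public static final class Fragment {
        final int start;
        final int end;

        public Fragment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Reads only the given fragments from the text, skipping everything in between.
     * Assumes the fragments are sorted by start offset and do not overlap.
     */
    public static List<String> readFragments(String text, List<Fragment> fragments)
            throws IOException {
        List<String> out = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            int pos = 0; // current character position of the reader
            for (Fragment f : fragments) {
                // Skip the gap between the previous fragment and this one.
                long toSkip = f.start - pos;
                while (toSkip > 0) {
                    long skipped = reader.skip(toSkip);
                    if (skipped <= 0) {
                        break; // end of text reached
                    }
                    toSkip -= skipped;
                }
                if (toSkip > 0) {
                    break; // fragment starts beyond the end of the text
                }
                pos = f.start;

                // Read exactly the characters of this fragment (or fewer at end of text).
                char[] buf = new char[f.end - f.start];
                int filled = 0;
                while (filled < buf.length) {
                    int n = reader.read(buf, filled, buf.length - filled);
                    if (n < 0) {
                        break;
                    }
                    filled += n;
                }
                out.add(new String(buf, 0, filled));
                pos += filled;
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String text = "Lucene is a search library. Highlighting large fields can be slow.";
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(0, 6));   // hand-picked offsets, for illustration only
        fragments.add(new Fragment(28, 40));
        System.out.println(readFragments(text, fragments));
    }
}
```

The point of the sketch is only that the cost of producing the snippets depends on the fragments that are read, not on re-analyzing the whole field; choosing and scoring the fragments is a separate step that this example leaves out.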