Subject: Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter
From: Markus Jelsma
To: java-user@lucene.apache.org
Date: Thu, 4 Oct 2012 13:41:16 +0000

Hi,

I've modified the HyphenationCompoundWordTokenFilter to emit fewer subtokens, because the original filter can emit all kinds of subtokens that have a very different meaning on their own. My version emits no overlapping subtokens and no subtokens that are contained within another subtoken. It also requires that the generated subtokens together make up the complete original token; if they don't, the subtokens are discarded. Finally, it no longer returns the original token, whereas the original filter produces a duplicate of the original input token.

For example, verzekeringmaatschappij now becomes verzekering and maatschappij, and not verzekeringmaatschappij, ver, zeker, verzeker, zekering, ringmaat, maat and more.

But it seems that I have done something wrong, because my modified version sometimes causes the Highlighter to throw the following IOOBE:

java.lang.StringIndexOutOfBoundsException: String index out of range: -14
    at java.lang.String.substring(String.java:1937)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:172)
    at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:138)
    at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:186)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByFastVectorHighlighter(DefaultSolrHighlighter.java:571)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:214)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1750)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    .....

Can anyone point me in the right direction? I've checked the LIA book on how to manipulate the token stream and thought my changes should be all right. My analysis tests also yield good results, with nothing strange to be found. Or could it be an error in the highlighter that only now shows up?

Thanks,
Markus
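
For reference, the decomposition rule described in the message (non-overlapping subtokens that must together cover the whole original token, otherwise nothing is emitted) can be sketched in plain Java without any Lucene dependency. This is not the modified filter itself: the class CompoundSplitSketch, the Part type, the word-set lookup (the real filter works from hyphenation patterns) and the offsetsValid check are all hypothetical names and simplifications introduced here. The offset check is included only because String.substring throws this kind of exception when it is handed an invalid range, so it may be worth confirming that every subtoken's offsets stay inside the field text; that is an assumption about where to look, not a diagnosis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free sketch of the behaviour described in the message.
class CompoundSplitSketch {

    /** A subtoken plus its character offsets into the original field text. */
    static final class Part {
        final String term;
        final int startOffset;
        final int endOffset;

        Part(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Splits token into two or more non-overlapping dictionary words that
     * together cover the whole token. Returns an empty list if no such cover
     * exists. tokenStart is the token's start offset in the field text.
     */
    static List<Part> decompose(String token, int tokenStart, Set<String> dict) {
        List<Part> out = new ArrayList<>();
        return cover(token, 0, tokenStart, dict, out) ? out : new ArrayList<>();
    }

    // Longest match first, backtracking when the remainder cannot be covered.
    private static boolean cover(String token, int pos, int tokenStart,
                                 Set<String> dict, List<Part> out) {
        if (pos == token.length()) {
            return out.size() > 1; // a single part would just repeat the token
        }
        for (int end = token.length(); end > pos; end--) {
            String candidate = token.substring(pos, end);
            if (dict.contains(candidate)) {
                out.add(new Part(candidate, tokenStart + pos, tokenStart + end));
                if (cover(token, end, tokenStart, dict, out)) {
                    return true;
                }
                out.remove(out.size() - 1);
            }
        }
        return false;
    }

    /**
     * Assumed sanity rules: offsets lie inside the text, start <= end, and
     * start offsets never go backwards from one subtoken to the next.
     */
    static boolean offsetsValid(List<Part> parts, int textLength) {
        int previousStart = 0;
        for (Part p : parts) {
            if (p.startOffset < previousStart
                    || p.endOffset < p.startOffset
                    || p.endOffset > textLength) {
                return false;
            }
            previousStart = p.startOffset;
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "verzekering", "maatschappij", "ver", "zeker", "maat"));
        // Pretend the token starts at offset 10 in a 33-character field.
        for (Part p : decompose("verzekeringmaatschappij", 10, dict)) {
            System.out.println(p.term + " [" + p.startOffset + "," + p.endOffset + ")");
        }
    }
}
```

With the sample word set above, verzekeringmaatschappij starting at offset 10 yields verzekering at offsets 10 to 21 and maatschappij at offsets 21 to 33, while a token that cannot be fully covered yields no subtokens at all. A comparable assertion over the offsets the real filter produces could be added to the analysis tests.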