lucene-java-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Highlighter IOOBE with modified HyphenationCompoundWordTokenFilter
Date Thu, 04 Oct 2012 13:41:16 GMT
Hi,

I've modified the HyphenationCompoundWordTokenFilter to emit fewer subtokens, because the original
filter can emit all kinds of subtokens that have a very different meaning on their own. My
version emits no overlapping subtokens and no subtokens that are contained within another
subtoken. It also requires that the generated subtokens together make up the entire original
token; if they don't, the subtokens are discarded. Finally, it no longer returns the original
token, whereas the original filter produces a duplicate of the original input token. For
example: verzekeringmaatschappij now becomes verzekering and maatschappij, and not
verzekeringmaatschappij, ver, zeker, verzeker, zekering, ringmaat, maat and more.
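
To make the intended behaviour concrete, here is a minimal sketch of the decomposition rules
described above. It is not the modified filter itself: it uses a plain word set instead of the
hyphenation grammar, prefers the longest part at each position, and the class and method names
(CompoundSplitSketch, split, tile) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
class CompoundSplitSketch {

    /**
     * Returns dictionary parts that tile the word exactly, without overlap
     * and without the word itself. Returns an empty list when no such
     * tiling exists (the "forget the subtokens" case).
     */
    static List<String> split(String word, Set<String> dictionary) {
        List<String> parts = new ArrayList<String>();
        if (tile(word, 0, dictionary, parts)) {
            return parts;
        }
        return Collections.<String>emptyList();
    }

    // Backtracking search: try the longest candidate first at each position.
    private static boolean tile(String word, int start, Set<String> dict, List<String> parts) {
        if (start == word.length()) {
            return true; // the parts cover the whole word
        }
        for (int end = word.length(); end > start; end--) {
            if (start == 0 && end == word.length()) {
                continue; // never emit the original token itself
            }
            String candidate = word.substring(start, end);
            if (dict.contains(candidate)) {
                parts.add(candidate);
                if (tile(word, end, dict, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // dead end, backtrack
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList(
                "ver", "zeker", "verzeker", "zekering", "verzekering",
                "maat", "ringmaat", "maatschappij"));
        System.out.println(split("verzekeringmaatschappij", dict));
    }
}
```

With the small dictionary in main, the shorter candidates such as ver, zeker and ringmaat are
never emitted, because the two longest parts already cover the whole word.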

But it seems that I have done something wrong, because my modified version sometimes causes
the Highlighter to throw the following IOOBE:

java.lang.StringIndexOutOfBoundsException: String index out of range: -14
        at java.lang.String.substring(String.java:1937)
        at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.makeFragment(BaseFragmentsBuilder.java:172)
        at org.apache.lucene.search.vectorhighlight.BaseFragmentsBuilder.createFragments(BaseFragmentsBuilder.java:138)
        at org.apache.lucene.search.vectorhighlight.FastVectorHighlighter.getBestFragments(FastVectorHighlighter.java:186)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByFastVectorHighlighter(DefaultSolrHighlighter.java:571)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:136)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:214)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1750)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        .....

Can anyone point me in the right direction? I've checked the Lucene in Action (LIA) book on
how to manipulate the token stream and thought it should be all right. My analysis tests also
yield good results, with nothing strange to be found. Or could it be an error in the
highlighter that only now shows up?
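
One thing that may be worth checking, as a guess and not a confirmed cause: the negative index
in the trace could mean that substring was called with an end position before the start
position, which might happen if the emitted subtokens carry inconsistent offsets. The sketch
below is a plain-Java sanity check over (term, start, end) triples. The Token class and
findOffsetProblems are hypothetical helpers, not part of Lucene or Solr, and the three rules
checked are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper; not a Lucene or Solr API.
class OffsetCheckSketch {

    // Minimal stand-in for a token's term text and character offsets.
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns one message per offset problem found; empty if all looks sane. */
    static List<String> findOffsetProblems(String text, List<Token> tokens) {
        List<String> problems = new ArrayList<String>();
        int previousStart = 0;
        for (Token t : tokens) {
            if (t.start < 0 || t.end > text.length()) {
                problems.add(t.term + ": offsets fall outside the text");
            } else if (t.start > t.end) {
                problems.add(t.term + ": start offset is after end offset");
            }
            if (t.start < previousStart) {
                problems.add(t.term + ": start offset goes backwards");
            }
            previousStart = Math.max(previousStart, t.start);
        }
        return problems;
    }

    public static void main(String[] args) {
        String text = "verzekeringmaatschappij";
        List<Token> tokens = Arrays.asList(
                new Token("verzekering", 0, 11),
                new Token("maatschappij", 11, 23));
        System.out.println(findOffsetProblems(text, tokens));
    }
}
```

Feeding it the offsets the modified filter actually sets for a failing document should show
whether any subtoken ends before it starts or points outside the field value.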

Thanks,
Markus
