lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: ArrayIndexOutOfBoundsException: -65536
Date Wed, 15 Oct 2014 21:54:48 GMT
On Tue, Oct 14, 2014 at 1:29 AM, Trejkaz <trejkaz@trypticon.org> wrote:

> Bit of thread necromancy here, but I figured it was relevant because
> we get exactly the same error.

Wow, blast from the past ...

>> Is it possible you are indexing an absurdly enormous document...?
>
> We're seeing a case here where the document certainly could qualify as
> "absurdly enormous". The doc itself is 2GB in size and the
> tokenisation is per-character, not per-word, so the number of
> generated terms must be enormous. Probably enough to fill 2GB...
>
> So I'm wondering if there is more info somewhere on why this is (or
> was? We're still using 3.6.x) a limit and whether it can be detected
> up-front. Some large amount of indexing time (~30 minutes) could be
> avoided if we can detect that it would have failed ahead of time.

The limit is still there; it's because Lucene uses an int internally
to address its memory buffer.
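
As a rough illustration only (this is not Lucene's actual buffer code, and
the class and method names are hypothetical): a Java int tops out at
2^31 - 1, so an offset that grows past that silently wraps to a negative
value, which is the kind of arithmetic that can surface as a negative
array index.

```java
// Hypothetical illustration; not taken from Lucene's source.
class IntOffsetOverflow {
    // Plain int addition: silently wraps once the sum passes Integer.MAX_VALUE.
    static int advance(int offset, int length) {
        return offset + length;
    }

    // Same addition, but throws ArithmeticException instead of wrapping.
    static int advanceChecked(int offset, int length) {
        return Math.addExact(offset, length);
    }

    public static void main(String[] args) {
        int nearLimit = Integer.MAX_VALUE - 10;
        // The wrapped result is negative, so it is useless as an array index.
        System.out.println(advance(nearLimit, 100));
    }
}
```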

It's probably easiest to set a limit on the maximum size of document you
will index?  Or, use LimitTokenCountFilter (available in newer releases) to
only index the first N tokens...
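
A minimal sketch of the size-limit idea, so an oversized document is
rejected before any indexing time is spent. Everything here is an
assumption for illustration: the class name, the one-token-per-character
estimate (matching the per-character tokenisation described above), and
above all the threshold, which is an arbitrary placeholder and not a
number Lucene documents.

```java
// Hypothetical pre-check; names and the limit are placeholders, not Lucene API.
class DocSizeGuard {
    // Arbitrary example ceiling; tune it against your own heap and failures.
    static final long MAX_TOKENS = 100_000_000L;

    // Estimate the token count for per-character tokenisation:
    // roughly one token per character, given the encoding's bytes per character.
    static long estimateTokens(long sizeBytes, int bytesPerChar) {
        if (bytesPerChar <= 0) {
            throw new IllegalArgumentException("bytesPerChar must be positive");
        }
        return sizeBytes / bytesPerChar;
    }

    // True when the estimate exceeds the ceiling and the document should be skipped.
    static boolean tooLarge(long sizeBytes, int bytesPerChar, long maxTokens) {
        return estimateTokens(sizeBytes, bytesPerChar) > maxTokens;
    }

    public static void main(String[] args) {
        long twoGigabytes = 2L * 1024 * 1024 * 1024;
        // A 2GB single-byte-encoded document is far over the example ceiling.
        System.out.println(tooLarge(twoGigabytes, 1, MAX_TOKENS));
    }
}
```

The check only needs the file length (for example from java.io.File.length()),
so it runs in effectively no time compared with a failed indexing pass.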

Mike McCandless

http://blog.mikemccandless.com
