lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: limitation on token-length for KeywordAnalyzer?
Date Tue, 28 Jan 2014 06:53:07 GMT
On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl <ml@3.141592654.de> wrote:
> Is there some limitation on the length of fields? How do I get around this?
[cut]
> My overall goal is to index (arbitrary sized) text files and run a regular
> expression search using lucene's RegexpQuery. I suspect the
> KeywordAnalyzer to cause the inconsistent behaviour - is this the right
> analyzer to use for a RegexpQuery?

The limit is most likely the one in DocumentsWriter, where
MAX_TERM_LENGTH == 16383. The documentation for addDocument() says it
throws an error when this limit is exceeded.
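
Roughly something like this is how you'd hit it (untested sketch, assuming
Lucene 4.x class names; whether the oversized term is rejected with an
exception or silently dropped depends on the version):

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ImmenseTermDemo {
    public static void main(String[] args) throws Exception {
        // KeywordAnalyzer emits the entire field value as a single token,
        // so a large value becomes one term that can exceed the maximum
        // term length in the indexer.
        char[] big = new char[20000];
        java.util.Arrays.fill(big, 'x');
        String hugeValue = new String(big);

        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(), cfg)) {
            Document doc = new Document();
            doc.add(new TextField("content", hugeValue, Field.Store.NO));
            // Depending on the version, this either throws or the term is
            // skipped, which is where the inconsistent behaviour comes from.
            writer.addDocument(doc);
        }
    }
}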

What we do for RegexpQuery is to tokenise the text anyway and document
the caveat that the regular expression only matches individual tokens,
not the text as a whole. KeywordAnalyzer would probably give better
results, but you're going to hit this limitation once the text grows
past a certain size.
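
In other words (field name and pattern here are just made up for the
example), RegexpQuery is a term-level query, so against a tokenised field
the pattern has to match a single token:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

// Matches documents containing a *token* matching the pattern; it never
// matches across token boundaries, so "quick.*fox" will not match a
// document that was tokenised into "quick" and "fox".
Query q = new RegexpQuery(new Term("content", "quick.*fox"));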

Going in the other direction, if you tokenise the text character by
character, you might be able to write a regular-expression engine that
uses span queries to match the expression against the terms. I don't
know how that would perform, but ever since writing a per-character
tokeniser I have been wondering whether it would be a decent way to do
it. A rough sketch of the idea is below.
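
Something like this (again untested, and it only handles a literal
fragment; a real regex engine would have to compose spans for
alternation, repetition and so on), assuming each character was indexed
as its own term at consecutive positions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Matching the literal "cat" over single-character terms becomes an
// ordered span with zero slop between the characters.
SpanQuery[] chars = {
    new SpanTermQuery(new Term("content", "c")),
    new SpanTermQuery(new Term("content", "a")),
    new SpanTermQuery(new Term("content", "t"))
};
Query literalCat = new SpanNearQuery(chars, 0, true); // slop 0, in order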

TX

