lucene-java-user mailing list archives

From: Andreas Brandl <...@3.141592654.de>
Subject: Re: limitation on token-length for KeywordAnalyzer?
Date: Tue, 28 Jan 2014 11:49:05 GMT
Hi,

----- Original Message -----
> On Mon, Jan 27, 2014 at 3:48 AM, Andreas Brandl <ml@3.141592654.de>
> wrote:
> > Is there some limitation on the length of fields? How do I get
> > around this?
> [cut]
> > My overall goal is to index (arbitrary sized) text files and run a
> > regular
> > expression search using lucene's RegexpQuery. I suspect the
> > KeywordAnalyzer to cause the inconsistent behaviour - is this the
> > right
> > analyzer to use for a RegexpQuery?
> 
> The limit is most likely that one in DocumentsWriter, where
> MAX_TERM_LENGTH == 16383. addDocument() says it throws an error when
> this limit is exceeded.

Thanks, that's it. Although that seems to have changed in newer Lucene versions (4.6):

IndexWriter#MAX_TERM_LENGTH:
Absolute hard maximum length for a term. If a term arrives from the analyzer longer than this
length, it is skipped and a message is printed to infoStream, if set (see setInfoStream(java.io.PrintStream)).

For me, MAX_TERM_LENGTH is 32766, and this is the warning I get:
IW 0 [Tue Jan 28 12:07:30 CET 2014; main]: WARNING: document contains at least one immense term
(whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please
correct the analyzer to not produce such terms. The prefix of the first immense term is: ...

So that makes perfect sense.
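
For anyone hitting the same thing, a minimal way to trigger that warning looks roughly like this
(untested sketch against the 4.6 API; the field name "content" is just an example):

    import java.io.IOException;

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class ImmenseTermDemo {
        public static void main(String[] args) throws IOException {
            // KeywordAnalyzer emits the whole field value as a single token.
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());
            iwc.setInfoStream(System.out); // the "immense term" warning is printed here

            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, iwc);

            // Build a value longer than IndexWriter.MAX_TERM_LENGTH (32766 UTF-8 bytes).
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < IndexWriter.MAX_TERM_LENGTH + 10; i++) {
                sb.append('a');
            }

            Document doc = new Document();
            doc.add(new TextField("content", sb.toString(), Field.Store.NO));
            writer.addDocument(doc); // the immense term is skipped, warning goes to the infoStream
            writer.close();
            dir.close();
        }
    }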

> 
> What we do for RegexpQuery is that we still tokenise the text, but we
> explain the caveat that the regular expression will match individual
> tokens. I think KeywordAnalyzer will probably give better results,
> but
> you're going to hit this limitation past a certain size.
> 
> Going in the other direction, if you tokenise the text
> character-by-character, you might be able to write a regular
> expression engine which uses span queries to match the regular
> expression to the terms. I don't know how that would perform, but
> ever
> since writing a per-character tokeniser, I have been wondering if it
> would be a decent way to do it.
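
For what it's worth, that matches my understanding of RegexpQuery on a tokenised field: the
expression has to match a single term, not the whole document. Roughly (untested sketch; the
field name "content" and the helper class are just examples):

    import java.io.IOException;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.store.Directory;

    public class TokenRegexpSearch {
        // Counts documents containing at least one term of the "content" field
        // that matches the regex; the expression is applied per term, not per document.
        public static int countMatches(Directory dir, String regex) throws IOException {
            DirectoryReader reader = DirectoryReader.open(dir);
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query q = new RegexpQuery(new Term("content", regex));
                return searcher.search(q, 1).totalHits;
            } finally {
                reader.close();
            }
        }
    }

So with standard tokenisation, "foo.*bar" only hits documents where some single token looks like
foo...bar, which is exactly the caveat you describe.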

I've written an indexed regex search engine using Lucene and trigram tokens, based on the idea
in [1]. I'm still looking into performance, but so far it seems very good; it even outperforms an
in-memory implementation (all documents in memory, sequential scan using Pattern#matches) in
cases where the regex matching itself is quite costly. If you're curious, I can provide source
code and an evaluation in a couple of weeks (it is part of my master's thesis).
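
To give an idea of what I mean by trigram tokens, the index side can be as simple as this
(minimal untested sketch; the class name is just an example):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.util.Version;

    public final class TrigramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Emit every overlapping 3-gram of the input as its own term.
            Tokenizer trigrams = new NGramTokenizer(Version.LUCENE_46, reader, 3, 3);
            return new TokenStreamComponents(trigrams);
        }
    }

The query side follows [1]: the trigrams any match must contain are extracted from the regex,
combined into a BooleanQuery of TermQuery clauses as a prefilter, and the surviving candidate
documents are then verified with java.util.regex.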

That is why I'm curious to see how Lucene's AutomatonQuery implementation performs compared to
the trigram solution. Though with the above term-length limit in mind, I guess the two cannot
really be compared.

Do you know of any other ways to do efficient regex search, using Lucene or any other method?

Thanks,

Best Regards
Andreas

[1] http://swtch.com/~rsc/regexp/regexp4.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

