lucene-general mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: MaxFieldLength in Lucene 3.4
Date Thu, 01 Dec 2011 08:32:58 GMT
Hi,

This option is a safety measure for cases where you cannot trust your input
data. Maybe you suddenly tokenize a binary file and produce millions of random
tokens; with the limit in place, only the first 10,000 or so are indexed. If
your input data is trusted and text-based (e.g. read from elements in XML
files, databases, ...), then you don't need this filter.

> Maybe I am too far behind the times.  I was updating some pretty old stuff.
> I think it was written originally with Lucene 1.4.  I seem to recall that
> Lucene v1.x had analyzers where the default was "limited", because I
> learned pretty early that I had to set that option during indexing.
> Perhaps at some point the switch was made to default unlimited.
The limiting option was almost always on IndexWriter, and it defaulted to
10,000 tokens from the beginning. The analyzers had nothing to do with this
option.

The recent change removed the token counting from IndexWriter (it only made
the already complicated code less readable) and moved it into a simple
TokenFilter, because it is much more reasonable to do this during analysis.
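The pattern described here can be sketched without Lucene at all: the limiting
logic wraps whatever produces tokens and simply stops forwarding them once a
cap is reached, so the indexer never needs to count anything itself. The class
and method names below are illustrative only, not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of the decorator idea behind token-count limiting:
// wrap any token source and cut it off after maxTokens. In real Lucene
// code this role is played by LimitTokenCountFilter / LimitTokenCountAnalyzer.
public class LimitTokenCountDemo {

    /** Wraps any token stream (modeled here as an Iterator) and stops after maxTokens. */
    static Iterator<String> limit(final Iterator<String> in, final int maxTokens) {
        return new Iterator<String>() {
            int emitted = 0;

            public boolean hasNext() {
                // Stop as soon as the cap is reached, even if more tokens exist.
                return emitted < maxTokens && in.hasNext();
            }

            public String next() {
                emitted++;
                return in.next();
            }

            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("one", "two", "three", "four", "five");
        Iterator<String> limited = limit(tokens.iterator(), 3);
        List<String> out = new ArrayList<>();
        limited.forEachRemaining(out::add);
        System.out.println(out); // prints [one, two, three]
    }
}
```

Because the wrapper sits in the analysis chain, the same cap applies no matter
which concrete analyzer produced the tokens, which is exactly why the counting
could be removed from IndexWriter.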

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Joe MA [mailto:mrjama@comcast.net]
> Sent: Thursday, December 01, 2011 9:24 AM
> To: general@lucene.apache.org
> Subject: RE: MaxFieldLength in Lucene 3.4
> 
> 
> > "of course all other analyzers are unlimited"
> 
> Maybe I am too far behind the times.  I was updating some pretty old stuff.
> I think it was written originally with Lucene 1.4.  I seem to recall that
> Lucene v1.x had analyzers where the default was "limited", because I
> learned pretty early that I had to set that option during indexing.
> Perhaps at some point the switch was made to default unlimited. Thanks,
> your answer clears it up.
> 
> One question - why even have this option now? Are things more efficient
> with a limited token field?  If you know your data is 'bounded', should
> you always limit the token field to improve performance?
> 
> Thanks!
> 
> 
> -----Original Message-----
> From: Uwe Schindler [mailto:uwe@thetaphi.de]
> Sent: Monday, November 28, 2011 2:41 AM
> To: general@lucene.apache.org
> Subject: RE: MaxFieldLength in Lucene 3.4
> 
> Hi,
> 
> The move is simple - LimitTokenCountAnalyzer is just a wrapper around any
> other Analyzer, so I don't really understand your question - of course all
> other analyzers are unlimited. If you used myAnalyzer with
> myMaxFieldLengthValue before, you can change your code as follows:
> 
> Before:
> new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34,
>   myAnalyzer).setFoo().setBar().setMaxFieldLength(myMaxFieldLengthValue));
> 
> After:
> new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_34,
>   new LimitTokenCountAnalyzer(myAnalyzer, myMaxFieldLengthValue))
>   .setFoo().setBar());
> 
> You only have to do this on the indexing side; on the query side
> (QueryParser), just use myAnalyzer without wrapping. With the new code,
> the responsibility for cutting off the field after a specific number of
> tokens was moved out of the indexing code in Lucene. This is now an
> analysis feature, not an indexing feature anymore.
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: Joe MA [mailto:mrjama@comcast.net]
> > Sent: Monday, November 28, 2011 8:09 AM
> > To: general@lucene.apache.org
> > Subject: MaxFieldLength in Lucene 3.4
> >
> > While upgrading to Lucene 3.4, I noticed the MaxFieldLength values on
> > the indexers are deprecated.  There appears to be a
> > LimitTokenCountAnalyzer that limits the tokens - so does that mean the
> > default for all other analyzers is unlimited?
> >
> > Thanks in advance -
> > JM


