lucene-java-user mailing list archives

From Peter Becker <pbec...@dstc.edu.au>
Subject Re: Indexing very large sets (10 million docs)
Date Mon, 28 Jul 2003 21:49:03 GMT
Roger Ford wrote:
[...index size troubles...]

> Believe it or not, this 10 million documents was meant to be a single
> partition of a much larger dataset. I'm not sure I'm at liberty to
> discuss in detail the data I'm indexing - but it's a massive
> genealogical database.

Roger,

Maybe the nature of your data is the problem. Did you check what kind of
terms you get? (You can use Luke, http://www.getopt.org/luke/, for that.)
I can imagine that tokenizing goes wrong and creates far more distinct
terms than you expect. Each term might also occur in a large share of the
documents. Both would increase the index size, at least that is what I
would expect with my limited knowledge of the field :-)
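
To make that check concrete, here is a minimal sketch in plain Java. It
makes no Lucene calls and does not reproduce what Luke does internally; it
only illustrates the two numbers worth looking at: how many distinct terms
the tokenizer produces, and in how many documents each term occurs. The
class TermStats, its tokenize method (a stand-in for whatever analyzer is
really in use), and the sample records are all invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TermStats {

    // Hypothetical stand-in for an analyzer: lowercase the text and
    // split on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // For each term, count the number of documents that contain it
    // at least once (its document frequency).
    static Map<String, Integer> documentFrequencies(List<String> docs) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String doc : docs) {
            Set<String> termsInDoc = new HashSet<>(tokenize(doc));
            for (String term : termsInDoc) {
                frequencies.merge(term, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Invented sample records, not real data.
        List<String> docs = Arrays.asList(
                "Smith, John b.1843 Leeds",
                "Smith, Mary b.1850 Leeds",
                "Smyth, Jon b.1843 York");

        Map<String, Integer> frequencies = documentFrequencies(docs);
        System.out.println("distinct terms: " + frequencies.size());
        frequencies.forEach((term, count) ->
                System.out.println(term + " occurs in " + count + " of " + docs.size() + " docs"));
    }
}
```

If the distinct-term count grows almost as fast as the document count, or
a handful of terms show up in nearly every document, the tokenizer is
probably the first thing to look at.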

  Peter




