lucene-general mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: problems with large Lucene index
Date Mon, 09 Mar 2009 16:23:12 GMT
lucene@digiatlas.org wrote:
> Thanks Michael,
> 
> There is no sorting on the result (adding a sort causes OOM well before 
> the point it runs out for the default).
> 
> There are no deleted docs - the index was created from a set of docs and 
> no adds or deletes have taken place.
> 
> Memory isn't being consumed elsewhere in the system. It all comes down 
> to the Lucene call via Hibernate Search. We decided to split our huge 
> index into a set of several smaller indexes. Like the original single 
> index, each smaller index has one field which is tokenized and the other 
> fields have NO_NORMS set.
> 
> The following, explicitly specifying just one index, works fine:
> 
> org.hibernate.search.FullTextQuery fullTextQuery = 
> fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class );
> 
> But as soon as we start adding further indexes:
> 
> org.hibernate.search.FullTextQuery fullTextQuery = 
> fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class, 
> MarcText8.class );
> 
> We start running into OOM.
> 
> In our case the MarcText2 index has a total disk size of 5 GB (with 
> 57589069 documents / 75491779 terms) and MarcText8 has a total size of 
> 6.46 GB (with 79339982 documents / 104943977 terms).
> 
> Adding all 8 indexes (the same as our original single index), either by 
> explicitly naming them or just with:
> 
> org.hibernate.search.FullTextQuery fullTextQuery = 
> fullTextSession.createFullTextQuery( outerLuceneQuery);
> 
> results in it becoming completely unusable.
> 
> 
> One thing I am not sure about is that in Luke it tells me for an index 
> (neither of the indexes mentioned above) that was created with NO_NORMS 
> set on all the fields:
> 
> "Index functionality: lock-less, single norms, shared doc store, 
> checksum, del count, omitTf"
> 
> Is this correct?  I am not sure what it means by "single norms" - I 
> would have expected it to say "no norms".

This is just expert-level information about the capabilities of the index 
format; it doesn't say anything about the actual flags on fields.

> Any further ideas on where to go from here? Your estimate of what is 
> loaded into memory suggests that we shouldn't really be anywhere near 
> running out of memory with these size indexes!
> 
> As I said in my OP, Luke also gets a heap error on searching our 
> original single large index which makes me wonder if it is a problem 
> with the construction of the index.

In the open index dialog in Luke, set the "Custom term infos divisor" 
to a value higher than 1 and try to open the index again. If this still 
doesn't work, make a copy of the index and open that copy in Luke, but 
with the option "Don't open IndexReader" selected, and then run 
CheckIndex from the menu. PLEASE DO THIS ON A COPY OF THE INDEX.
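
To give a rough feel for why a higher divisor helps: with Lucene's 
default term index interval of 128, every 128th term is held in memory, 
and a divisor of N means only every (N * 128)th term is loaded. The 
sketch below just does that arithmetic for the two term counts quoted 
above. The class and method names are made up for illustration, the 
counts are approximate, and it says nothing about the bytes each loaded 
term actually costs.

```java
// Back-of-envelope sketch, not Lucene code: estimates how many terms
// the in-memory term index holds for a given term infos index divisor.
final class TermIndexEstimate {
    // Lucene's default term index interval (every 128th term is indexed).
    static final int DEFAULT_INDEX_INTERVAL = 128;

    /** Approximate number of index terms kept in memory (ceiling division). */
    static long loadedTerms(long totalTerms, int indexInterval, int divisor) {
        long step = (long) indexInterval * divisor;
        return (totalTerms + step - 1) / step;
    }

    public static void main(String[] args) {
        // Term counts reported for MarcText2 and MarcText8 in the quoted message.
        long[] termCounts = {75491779L, 104943977L};
        for (int divisor : new int[] {1, 2, 4, 8}) {
            long total = 0;
            for (long terms : termCounts) {
                total += loadedTerms(terms, DEFAULT_INDEX_INTERVAL, divisor);
            }
            System.out.println("divisor=" + divisor + ": about " + total
                    + " index terms in memory");
        }
    }
}
```

Doubling the divisor roughly halves the number of terms kept in memory, 
at the cost of slower term lookups.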

Oh, and of course you could start by increasing the heap size when 
running Luke, but I think that's obvious.
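
If in doubt whether a larger heap setting (the standard JVM option is 
-Xmx) is actually being picked up, a tiny check like the following can 
be run with the same JVM options. This is only a sketch using 
java.lang.Runtime; the class name is made up for illustration.

```java
// Minimal sketch: report the heap limits the current JVM is running with.
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will try to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use (allocated minus free), in MiB. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```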


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

