lucene-general mailing list archives

From: luc...@digiatlas.org
Subject: Re: problems with large Lucene index
Date: Wed, 11 Mar 2009 13:20:02 GMT
Thanks for the advice.

I haven't got around to profiling the code. Instead, I took your
advice and knocked Hibernate out of the equation with a small
stand-alone program that calls Lucene directly. I then wrote a similar
stand-alone program that uses Hibernate Search to do the same thing.

On a small index both work fine:

E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
hits = 29410

E:\hibtest>java -Xmx1200m -classpath  
.;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar
 
hibtest
size = 29410


Trying it on our huge index works for the straight Lucene version:

E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
hits = 320500


but fails for the Hibernate version:

E:\hibtest>java -Xmx1200m -classpath  
.;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar
 
hibtest
Exception in thread "main" java.lang.OutOfMemoryError
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:907)
        at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:352)
        at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
        at org.apache.lucene.search.Searcher.search(Searcher.java:136)
        at org.hibernate.search.query.QueryHits.updateTopDocs(QueryHits.java:100)
        at org.hibernate.search.query.QueryHits.<init>(QueryHits.java:61)
        at org.hibernate.search.query.FullTextQueryImpl.getQueryHits(FullTextQueryImpl.java:354)
        at org.hibernate.search.query.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:741)
        at hibtest.main(hibtest.java:45)

E:\hibtest>


I am not sure why this is occurring. Any ideas? I am calling
IndexSearcher.search() and so is Hibernate. Is Hibernate Search
telling Lucene to try to read the entire index into memory?
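
Just to sanity-check the numbers from Mike's earlier reply, here is the
back-of-envelope arithmetic as a throwaway snippet (the 286 million
document count is from my original post; the one byte of norms per
document per normed field per open reader is my understanding of how
Lucene holds norms, and "NormsEstimate" is just a scratch class name):

public class NormsEstimate
{
    public static void main(String[] args)
    {
        long docs = 286000000L;   // documents in the original single index
        int normedFields = 1;     // only "value" is indexed with norms; the rest use NO_NORMS
        long bytes = docs * normedFields;   // ~1 byte of norms per document per normed field
        System.out.println("norms ~= " + (bytes / 1024.0 / 1024.0) + " MB per open reader");
    }
}

That works out to roughly 273 MB, which matches Mike's estimate and
ought to fit comfortably inside -Xmx1200m.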


Code for the Lucene version is:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class lucky
{
    public static void main(String[] args)
    {
        try
        {
            // The same term query that the Hibernate Search version runs.
            Term term = new Term("value", "church");
            Query query = new TermQuery(term);

            // Open a searcher directly on the index directory.
            IndexSearcher searcher = new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
            Hits hits = searcher.search(query);

            System.out.println("hits = " + hits.length());

            searcher.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

and for the Hibernate Search version:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class hibtest {

    public static void main(String[] args) {
        hibtest mgr = new hibtest();

        // HibernateUtil is our usual SessionFactory helper; MarcText is the mapped entity.
        Session session = HibernateUtil.getSessionFactory().getCurrentSession();
        session.beginTransaction();

        // Wrap the ORM session and run the same term query as the plain Lucene version.
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        TermQuery luceneQuery = new TermQuery(new Term("value", "church"));

        org.hibernate.search.FullTextQuery fullTextQuery =
            fullTextSession.createFullTextQuery(luceneQuery, MarcText.class);

        long resultSize = fullTextQuery.getResultSize(); // this is line 45 in the stack trace above

        System.out.println("size = " + resultSize);

        session.getTransaction().commit();

        HibernateUtil.getSessionFactory().close();
    }

}
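
If it would help narrow things down further, something along these
lines could force the norms load directly from plain Lucene, outside of
any search (an untested sketch; "normcheck" is just a throwaway name
and the path is the same one used above):

import org.apache.lucene.index.IndexReader;

public class normcheck
{
    public static void main(String[] args)
    {
        try
        {
            // Open the index and ask for the norms of the one tokenized field.
            // As I understand it, this allocates the same contiguous byte[]
            // (one byte per document) that the search path needs.
            IndexReader reader = IndexReader.open("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
            byte[] norms = reader.norms("value");
            System.out.println("norms length = " + (norms == null ? 0 : norms.length));
            reader.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}

If that allocation succeeds on its own, then whatever extra memory the
Hibernate Search path needs must be going somewhere other than the
norms array.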



Quoting Michael McCandless <lucene@mikemccandless.com>:

>
> At this point, I'd recommend running with a memory profiler, eg
> YourKit, and posting the resulting output.
>
> With norms only on one field, no deletions, and no field sorting, I
> can't see why you're running out of memory.
>
> If you take Hibernate out of the picture, and simply open an
> IndexSearcher on the underlying index, do you still hit OOM?
>
> Can you post the output of CheckIndex?  You can run it from the command line:
>
>   java org.apache.lucene.index.CheckIndex <pathToIndex>
>
> (Without -fix, CheckIndex will make no changes to the index, but it's
> best to do this on a copy of the index to be supremely safe).
>
> Mike
>
> lucene@digiatlas.org wrote:
>
>> Thanks Michael,
>>
>> There is no sorting on the result (adding a sort causes OOM well   
>> before the point it runs out for the default).
>>
>> There are no deleted docs - the index was created from a set of   
>> docs and no adds or deletes have taken place.
>>
>> Memory isn't being consumed elsewhere in the system. It all comes   
>> down to the Lucene call via Hibernate Search. We decided to split   
>> our huge index into a set of several smaller indexes. Like the   
>> original single index, each smaller index has one field which is   
>> tokenized and the other fields have NO_NORMS set.
>>
>> The following, explicitly specifying just one index, works fine:
>>
>> org.hibernate.search.FullTextQuery fullTextQuery =   
>> fullTextSession.createFullTextQuery( outerLuceneQuery,   
>> MarcText2.class );
>>
>> But as soon as we start adding further indexes:
>>
>> org.hibernate.search.FullTextQuery fullTextQuery =   
>> fullTextSession.createFullTextQuery( outerLuceneQuery,   
>> MarcText2.class, MarcText8.class );
>>
>> We start running into OOM.
>>
>> In our case the MarcText2 index has a total disk size of 5 GB (with
>> 57589069 documents / 75491779 terms) and MarcText8 has a total size
>> of 6.46 GB (with 79339982 documents / 104943977 terms).
>>
>> Adding all 8 indexes (the same as our original single index),   
>> either by explicitly naming them or just with:
>>
>> org.hibernate.search.FullTextQuery fullTextQuery =   
>> fullTextSession.createFullTextQuery( outerLuceneQuery);
>>
>> results in it becoming completely unusable.
>>
>>
>> One thing I am not sure about: for an index (neither of the indexes
>> mentioned above) that was created with NO_NORMS set on all the
>> fields, Luke tells me:
>>
>> "Index functionality: lock-less, single norms, shared doc store,   
>> checksum, del count, omitTf"
>>
>> Is this correct?  I am not sure what it means by "single norms" - I  
>>  would have expected it to say "no norms".
>>
>>
>> Any further ideas on where to go from here? Your estimate of what
>> is loaded into memory suggests that we shouldn't really be anywhere
>> near running out of memory with indexes of this size!
>>
>> As I said in my OP, Luke also gets a heap error on searching our
>> original single large index, which makes me wonder if it is a
>> problem with the construction of the index.
>>
>>
>>
>> Quoting Michael McCandless <lucene@mikemccandless.com>:
>>
>>>
>>> Lucene is trying to allocate the contiguous norms array for your index,
>>> which should be ~273 MB (286,000,000 bytes / 1024 / 1024), when it hits the OOM.
>>>
>>> Is your search sorting by field value?  (Which'd also consume memory.)
>>> Or is it just the default (by relevance) sort?
>>>
>>> The only other biggish consumer of memory should be the deleted docs,
>>> but that's a BitVector so it should need ~34 MB RAM.
>>>
>>> Can you run a memory profiler to see what else is consuming RAM?
>>>
>>> Mike
>>>
>>> lucene@digiatlas.org wrote:
>>>
>>>> Hello,
>>>>
>>>> I am using Lucene via Hibernate Search, but the following problem
>>>> is also seen using Luke. I'd appreciate any suggestions for
>>>> solving this problem.
>>>>
>>>> I have a Lucene index (27 GB in size) that indexes a database
>>>> table of 286 million rows. While Lucene was able to perform this
>>>> indexing just fine (albeit very slowly), using the index has
>>>> proved to be impossible. Any searches conducted on it, either
>>>> from my Hibernate Search query or by placing the query into Luke,
>>>> give:
>>>>
>>>> java.lang.OutOfMemoryError: Java heap space
>>>> at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
>>>> at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>>> at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
>>>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>>> ...
>>>>
>>>>
>>>> The type of queries are simple, of the form:
>>>>
>>>> (+value:church +marcField:245 +subField:a)
>>>>
>>>> which in this example should only return a few thousand results.
>>>>
>>>>
>>>> The JVM is already running with the maximum heap space allowed
>>>> for the Java executable on Windows XP (java -Xms1200m -Xmx1200m).
>>>>
>>>>
>>>> The Lucene index was created using the following Hibernate Search  
>>>>   annotations:
>>>>
>>>> @Column
>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>> store=Store.NO)
>>>> private Integer marcField;
>>>>
>>>> @Column (length = 2)
>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>> store=Store.NO)
>>>> private String subField;
>>>>
>>>> @Column(length = 2)
>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>> store=Store.NO)
>>>> private String indicator1;
>>>>
>>>> @Column(length = 2)
>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>> store=Store.NO)
>>>> private String indicator2;
>>>>
>>>> @Column(length = 10000)
>>>> @Field(index=org.hibernate.search.annotations.Index.TOKENIZED,    
>>>> store=Store.NO)
>>>> private String value;
>>>>
>>>> @Column
>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>> store=Store.NO)
>>>> private Integer recordId;
>>>>
>>>>
>>>> So all of the fields have NO_NORMS except for "value", which
>>>> contains description text that needs to be tokenised.
>>>>
>>>> Is there any way around this?  Does Lucene really have such a low  
>>>>   limit for how much data it can search (and I consider 286   
>>>> million  documents to be pretty small beer - we were hoping to   
>>>> index a table  of over a billion rows)? Or is there something I'm  
>>>>  missing?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>
>>
>>



