lucene-general mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: problems with large Lucene index
Date Wed, 11 Mar 2009 15:31:51 GMT

Unfortunately, I'm not familiar with exactly what Hibernate Search
does with the Lucene APIs.

It must be doing something beyond what your standalone Lucene test  
case does.

Maybe ask this question on the Hibernate list?
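
If you want to narrow it down further without Hibernate, one experiment
(just a sketch -- adjust the path to your actual index) is to mimic the
reader stacking that shows up in your trace, i.e. search through a
MultiReader wrapped around the index instead of opening the IndexSearcher
on the path directly:

    // hypothetical variant of your standalone lucky test
    IndexReader reader = IndexReader.open("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
    IndexSearcher searcher = new IndexSearcher(new MultiReader(new IndexReader[] { reader }));
    Hits hits = searcher.search(new TermQuery(new Term("value", "church")));
    System.out.println("hits = " + hits.length());

If that version also hits OOM, the difference is in how the readers are
stacked rather than anything Hibernate-specific; if it doesn't, the extra
memory is probably being consumed elsewhere in the Hibernate process.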

Mike

lucene@digiatlas.org wrote:

> Thanks for the advice.
>
> I haven't got around to profiling the code. Instead, I took your
> advice and knocked Hibernate out of the equation with a small
> stand-alone program that calls Lucene directly. I then wrote a similar
> stand-alone program using Hibernate Search to do the same thing.
>
> On a small index both work fine:
>
> E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
> hits = 29410
>
> E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
> size = 29410
>
>
> Trying it on our huge index works for the straight Lucene version:
>
> E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
> hits = 320500
>
>
> but fails for the Hibernate version:
>
> E:\hibtest>java -Xmx1200m -classpath .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar hibtest
> Exception in thread "main" java.lang.OutOfMemoryError
>        at java.io.RandomAccessFile.readBytes(Native Method)
>        at java.io.RandomAccessFile.read(Unknown Source)
>        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
>        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
>        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
>        at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:907)
>        at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:352)
>        at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>        at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>        at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>        at org.hibernate.search.query.QueryHits.updateTopDocs(QueryHits.java:100)
>        at org.hibernate.search.query.QueryHits.<init>(QueryHits.java:61)
>        at org.hibernate.search.query.FullTextQueryImpl.getQueryHits(FullTextQueryImpl.java:354)
>        at org.hibernate.search.query.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:741)
>        at hibtest.main(hibtest.java:45)
>
> E:\hibtest>
>
>
> I am not sure why this is occurring. Any ideas? I am calling
> IndexSearcher.search() and so is Hibernate. Is Hibernate Search
> telling Lucene to read the entire index into memory?
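>
> If I am reading the trace right, those readBytes() calls are
> SegmentReader.norms() filling in the norms array -- one byte per
> document for each searched field that has norms. A rough
> back-of-envelope for our 286 million rows (my own numbers, not
> profiler output):
>
>     long docs = 286000000L;        // rows indexed
>     long normsBytes = docs * 1L;   // one norm byte per doc for the "value" field
>     System.out.println((normsBytes / (1024 * 1024)) + " MB");  // prints 272, i.e. the ~273 MB you estimated
>
> That's a big chunk of a 1200 MB heap before anything else is loaded.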
>
>
> Code for the Lucene version is:
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
>
> public class lucky
> {
>     public static void main(String[] args)
>     {
>         try
>         {
>             Term term = new Term("value", "church");
>             Query query = new TermQuery(term);
>             IndexSearcher searcher = new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
>             Hits hits = searcher.search(query);
>
>             System.out.println("hits = " + hits.length());
>
>             searcher.close();
>         }
>         catch (Exception e)
>         {
>             e.printStackTrace();
>         }
>     }
> }
>
> and for the Hibernate Search version:
>
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.TermQuery;
> import org.hibernate.Session;
> import org.hibernate.search.FullTextSession;
> import org.hibernate.search.Search;
>
> public class hibtest {
>
>     public static void main(String[] args) {
>         Session session = HibernateUtil.getSessionFactory().getCurrentSession();
>
>         session.beginTransaction();
>
>         FullTextSession fullTextSession = Search.getFullTextSession(session);
>         TermQuery luceneQuery = new TermQuery(new Term("value", "church"));
>
>         org.hibernate.search.FullTextQuery fullTextQuery =
>             fullTextSession.createFullTextQuery(luceneQuery, MarcText.class);
>
>         long resultSize = fullTextQuery.getResultSize(); // this is line 45
>
>         System.out.println("size = " + resultSize);
>
>         session.getTransaction().commit();
>
>         HibernateUtil.getSessionFactory().close();
>     }
> }
>
>
>
> Quoting Michael McCandless <lucene@mikemccandless.com>:
>
>>
>> At this point, I'd recommend running with a memory profiler, e.g.
>> YourKit, and posting the resulting output.
>>
>> With norms only on one field, no deletions, and no field sorting, I
>> can't see why you're running out of memory.
>>
>> If you take Hibernate out of the picture, and simply open an
>> IndexSearcher on the underlying index, do you still hit OOM?
>>
>> Can you post the output of CheckIndex?  You can run it from the  
>> command line:
>>
>>  java org.apache.lucene.index.CheckIndex <pathToIndex>
>>
>> (Without -fix, CheckIndex will make no changes to the index, but it's
>> best to do this on a copy of the index to be supremely safe).
>>
>> Mike
>>
>> lucene@digiatlas.org wrote:
>>
>>> Thanks Michael,
>>>
>>> There is no sorting on the result (adding a sort causes OOM well
>>> before the point at which it runs out with the default sort).
>>>
>>> There are no deleted docs - the index was created from a set of   
>>> docs and no adds or deletes have taken place.
>>>
>>> Memory isn't being consumed elsewhere in the system. It all comes   
>>> down to the Lucene call via Hibernate Search. We decided to split   
>>> our huge index into a set of several smaller indexes. Like the   
>>> original single index, each smaller index has one field which is   
>>> tokenized and the other fields have NO_NORMS set.
>>>
>>> The following, explicitly specifying just one index, works fine:
>>>
>>> org.hibernate.search.FullTextQuery fullTextQuery =   
>>> fullTextSession.createFullTextQuery( outerLuceneQuery,   
>>> MarcText2.class );
>>>
>>> But as soon as we start adding further indexes:
>>>
>>> org.hibernate.search.FullTextQuery fullTextQuery =   
>>> fullTextSession.createFullTextQuery( outerLuceneQuery,   
>>> MarcText2.class, MarcText8.class );
>>>
>>> We start running into OOM.
>>>
>>> In our case the MarcText2 index has a total disk size of 5 GB
>>> (with 57589069 documents / 75491779 terms) and MarcText8 has a total
>>> size of 6.46 GB (with 79339982 documents / 104943977 terms).
>>>
>>> Adding all 8 indexes (the same as our original single index),
>>> either by explicitly naming them or just with:
>>>
>>> org.hibernate.search.FullTextQuery fullTextQuery =   
>>> fullTextSession.createFullTextQuery( outerLuceneQuery);
>>>
>>> results in it becoming completely unusable.
>>>
>>>
>>> One thing I am not sure about: for an index (not one of the indexes
>>> mentioned above) that was created with NO_NORMS set on all its
>>> fields, Luke reports:
>>>
>>> "Index functionality: lock-less, single norms, shared doc store,   
>>> checksum, del count, omitTf"
>>>
>>> Is this correct? I am not sure what it means by "single norms" -- I
>>> would have expected it to say "no norms".
>>>
>>>
>>> Any further ideas on where to go from here? Your estimate of what
>>> is loaded into memory suggests that we shouldn't be anywhere near
>>> running out of memory with indexes of this size!
>>>
>>> As I said in my OP, Luke also gets a heap error searching our
>>> original single large index, which makes me wonder whether the
>>> problem lies in how the index was constructed.
>>>
>>>
>>>
>>> Quoting Michael McCandless <lucene@mikemccandless.com>:
>>>
>>>>
>>>> Lucene is trying to allocate the contiguous norms array for your index,
>>>> which should be ~273 MB (= 286 million docs x 1 byte / 1024 / 1024),
>>>> when it hits the OOM.
>>>>
>>>> Is your search sorting by field value?  (Which'd also consume memory.)
>>>> Or is it just the default (by relevance) sort?
>>>>
>>>> The only other biggish consumer of memory should be the deleted docs,
>>>> but that's a BitVector, so it should only need ~34 MB of RAM.
>>>>
>>>> Can you run a memory profiler to see what else is consuming RAM?
>>>>
>>>> Mike
>>>>
>>>> lucene@digiatlas.org wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am using Lucene via Hibernate Search, but the following problem is
>>>>> also seen using Luke. I'd appreciate any suggestions for solving it.
>>>>>
>>>>> I have a Lucene index (27 GB in size) that indexes a database table
>>>>> of 286 million rows. While Lucene was able to perform this indexing
>>>>> just fine (albeit very slowly), using the index has proved to be
>>>>> impossible. Any searches conducted on it, either from my Hibernate
>>>>> Search query or by placing the query into Luke, give:
>>>>>
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>> at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
>>>>> at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>>>> at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
>>>>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>>>> ...
>>>>>
>>>>>
>>>>> The queries are simple, of the form:
>>>>>
>>>>> (+value:church +marcField:245 +subField:a)
>>>>>
>>>>> which in this example should only return a few thousand results.
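>>>>>
>>>>> (In plain Lucene terms that should be nothing more than a BooleanQuery
>>>>> of three required TermQuerys -- roughly:
>>>>>
>>>>>     BooleanQuery q = new BooleanQuery();
>>>>>     q.add(new TermQuery(new Term("value", "church")), BooleanClause.Occur.MUST);
>>>>>     q.add(new TermQuery(new Term("marcField", "245")), BooleanClause.Occur.MUST);
>>>>>     q.add(new TermQuery(new Term("subField", "a")), BooleanClause.Occur.MUST);
>>>>>
>>>>> -- so the query itself shouldn't be the problem.)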
>>>>>
>>>>>
>>>>> The JVM is already running with the maximum heap space allowed for
>>>>> the Java executable on Windows XP (java -Xms1200m -Xmx1200m).
>>>>>
>>>>>
>>>>> The Lucene index was created using the following Hibernate  
>>>>> Search   annotations:
>>>>>
>>>>> @Column
>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>>> store=Store.NO)
>>>>> private Integer marcField;
>>>>>
>>>>> @Column (length = 2)
>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>>> store=Store.NO)
>>>>> private String subField;
>>>>>
>>>>> @Column(length = 2)
>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>>> store=Store.NO)
>>>>> private String indicator1;
>>>>>
>>>>> @Column(length = 2)
>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>>> store=Store.NO)
>>>>> private String indicator2;
>>>>>
>>>>> @Column(length = 10000)
>>>>> @Field(index=org.hibernate.search.annotations.Index.TOKENIZED,    
>>>>> store=Store.NO)
>>>>> private String value;
>>>>>
>>>>> @Column
>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,    
>>>>> store=Store.NO)
>>>>> private Integer recordId;
>>>>>
>>>>>
>>>>> So all of the fields have NO_NORMS except for "value", which
>>>>> contains description text that needs to be tokenised.
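>>>>>
>>>>> As far as I understand it, that should correspond at the Lucene level to
>>>>> roughly the following (a sketch of the intent, not the code Hibernate
>>>>> Search actually generates; "descriptionText" is just a placeholder):
>>>>>
>>>>>     Document doc = new Document();
>>>>>     doc.add(new Field("marcField", "245", Field.Store.NO, Field.Index.NO_NORMS));          // no norms
>>>>>     doc.add(new Field("subField", "a", Field.Store.NO, Field.Index.NO_NORMS));             // no norms
>>>>>     doc.add(new Field("value", descriptionText, Field.Store.NO, Field.Index.TOKENIZED));   // tokenised, norms kept
>>>>>
>>>>> so only "value" should contribute a norms array.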
>>>>>
>>>>> Is there any way around this? Does Lucene really have such a low
>>>>> limit for how much data it can search (and I consider 286 million
>>>>> documents to be pretty small beer - we were hoping to index a table
>>>>> of over a billion rows)? Or is there something I'm missing?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>

