lucene-general mailing list archives

From Jokin Cuadrado <joki...@gmail.com>
Subject Re: problems with large Lucene index
Date Wed, 11 Mar 2009 16:16:53 GMT
I had a look at the exception and saw that Hibernate is using a function
called updateTopDocs to run the search. This function uses a filter, which
may add some extra memory requirements. Could you create a filter with the
standalone Lucene version as well?
Something like this before the search:
QueryFilter queryfilt = new QueryFilter(query);
and then pass the filter to the search.
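
A minimal sketch of how that could look in your standalone test, assuming the
same Lucene 2.x API as the "lucky" class further down (the filter here simply
wraps the same query):

    Term term = new Term("value", "church");
    Query query = new TermQuery(term);
    QueryFilter queryfilt = new QueryFilter(query); // filter built from the same query
    IndexSearcher searcher = new IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
    Hits hits = searcher.search(query, queryfilt);  // pass the filter to the search
    System.out.println("hits = " + hits.length());
    searcher.close();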

However, with such a huge index I think you will have to find a way to get
some more memory for the Java heap.
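
For example (only a sketch, and it assumes a 64-bit JVM is available; on
32-bit Windows the heap tops out at roughly 1.2-1.5 GB, which is why
-Xmx1200m is about the ceiling there):

E:\>java -Xmx4g -classpath .;lucene-core.jar lucky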

hibernate queryhits:
http://viewvc.jboss.org/cgi-bin/viewvc.cgi/hibernate/search/trunk/src/java/org/hibernate/search/query/QueryHits.java?view=markup&pathrev=15603


On Wed, Mar 11, 2009 at 4:31 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Unfortunately, I'm not familiar with exactly what Hibernate search does
> with the Lucene APIs.
>
> It must be doing something beyond what your standalone Lucene test case
> does.
>
> Maybe ask this question on the Hibernate list?
>
>
> Mike
>
> lucene@digiatlas.org wrote:
>
>  Thanks for the advice.
>>
>> I haven't got around to profiling the code. Instead, I took your advice
>> and knocked Hibernate out of the equation with a small stand-alone program
>> that calls Lucene directly. I then wrote a similar stand-alone using
>> Hibernate Search to do the same thing.
>>
>> On a small index both work fine:
>>
>> E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
>> hits = 29410
>>
>> E:\hibtest>java -Xmx1200m -classpath
>> .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar
>> hibtest
>> size = 29410
>>
>>
>> Trying it on our huge index works for the straight Lucene version:
>>
>> E:\>java -Xmx1200M  -classpath .;lucene-core.jar lucky
>> hits = 320500
>>
>>
>> but fails for the Hibernate version:
>>
>> E:\hibtest>java -Xmx1200m -classpath
>> .;lib/antlr-2.7.6.jar;lib/commons-collections-3.1.jar;lib/dom4j.jar;lib/ejb3-persistence.jar;lib/hibernate-commons-annotations.jar;lib/hibernate-core.jar;lib/javassist-3.4.GA.jar;lib/jms.jar;lib/jsr250-api.jar;lib/jta-1.1.jar;lib/jta.jar;lib/lucene-core.jar;lib/slf4j-api-1.5.2.jar;lib/slf4j-api.jar;lib/solr-common.jar;lib/solr-core.jar;lib/hibernate-annotations.jar;lib/hibernate-search.jar;lib/slf4jlog4j12.jar;lib/log4j.jar;lib\hibernate-c3p0.jar;lib/mysql-connector-java.jar;lib/c3p0-0.9.1.jar
>> hibtest
>> Exception in thread "main" java.lang.OutOfMemoryError
>>       at java.io.RandomAccessFile.readBytes(Native Method)
>>       at java.io.RandomAccessFile.read(Unknown Source)
>>       at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
>>       at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>>       at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
>>       at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
>>       at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
>>       at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:907)
>>       at org.apache.lucene.index.MultiSegmentReader.norms(MultiSegmentReader.java:352)
>>       at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>>       at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>>       at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>>       at org.hibernate.search.query.QueryHits.updateTopDocs(QueryHits.java:100)
>>       at org.hibernate.search.query.QueryHits.<init>(QueryHits.java:61)
>>       at org.hibernate.search.query.FullTextQueryImpl.getQueryHits(FullTextQueryImpl.java:354)
>>       at org.hibernate.search.query.FullTextQueryImpl.getResultSize(FullTextQueryImpl.java:741)
>>       at hibtest.main(hibtest.java:45)
>>
>> E:\hibtest>
>>
>>
>> I am not sure why this is occurring. Any ideas? I am calling
>> IndexSearcher.search() and so is Hibernate. Is Hibernate Search telling
>> Lucene to try to read the entire index into memory?
>>
>>
>> Code for the Lucene version is:
>>
>> public class lucky
>> {
>>        public static void main(String[] args)
>>        {
>>                try
>>                {
>>                        Term term = new Term("value", "church");
>>                        Query query = new TermQuery(term);
>>                        IndexSearcher searcher = new
>> IndexSearcher("E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText");
>>                        Hits hits = searcher.search(query);
>>
>>                        System.out.println("hits = "+hits.length());
>>
>>                        searcher.close();
>>                }
>>                catch (Exception e)
>>                {
>>                        e.printStackTrace();
>>                }
>>        }
>> }
>>
>> and for the Hibernate Search version:
>>
>> public class hibtest {
>>
>>   public static void main(String[] args) {
>>       hibtest mgr = new hibtest();
>>
>>       Session session =
>> HibernateUtil.getSessionFactory().getCurrentSession();
>>
>>       session.beginTransaction();
>>
>>        FullTextSession fullTextSession =
>> Search.getFullTextSession(session);
>>        TermQuery luceneQuery = new TermQuery(new Term("value", "church"));
>>
>>        org.hibernate.search.FullTextQuery fullTextQuery =
>> fullTextSession.createFullTextQuery( luceneQuery, MarcText.class );
>>
>>        long resultSize = fullTextQuery.getResultSize(); // this is line 45
>>
>>        System.out.println("size = "+resultSize);
>>
>>       session.getTransaction().commit();
>>
>>       HibernateUtil.getSessionFactory().close();
>>   }
>>
>> }
>>
>>
>>
>> Quoting Michael McCandless <lucene@mikemccandless.com>:
>>
>>
>>> At this point, I'd recommend running with a memory profiler, eg
>>> YourKit, and posting the resulting output.
>>>
>>> With norms only on one field, no deletions, and no field sorting, I
>>> can't see why you're running out of memory.
>>>
>>> If you take Hibernate out of the picture, and simply open an
>>> IndexSearcher on the underlying index, do you still hit OOM?
>>>
>>> Can you post the output of CheckIndex?  You can run it from the command
>>> line:
>>>
>>>  java org.apache.lucene.index.CheckIndex <pathToIndex>
>>>
>>> (Without -fix, CheckIndex will make no changes to the index, but it's
>>> best to do this on a copy of the index to be supremely safe).
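>>>
>>> For example (just a sketch, assuming lucene-core.jar is on the classpath
>>> and using the index path from your standalone test):
>>>
>>>  java -cp lucene-core.jar org.apache.lucene.index.CheckIndex E:/lucene/indexes/uk.bl.dportal.marcdb.MarcText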
>>>
>>> Mike
>>>
>>> lucene@digiatlas.org wrote:
>>>
>>>  Thanks Michael,
>>>>
>>>> There is no sorting on the result (adding a sort causes OOM well  before
>>>> the point it runs out for the default).
>>>>
>>>> There are no deleted docs - the index was created from a set of  docs
>>>> and no adds or deletes have taken place.
>>>>
>>>> Memory isn't being consumed elsewhere in the system. It all comes  down
>>>> to the Lucene call via Hibernate Search. We decided to split  our huge index
>>>> into a set of several smaller indexes. Like the  original single index, each
>>>> smaller index has one field which is  tokenized and the other fields have
>>>> NO_NORMS set.
>>>>
>>>> The following, explicitly specifying just one index, works fine:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>  fullTextSession.createFullTextQuery( outerLuceneQuery, MarcText2.class );
>>>>
>>>> But as soon as we start adding further indexes:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>  fullTextSession.createFullTextQuery( outerLuceneQuery,  MarcText2.class,
>>>> MarcText8.class );
>>>>
>>>> We start running into OOM.
>>>>
>>>> In our case the MarcText2 index has a total disk size of 5Gb (with
>>>>  57589069 documents / 75491779 terms) and MarcText8 has a total size  of
>>>> 6.46Gb (with 79339982 documents / 104943977 terms).
>>>>
>>>> Adding all 8 indexes (the same as our original single index), either by
>>>> explicitly naming them or just with:
>>>>
>>>> org.hibernate.search.FullTextQuery fullTextQuery =
>>>>  fullTextSession.createFullTextQuery( outerLuceneQuery);
>>>>
>>>> results in it becoming completely unusable.
>>>>
>>>>
>>>> One thing I am not sure about is that in Luke it tells me for an index
>>>> (neither of the indexes mentioned above) that was created with NO_NORMS set
>>>> on all the fields:
>>>>
>>>> "Index functionality: lock-less, single norms, shared doc store,
>>>>  checksum, del count, omitTf"
>>>>
>>>> Is this correct?  I am not sure what it means by "single norms" - I
>>>>  would have expected it to say "no norms".
>>>>
>>>>
>>>> Any further ideas on where to go from here? Your estimate of what  is
>>>> loaded into memory suggests that we shouldn't really be anywhere  near
>>>> running out of memory with these size indexes!
>>>>
>>>> As I said in my OP, Luke also gets a heap error on searching our
>>>>  original single large index which makes me wonder if it is a  problem with
>>>> the construction of the index.
>>>>
>>>>
>>>>
>>>> Quoting Michael McCandless <lucene@mikemccandless.com>:
>>>>
>>>>
>>>>> Lucene is trying to allocate the contiguous norms array for your index,
>>>>> which should be ~273 MB (1 byte per document x 286 million documents),
>>>>> when it hits the OOM.
>>>>>
>>>>> Is your search sorting by field value?  (Which'd also consume memory.)
>>>>> Or it's just the default (by relevance) sort?
>>>>>
>>>>> The only other biggish consumer of memory should be the deleted docs,
>>>>> but that's a BitVector so it should need ~34 MB RAM.
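>>>>>
>>>>> A rough back-of-the-envelope check of those figures (a sketch only,
>>>>> assuming 286 million documents, 1 byte per document for norms, and 1 bit
>>>>> per document in the deletions BitVector):
>>>>>
>>>>>  long docs = 286000000L;
>>>>>  long normsBytes = docs;        // 1 byte per doc for the single normed field
>>>>>  long deletedBytes = docs / 8;  // 1 bit per doc for deletions
>>>>>  System.out.println(normsBytes / (1024 * 1024) + " MB");   // ~272 MB
>>>>>  System.out.println(deletedBytes / (1024 * 1024) + " MB"); // ~34 MB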
>>>>>
>>>>> Can you run a memory profiler to see what else is consuming RAM?
>>>>>
>>>>> Mike
>>>>>
>>>>> lucene@digiatlas.org wrote:
>>>>>
>>>>>  Hello,
>>>>>>
>>>>>> I am using Lucene via Hibernate Search but the following problem is
>>>>>> also seen using Luke. I'd appreciate any suggestions for solving this
>>>>>> problem.
>>>>>>
>>>>>> I have a Lucene index (27Gb in size) that indexes a database table
>>>>>> of 286 million rows. While Lucene was able to perform this indexing just
>>>>>> fine (albeit very slowly), using the index has proved to be impossible.
>>>>>> Any searches conducted on it, either from my Hibernate Search query or by
>>>>>> placing the query into Luke give:
>>>>>>
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>> at org.apache.lucene.index.MultiReader.norms(MultiReader.java:271)
>>>>>> at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
>>>>>> at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:230)
>>>>>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> The type of queries are simple, of the form:
>>>>>>
>>>>>> (+value:church +marcField:245 +subField:a)
>>>>>>
>>>>>> which in this example should only return a few thousand results.
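>>>>>>
>>>>>> For reference, that query corresponds to roughly the following in the
>>>>>> Lucene API (a sketch only, using the field names above):
>>>>>>
>>>>>>  BooleanQuery query = new BooleanQuery();
>>>>>>  query.add(new TermQuery(new Term("value", "church")), BooleanClause.Occur.MUST);
>>>>>>  query.add(new TermQuery(new Term("marcField", "245")), BooleanClause.Occur.MUST);
>>>>>>  query.add(new TermQuery(new Term("subField", "a")), BooleanClause.Occur.MUST);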
>>>>>>
>>>>>>
>>>>>> The interpreter is already running with the maximum heap space allowed
>>>>>> for the Java executable on Windows XP (java -Xms1200m -Xmx1200m).
>>>>>>
>>>>>>
>>>>>> The Lucene index was created using the following Hibernate Search
>>>>>> annotations:
>>>>>>
>>>>>> @Column
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>>>>>> store=Store.NO)
>>>>>> private Integer marcField;
>>>>>>
>>>>>> @Column (length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>>>>>> store=Store.NO)
>>>>>> private String subField;
>>>>>>
>>>>>> @Column(length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>>>>>> store=Store.NO)
>>>>>> private String indicator1;
>>>>>>
>>>>>> @Column(length = 2)
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>>>>>> store=Store.NO)
>>>>>> private String indicator2;
>>>>>>
>>>>>> @Column(length = 10000)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.TOKENIZED,
>>>>>> store=Store.NO)
>>>>>> private String value;
>>>>>>
>>>>>> @Column
>>>>>> @Analyzer(impl=org.apache.lucene.analysis.SimpleAnalyzer.class)
>>>>>> @Field(index=org.hibernate.search.annotations.Index.NO_NORMS,
>>>>>> store=Store.NO)
>>>>>> private Integer recordId;
>>>>>>
>>>>>>
>>>>>> So all of the fields have NO_NORMS except for "value", which contains
>>>>>> description text that needs to be tokenised.
>>>>>>
>>>>>> Is there any way around this?  Does Lucene really have such a low
>>>>>> limit for how much data it can search (and I consider 286 million
>>>>>> documents to be pretty small beer - we were hoping to index a table of
>>>>>> over a billion rows)? Or is there something I'm missing?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>


-- 
Jokin
