lucene-java-user mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: OutOfMemoryError on addIndexes()
Date Wed, 17 Aug 2005 02:38:59 GMT
In many use cases a date field is indexed. If the date values have
millisecond granularity, the number of unique terms in the index can be
huge.

If that is indeed the case here, it is a potential scalability bottleneck
for Lucene index size.
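
If day-level precision is enough for the queries, one way to keep the
unique-term count bounded is to index a truncated date string instead of a
raw timestamp. A minimal sketch against the 1.4-era Field.Keyword API (the
field name and helper class here are only illustrative):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DayGranularityDate {

        // Day granularity: at most one new term per day, rather than one per millisecond.
        private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

        // Adds an untokenized, indexed date field such as "20050817".
        // Terms in this form sort and range-query lexicographically.
        public static void addDayField(Document doc, String name, Date date) {
            doc.add(Field.Keyword(name, DAY.format(date)));
        }
    }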

Thanks

-John

On 8/12/05, Chris Hostetter <hossman_lucene@fucit.org> wrote:
> 
> Okay, just for the record, I'm currently on vacation and I don't have
> access to any of my indexes at work to make a comparison, but the
> number of unique terms in your index (which is, I'm 99% sure, what
> indexEnum.size represents in the code you cited) seems HUGE!!!
> 
> You haven't given us a lot of details about what your index contains
> (i.e. the nature of the documents)... in fact, for the number of terms you
> cite (811806819), the only info we have is that the index containing that
> number of terms is 29MB in size -- no idea how many documents are in that
> index.  But if we look at your previous email, you mentioned having another
> index that causes the same problem which is 120MB, which you built from
> 11359 files.  If we assume that index has no more than the same number of
> unique terms indexed (which seems unlikely, but let's give it the benefit
> of the doubt and assume the added size is all stored fields), and assume
> that you made one document per file, and that those files are 100% unique
> from each other and contain no terms in common -- that means that each
> file contains roughly 71,500 unique terms.
> 
> that seems like a lot.
> 
> A quick google search tells me that the English language contains
> somewhere from 500,000 to 1,000,000 words -- your index has 800 times that
> many terms.  Even assuming you index a lot of numerical or date-based data
> -- that seems like a lot.
> 
> I have to wonder if maybe you are indexing a lot of junk information by
> mistake - perhaps some binary data is mistakenly getting treated as
> strings?
> 
> Can you tell us more about the nature of your indexes?
> 
> 
> : Date: Fri, 12 Aug 2005 09:45:40 +0200
> : From: Trezzi Michael <MTrezzi@CSAS.CZ>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: RE: OutOfMemoryError on addIndexes()
> :
> : I did some more research and these are the results.
> : The OutOfMemoryError occurs on line 82 of class TermInfosReader.java. That
> : and two other lines are trying to create an array whose size is obtained by
> :
> : int indexSize = (int)indexEnum.size;
> :
> : For my 29MB index this indexSize integer is 811806819. So creating 3
> : arrays of this size (lines 82-84) requires an enormous amount of
> : memory; as a model situation, take a char array: 2 bytes per char *
> : 811806819 => 1584MB. That seems to be a little much, and the objects
> : stored in those arrays (Term, TermInfo and long) are definitely not
> : simple chars. This way I would need several gigabytes of memory to merge
> : even a few small (30MB) indexes. Is this the standard way it works, or is
> : there a problem on my side?
> :
> : Thanks,
> :
> : Michael
> :
> : ________________________________
> :
> : From: Ian Lea [mailto:ian.lea@gmail.com]
> : Sent: Wed 10 Aug 2005 12:34
> : To: java-user@lucene.apache.org
> : Subject: Re: OutOfMemoryError on addIndexes()
> :
> :
> :
> : How much memory are you giving your programs?
> :
> :  java    -Xmx<size>        set maximum Java heap size
> :
> : --
> : Ian.
> :
> : On 10/08/05, Trezzi Michael <MTrezzi@csas.cz> wrote:
> : > Hello,
> : > I have a problem and I tried everything I could think of to solve it. To
> : > understand my situation: I create indexes on several computers on our
> : > network and they are copied to one server. There, once a day, they are
> : > merged into one masterIndex, which is then searched. The problem is in
> : > merging. I use the following code:
> : >
> : > Directory[] ar = new Directory[fileList.length];
> : >        for(int i=0; i<fileList.length;i++) {
> : >            ar[i] = FSDirectory.getDirectory(fileList[i], false);
> : >        }
> : >        writer.addIndexes(ar);
> : >        for(int i=0; i<fileList.length;i++) {
> : >            ar[i].close();
> : >        }
> : >       writer.optimize();
> : >       writer.close();
> : >
> : > I also tried a longer way of opening every index separately and adding it
> : > document by document. The problem is I am getting OutOfMemory errors on
> : > this. When I use the per-document way, it happens on the IndexReader.open
> : > call, and only on indexes of approx 100MB+ (the largest index I have is
> : > only about 150MB). When I run it on a Windows machine with JDK 1.5 I get
> : > the following:
> : >     Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> : > On Linux I am running 1.4 and I get the message without the array size
> : > information.
> : >
> : > I also tried it on a test index that was made from 11359 files (1.59GB)
> : > and is 120MB, and I got this error too. In my opinion a 120MB index is
> : > not that big. The machine it runs on is a Xeon 3.2GHz with 2GB of RAM, so
> : > that should be enough. Can you please help me?
> : >
> : > Thank you in advance,
> : >
> : > Michael Trezzi
> :
> 
> 
> 
> -Hoss
> 
> 
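
For what it's worth, combining the -Xmx suggestion above with the original
merge snippet, a minimal self-contained sketch of the merge step looks like
this (class name, master index path, and heap size are illustrative, written
against the 1.4-era API):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Run with a larger heap, e.g.:  java -Xmx1024m IndexMerger index1 index2 ...
    public class IndexMerger {

        public static void main(String[] args) throws Exception {
            String[] fileList = args;  // paths of the partial indexes to merge

            IndexWriter writer = new IndexWriter("masterIndex", new StandardAnalyzer(), true);
            Directory[] dirs = new Directory[fileList.length];
            try {
                for (int i = 0; i < fileList.length; i++) {
                    dirs[i] = FSDirectory.getDirectory(fileList[i], false);
                }
                writer.addIndexes(dirs);  // merge all partial indexes into masterIndex
                writer.optimize();
            } finally {
                for (int i = 0; i < fileList.length; i++) {
                    if (dirs[i] != null) {
                        dirs[i].close();
                    }
                }
                writer.close();
            }
        }
    }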

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

