lucene-java-user mailing list archives

From "Aleksander M. Stensby" <>
Subject Re: Question regarding sorting and memory consumption in lucene
Date Fri, 10 Oct 2008 14:25:29 GMT
Unfortunately no, since the documents that are added may come from a new
"source" containing old documents as well..:/
I tried deploying our web application without any searcher objects and it
consumes basically ~200 MB of memory in Tomcat.
With 6 searchers the same application manages to consume over 2.5 GB of
memory when warming... :(
I might have done some super-idiotic logic in the way I handle searching,  
but I can seriously not see what that might be...

But I assume that people deal with much larger indexes than this, right?
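
The figures from the quoted follow-up below (30 million documents, 4 * 365 = 1460 unique sort terms, roughly 100 MB per index) can be checked with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: it assumes the sort cache holds one 4-byte int per document plus one string per unique term, and the per-term cost is a made-up round number. `SortCacheEstimate` and `estimateBytes` are hypothetical names, not part of Lucene.

```java
class SortCacheEstimate {
    // Assumption: one 4-byte int per document (an index into the term table).
    static final long BYTES_PER_DOC = 4;

    // Assumption: rough cost of one 10-character date string including
    // object overhead. Hypothetical figure, chosen only for illustration.
    static final long BYTES_PER_TERM = 64;

    // Estimated cache size in bytes for a single index.
    static long estimateBytes(long docs, long uniqueTerms) {
        return docs * BYTES_PER_DOC + uniqueTerms * BYTES_PER_TERM;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long perIndex = estimateBytes(30_000_000L, 4 * 365);
        long sixIndexes = 6 * perIndex;
        // While a new searcher warms, the old cache may still be referenced,
        // so one index can temporarily hold two copies.
        long sixPlusOneWarming = sixIndexes + perIndex;

        System.out.println("per index:           " + perIndex / mib + " MiB");
        System.out.println("six indexes:         " + sixIndexes / mib + " MiB");
        System.out.println("six plus one warming: " + sixPlusOneWarming / mib + " MiB");
    }
}
```

Under these assumptions the per-document array dominates (about 114 MiB for 30 million documents) and the 1460 terms contribute well under 1 MiB, which matches the observation that the term type makes little difference here. It also suggests the sort caches alone would not account for the full 2.5 GB.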


On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood <> wrote:

> Assuming content is added in chronological order and with no updates to
> existing docs, couldn't you rely on the internal Lucene document id to
> give a chronological sort order?
> That would require no memory cache at all when sorting.
> Querying across multiple indexes simultaneously however may present an  
> added complication...
> ----- Original Message ----
> From: Aleksander M. Stensby <>
> To:
> Sent: Friday, 10 October, 2008 13:51:50
> Subject: Re: Question regarding sorting and memory consumption in lucene
> I'll follow up on my own question...
> Let's say that we have 4 years of data, meaning that there will be
> roughly 4 * 365 = 1460 unique terms for our sort field.
> For one index, let's say with 30 million docs, the cache should use
> approx. 100 MB, or am I wrong? And thus for 6 indexes we would need
> approx. 600 MB for the caches? (And an additional 100 MB every time we
> warm a new searcher and swap it out...) As far as the string versus int
> or long goes, I don't really see any big gain in changing it, since
> 1460 * 10 bytes of extra memory doesn't really make much difference. Or?
> I guess we should consider reducing the index size or at least only allow
> sorted search on a subset of the index (or a pruned version of the
> index...) ? Would that be better for us?
> But then again, I assume that there are much larger lucene-based indexes
> out there than ours, and you guys must have some solution to this issue,
> right? :)
> best regards,
>   Aleksander
> On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby
> <> wrote:
>> Hello, I've read a lot of threads now on memory consumption and sorting,
>> and I think I have a pretty good understanding of how things work, but I
>> could still use some input here.
>> We currently have a system consisting of 6 different lucene indexes (all
>> have the same structure, so you could say it is a form of sharding). We
>> currently use this approach because we want to be able to give users
>> access to different indexes (but not necessarily all indexes), etc.
>> (We are planning to move to a solr-based system, but for now we would
>> like to solve this issue with our current lucene-based system.)
>> The thing is, the indexes are rather big (ranging from 5 GB to 20 GB
>> per index and 10 - 30 million entries per index).
>> We keep one searcher object open per index, and when the index is
>> changed (new documents added in batches several times a day), we update
>> the searcher objects.
>> In the warmup procedure we did a couple of searches and things worked
>> fine, BUT I realized that in our application we return hits sorted by
>> date by default, and our warmup procedure did non-sorted queries... so
>> the first searches done by the user after an update were still slow
>> (obviously).
>> To cope, I changed the warmup procedure to include a sorted search, and
>> now the user will not notice slow queries. Good!
>> But, the problem at hand is that we are running into memory problems
>> (and I understand that sorting does consume a lot of memory...) But is
>> there any way that is "best practice" to deal with this? The field we
>> sort on is an un_indexed text field representing the date, typically
>> "2008-10-10". I am aware that string field sorting consumes a lot of
>> memory, so should we change this field to something different? Would
>> this help us with the memory problems?
>> As a side note / curiosity question: Does it matter if we use the search
>> method returning Hits versus the search method returning TopFieldDocs?
>> (we are not iterating them in any way when this memory issue occurs)
>> Thanks in advance for any guidance we may get.
>> Best regards,
>>   Aleksander M. Stensby
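
The original question above asks whether the "2008-10-10" string field should be changed to something different. One common option is an integer of the form yyyyMMdd, which sorts numerically in the same order as the dates. The sketch below only shows the encoding. It makes no claim about how much memory a given Lucene version would save, and with only about 1460 unique values the saving may be small. `DateSortKey` and `toSortKey` are hypothetical names, not Lucene API.

```java
import java.time.LocalDate;
import java.util.Arrays;

class DateSortKey {
    // Encode an ISO date such as "2008-10-10" as the int 20081010.
    // Numeric order of the keys equals chronological order of the dates.
    static int toSortKey(String isoDate) {
        LocalDate d = LocalDate.parse(isoDate);
        return d.getYear() * 10_000 + d.getMonthValue() * 100 + d.getDayOfMonth();
    }

    public static void main(String[] args) {
        String[] dates = {"2008-10-10", "2005-01-31", "2007-12-01"};
        int[] keys = Arrays.stream(dates)
                .mapToInt(DateSortKey::toSortKey)
                .sorted()
                .toArray();
        System.out.println(Arrays.toString(keys)); // [20050131, 20071201, 20081010]
    }
}
```

The value would be computed at indexing time and stored in the sort field in place of the string. The field name and how it is indexed are left open here.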

Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
