lucene-java-user mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Question regarding sorting and memory consumption in lucene
Date Fri, 10 Oct 2008 13:18:46 GMT
Assuming content is added in chronological order, with no updates to existing docs, couldn't
you rely on the internal Lucene document id to give a chronological sort order?
That would require no memory cache at all when sorting.

Querying across multiple indexes simultaneously, however, may present an added complication...
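
A minimal sketch of the idea in plain Java, not Lucene code: the array position stands in for the internal document id, and the class name, method name and sample data are made up for illustration. It assumes documents are only ever appended in date order. In Lucene itself, I believe the matching sort is Sort.INDEXORDER, which orders hits by ascending doc id.

```java
import java.util.Arrays;

// Illustrative only: the array position stands in for Lucene's internal doc id.
final class IndexOrderSketch {

    // Returns up to `limit` positions of entries containing `term`, newest first.
    // Walks the append-only array backwards, so no per-document sort cache is built.
    static int[] newestMatches(String[] bodies, String term, int limit) {
        int[] hits = new int[Math.min(Math.max(limit, 0), bodies.length)];
        int found = 0;
        for (int docId = bodies.length - 1; docId >= 0 && found < hits.length; docId--) {
            if (bodies[docId].contains(term)) {
                hits[found++] = docId;
            }
        }
        return Arrays.copyOf(hits, found);
    }

    public static void main(String[] args) {
        // Hypothetical data, appended in chronological order.
        String[] dates = {"2008-10-08", "2008-10-09", "2008-10-09", "2008-10-10"};
        String[] bodies = {"lucene sort", "solr", "lucene cache", "lucene memory"};
        for (int docId : newestMatches(bodies, "lucene", 2)) {
            System.out.println(docId + " " + dates[docId]);
        }
    }
}
```

The shortcut only holds inside a single index: doc ids from two separate indexes are not comparable, so merging hits across indexes still needs the date itself. An update would also break it, since the re-added document would get a new, higher id.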



----- Original Message ----
From: Aleksander M. Stensby <aleksander.stensby@integrasco.no>
To: java-user@lucene.apache.org
Sent: Friday, 10 October, 2008 13:51:50
Subject: Re: Question regarding sorting and memory consumption in lucene

I'll follow up on my own question...
Let's say that we have 4 years of data, meaning that there will be roughly  
4 * 365 = 1460 unique terms for our sort field.
For one index, let's say with 30 million docs, the cache should use approx
100 MB, or am I wrong? And thus for 6 indexes we would need approx 600 MB
for the caches? (And an additional 100 MB every time we warm a new searcher
and swap it out...) As far as string versus int or long goes, I don't
really see any big gain in changing it, since 1460 * 10 bytes of extra memory
doesn't really make much difference. Or does it?
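
The estimate can be written out as a back-of-the-envelope calculation. This is a sketch under stated assumptions, not a measurement: it assumes the string sort cache holds one 4-byte int ordinal per document plus one String per unique term, and that each String costs roughly 40 bytes of JVM overhead plus 2 bytes per character. Both constants and the class name are assumptions for illustration.

```java
// Rough sizing of a per-index string sort cache. All constants are assumptions.
final class SortCacheEstimate {

    // Assumption: one 4-byte int ordinal per document in the index.
    static final long BYTES_PER_ORDINAL = 4;
    // Assumption: approximate JVM overhead of one String object, excluding its chars.
    static final long STRING_OVERHEAD_BYTES = 40;

    // Memory for the per-document ordinal array.
    static long ordinalBytes(long maxDoc) {
        return maxDoc * BYTES_PER_ORDINAL;
    }

    // Memory for the table of unique term strings (2 bytes per char).
    static long termBytes(long uniqueTerms, int charsPerTerm) {
        return uniqueTerms * (STRING_OVERHEAD_BYTES + 2L * charsPerTerm);
    }

    static long totalBytes(long maxDoc, long uniqueTerms, int charsPerTerm) {
        return ordinalBytes(maxDoc) + termBytes(uniqueTerms, charsPerTerm);
    }

    public static void main(String[] args) {
        // 30 million docs, 4 * 365 = 1460 unique dates of 10 chars ("2008-10-10").
        long perIndex = totalBytes(30_000_000L, 1460, 10);
        System.out.println("per index: " + perIndex + " bytes");
        System.out.println("six such indexes: " + 6 * perIndex + " bytes");
    }
}
```

Under these assumptions the per-document array dominates (120,000,000 bytes against 87,600 bytes for the terms), which matches the guess that switching the field from string to int would save very little. The indexes here range from 10 to 30 million docs, so six times the 30-million figure is an upper bound.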

I guess we should consider reducing the index size, or at least only allow
sorted search on a subset of the index (or a pruned version of the
index...)? Would that be better for us?
But then again, I assume that there are much larger Lucene-based indexes
out there than ours, and you guys must have some solution to this issue,
right? :)

best regards,
  Aleksander


On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby  
<aleksander.stensby@integrasco.no> wrote:

> Hello, I've read a lot of threads now on memory consumption and sorting,
> and I think I have a pretty good understanding of how things work, but I
> could still use some input here.
>
> We currently have a system consisting of 6 different Lucene indexes (all
> have the same structure, so you could say it is a form of sharding). We
> use this approach because we want to be able to give users access to
> different indexes (but not necessarily all of them).
>
> (We are planning to move to a Solr-based system, but for now we would
> like to solve this issue with our current Lucene-based system.)
>
> The thing is, the indexes are rather big (ranging from 5 GB to 20 GB per
> index and 10 - 30 million entries per index).
> We keep one searcher object open per index, and when the index is
> changed (new documents are added in batches several times a day), we update
> the searcher objects.
> In the warmup procedure we did a couple of searches and things worked
> fine, BUT I realized that in our application we return hits sorted by
> date by default, and our warmup procedure did non-sorted queries... so
> the first searches done by a user after an update were still slow
> (obviously).
>
> To cope, I changed the warmup procedure to include a sorted search, and  
> now the user will not notice slow queries. Good!
> But the problem at hand is that we are running into memory problems
> (and I understand that sorting does consume a lot of memory...). Is
> there a "best practice" way to deal with this? The field we
> sort on is an un_indexed text field representing the date, typically
> "2008-10-10". I am aware that string field sorting consumes a lot of
> memory, so should we change this field to something different? Would
> this help us with the memory problems?
>
> As a side note / curiosity question: Does it matter if we use the search
> method returning Hits versus the search method returning TopFieldDocs?
> (We are not iterating them in any way when this memory issue occurs.)
>
> Thanks in advance for any guidance we may get.
>
> Best regards,
>   Aleksander M. Stensby
>
>
>



-- 
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
aleksander.stensby@integrasco.no

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


      


