Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <008e01c92d31$30eb6370$330bc30a@sv.us.sonicwall.com>
From: "Ganesh" <emailgane@yahoo.co.in>
To: <java-user@lucene.apache.org>
References: <915178.28087.qm@web26005.mail.ukl.yahoo.com>
Subject: Re: Question regarding sorting and memory consumption in lucene
Date: Mon, 13 Oct 2008 18:12:32 +0530
MIME-Version: 1.0
Content-Type: text/plain;
	format=flowed;
	charset="utf-8";
	reply-type=original
Content-Transfer-Encoding: 7bit

Hello Mark,

I am also facing the same sorting issue.
In my case there will be only additions and deletions of data [no modification 
of existing records]. Could I rely on index order for sorting?

Is "SortField.FIELD_DOC" the one that sorts by index order?

Regards
Ganesh


----- Original Message ----- 
From: "mark harwood" <markharw00d@yahoo.co.uk>
To: <java-user@lucene.apache.org>
Sent: Friday, October 10, 2008 6:48 PM
Subject: Re: Question regarding sorting and memory consumption in lucene


Assuming content is added in chronological order and with no updates to 
existing docs, couldn't you rely on the internal Lucene document id to give a 
chronological sort order?
That would require no memory cache at all when sorting.

Querying across multiple indexes simultaneously however may present an added 
complication...
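
To make the suggestion above concrete, here is a toy model in plain Java. It is not Lucene code: the class IndexOrderDemo and its methods are hypothetical names invented for this sketch. It only illustrates the assumption behind the idea, namely that when documents are appended in chronological order and never updated, a scan in internal id order already yields date order, and that deletions remove entries without reordering the survivors. In Lucene itself, this ordering is what SortField.FIELD_DOC (or Sort.INDEXORDER) requests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model (not Lucene code): an append-only index where
// internal doc ids are assigned in arrival order.
class IndexOrderDemo {

    static final class Doc {
        final int id;       // internal id, assigned when the doc is added
        final String date;  // "yyyy-MM-dd", sorts chronologically as text
        boolean deleted;

        Doc(int id, String date) {
            this.id = id;
            this.date = date;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Appends a document and returns its internal id. */
    int add(String date) {
        Doc d = new Doc(docs.size(), date);
        docs.add(d);
        return d.id;
    }

    /** Marks a document as deleted; the remaining docs keep their order. */
    void delete(int id) {
        docs.get(id).deleted = true;
    }

    /** Live docs in index order: a plain scan, no per-field cache. */
    List<String> datesInIndexOrder() {
        List<String> out = new ArrayList<>();
        for (Doc d : docs) {
            if (!d.deleted) {
                out.add(d.date);
            }
        }
        return out;
    }

    /** Live docs sorted by the date value, for comparison. */
    List<String> datesSortedByField() {
        List<String> out = datesInIndexOrder();
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        IndexOrderDemo index = new IndexOrderDemo();
        index.add("2008-10-08");
        int second = index.add("2008-10-09");
        index.add("2008-10-10");
        index.add("2008-10-13");
        index.delete(second);
        // Both orderings agree as long as docs arrive chronologically.
        System.out.println(index.datesInIndexOrder().equals(index.datesSortedByField()));
    }
}
```

The equivalence breaks as soon as a document arrives out of date order, or when results from several indexes are combined, since doc ids are only meaningful within one index.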


----- Original Message ----
From: Aleksander M. Stensby <aleksander.stensby@integrasco.no>
To: java-user@lucene.apache.org
Sent: Friday, 10 October, 2008 13:51:50
Subject: Re: Question regarding sorting and memory consumption in lucene

I'll follow up on my own question...
Let's say that we have 4 years of data, meaning that there will be roughly
4 * 365 = 1460 unique terms for our sort field.
For one index, let's say with 30 million docs, the cache should use approx
100 MB, or am I wrong? And thus for 6 indexes we would need approx 600 MB
for the caches? (And an additional 100 MB every time we warm a new searcher
and swap it out...) As far as string versus int or long goes, I don't
really see any big gain in changing it, since 1460 * 10 bytes of extra memory
doesn't really make much difference. Or?
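
The arithmetic above can be sketched as a small Java program. It assumes the string sort cache holds one 4-byte int per document (an ordinal into a table of unique terms) plus one String per unique term, which matches my understanding of the order/lookup arrays in FieldCache.StringIndex. The class name SortCacheEstimate and the 40-byte per-String overhead are assumptions made for illustration, not measured values.

```java
// Back-of-envelope estimate of the per-searcher sort cache for a
// string field. Assumption: one 4-byte int per document plus one
// String per unique term.
class SortCacheEstimate {

    // Hypothetical fixed cost per String (object headers etc.);
    // the real figure depends on the JVM.
    static final long STRING_OVERHEAD_BYTES = 40;

    /** Bytes for the per-document ordinal array (one int per doc). */
    static long ordArrayBytes(long maxDoc) {
        return maxDoc * 4L;
    }

    /** Bytes for the table of unique term strings (2 bytes per char). */
    static long termTableBytes(long uniqueTerms, int charsPerTerm) {
        return uniqueTerms * (STRING_OVERHEAD_BYTES + 2L * charsPerTerm);
    }

    /** Total estimated bytes for one index. */
    static long totalBytes(long maxDoc, long uniqueTerms, int charsPerTerm) {
        return ordArrayBytes(maxDoc) + termTableBytes(uniqueTerms, charsPerTerm);
    }

    public static void main(String[] args) {
        long maxDoc = 30_000_000L;   // docs in one index (figure from the thread)
        long uniqueTerms = 4 * 365;  // one term per day over 4 years
        int charsPerTerm = 10;       // e.g. "2008-10-10"

        long perIndex = totalBytes(maxDoc, uniqueTerms, charsPerTerm);
        System.out.printf("per index:   %.1f MB%n", perIndex / 1e6);
        System.out.printf("six indexes: %.1f MB%n", 6 * perIndex / 1e6);
    }
}
```

Under these assumptions the per-document array dominates and the term table is negligible. An int-typed sort would presumably still need one 4-byte value per document, which is consistent with the point above that changing the field type gains little.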

I guess we should consider reducing the index size or at least only allow
sorted search on a subset of the index (or a pruned version of the
index...) ? Would that be better for us?
But then again, I assume that there are much larger lucene-based indexes
out there than ours, and you guys must have some solution to this issue,
right? :)

best regards,
  Aleksander


On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby
<aleksander.stensby@integrasco.no> wrote:

> Hello, I've read a lot of threads now on memory consumption and sorting,
> and I think I have a pretty good understanding of how things work, but I
> could still use some input here...
>
> We currently have a system consisting of 6 different lucene indexes (all
> have the same structure, so you could say it is a form of sharding). We
> currently use this approach because we want to be able to give users
> access to different indexes (but not necessarily all indexes) etc.
>
> (We are planning to move to a solr-based system, but for now we would
> like to solve this issue with our current lucene-based system.)
>
> The thing is, the indexes are rather big (ranging from 5G to 20G per
> index and 10 - 30 million entries per index.)
> We keep one searcher object open per index, and when the index is
> changed (new documents added in batches several times a day), we update
> the searcher objects.
> In the warmup procedure we did a couple of searches and things work
> fine, BUT I realized that in our application we return hits sorted by
> date by default, and our warmup procedure did non-sorted queries... so
> the first searches done by the user after an update were still slow
> (obviously).
>
> To cope, I changed the warmup procedure to include a sorted search, and
> now the user will not notice slow queries. Good!
> But, the problem at hand is that we are running into memory problems
> (and I understand that sorting does consume a lot of memory...) But is
> there any "best practice" way to deal with this? The field we
> sort on is an un_indexed text field representing the date, typically
> "2008-10-10". I am aware that string field sorting consumes a lot of
> memory, so should we change this field to something different? Would
> this help us with the memory problems?
>
> As a side note / curiosity question: Does it matter if we use the search
> method returning Hits versus the search method returning TopFieldDocs?
> (we are not iterating them in any way when this memory issue occurs)
>
> Thanks in advance for any guidance we may get.
>
> Best regards,
>   Aleksander M. Stensby
>
>
>


-- 
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
aleksander.stensby@integrasco.no

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

