Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of jzillmann@googlemail.com
 designates 209.85.161.48 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=content-type:mime-version:subject:from:in-reply-to:date
         :content-transfer-encoding:message-id:references:to:x-mailer;
        b=JpfDeWrVzH+ocsi12K7bG6V3Nv05/j/uFur2EYRIwAygxo+vdfOAg3O1bOSyRTPIpt
         MR6dMspI1tD5gaDkV9pyAlNajrNBknb/gSM8dsXQAV5UdkososwvRstOAlZBUJX4Wumy
         E/hZBCC0kc8i+NR9QXyu7mm1XDiaTC9ga9O18=
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Apple Message framework v1084)
Subject: Re: Sharding Techniques
From: Johannes Zillmann <jzillmann@googlemail.com>
In-Reply-To: <BANLkTikoxCjPpLn7WrJ1+5Rq-VXNMD9Pyw@mail.gmail.com>
Date: Tue, 10 May 2011 09:52:13 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <916217D3-977E-4B13-B93B-C98BA3AC110C@googlemail.com>
References: <BANLkTimuytxpYnSRXR0qx6-H1VHxSExzKg@mail.gmail.com>
 <3972B0B283DD481EA213C10A4B51519C@sv.us.sonicwall.com>
 <BANLkTikoxCjPpLn7WrJ1+5Rq-VXNMD9Pyw@mail.gmail.com>
To: java-user@lucene.apache.org


On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote:

> Hi,
> Though we have 30 GB total index, size of the indexes that are used
> in 75%-80% searches is 5 GB. and we have average search time around =
700 ms.
> (yes, we have optimized index).
>=20
> Could someone please throw some light on my original doubt!!!
> If I want to keep smaller indexes on different servers so that CPU and
> memory may be optimized, how can I aggregate the results of a query =
from
> each of the server. One thing I know is RMI which I studied a few =
years
> back, but that was too slow (or i thought so that time). What are =
other
> techniques?

There is also http://katta.sourceforge.net/ out there...

Johannes

>=20
> Is 1 second a bad search time for following?
> total index size: 30 GB
> index size which is being used in 80% searches - 5 GB
> number of fields - 40
> most of the fields being numeric fields.
> one big "contents" field with 500 - 1000 words.
> 3500 queries / second mostly on
> on an average a query uses 7 fields (1 big 6 small) with 25-30 tokens
>=20
> Are there any benchmarks from which I can compare the performance of =
my
> application? Or any approximate formula which can guide me
> calculating (using system parameters and index/search stats) the =
"best"
> expected search time?
>=20
> Thanks in advance
>=20
> On Tue, May 10, 2011 at 9:59 AM, Ganesh <emailgane@yahoo.co.in> wrote:
>=20
>> We are using similar technique as yours. We keep smaller indexes and =
use
>> ParallelMultiSearcher to search across the index. Keeping smaller =
indexes is
>> good as index and index optimzation would be faster.  There will be =
small
>> delay while searching across the indexes.
>>=20
>> 1. What is your search time?
>> 2. Is your index optimized?
>>=20
>> I have a doubt, If we keep the indexes size to 30 GB then each file =
size
>> (fdt, fdx etc) would in GB's. Small addition or deletion to the file =
will
>> not cause more IO as it has to skip those bytes and write it at the =
end of
>> file.
>>=20
>> Regards
>> Ganesh
>>=20
>>=20
>>=20
>> ----- Original Message -----
>> From: "Samarendra Pratap" <samarzone@gmail.com>
>> To: <java-user@lucene.apache.org>
>> Sent: Monday, May 09, 2011 5:26 PM
>> Subject: Sharding Techniques
>>=20
>>=20
>>> Hi list,
>>> We have an index directory of 30 GB which is divided into 3
>> subdirectories
>>> (idx1, idx2, idx3) which are again divided into 21 =
sub-subdirectories
>>> (idx1-1, idx1-2, ...., idx2-1, ...., idx3-1, ...., idx3-21).
>>>=20
>>> We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 =
very
>>> soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
>>>=20
>>> We have almost 40 fields in each index (is it a bad to have so many
>>> fields?). most of them are id based fields.
>>> We are using 8 servers for search, and each of which receives
>> approximately
>>> 3000/hour queries in peak hour and search time of more than 1 second =
is
>>> considered bad (is it really bad?) as per the business requirement.
>>>=20
>>> Since past few months we are experiencing issues (load and search =
time)
>> on
>>> our search servers, due to which I am looking for sharding =
techniques.
>> Can
>>> someone guide or give me pointers where i can read more and test?
>>>=20
>>> Keeping parts of indexes on different servers search on all of them =
and
>> then
>>> merging the results - what could be the best approach?
>>>=20
>>> Let me tell you that most queries use only 6-7 indexes and 4 - 5 =
fields
>> (to
>>> search for) but some queries (searching all the data) require all =
the
>>> indexes and are primary cause of the performance degradation.
>>>=20
>>> Any suggestions/ideas are greatly appreciated. And further more will
>>> sharding (or similar thing) really reduce search time? (load is a =
less
>>> severe issue when compared to search time)
>>>=20
>>>=20
>>> --
>>> Regards,
>>> Samar
>>>=20
>>=20
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>=20
>>=20
>=20
>=20
> --=20
> Regards,
> Samar


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org