Subject: Re: Sharding Techniques
From: Johannes Zillmann
Date: Tue, 10 May 2011 09:52:13 +0200
To: java-user@lucene.apache.org
Message-Id: <916217D3-977E-4B13-B93B-C98BA3AC110C@googlemail.com>
References: <3972B0B283DD481EA213C10A4B51519C@sv.us.sonicwall.com>

On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote:

> Hi,
> Though we have 30 GB total index, size of the indexes that are used
> in 75%-80% searches is 5 GB. and we have average search time around 700 ms.
> (yes, we have optimized index).
>
> Could someone please throw some light on my original doubt!!!
> If I want to keep smaller indexes on different servers so that CPU and
> memory may be optimized, how can I aggregate the results of a query from
> each of the server. One thing I know is RMI which I studied a few years
> back, but that was too slow (or i thought so that time). What are other
> techniques?

There is also http://katta.sourceforge.net/ out there...

Johannes

>
> Is 1 second a bad search time for following?
> total index size: 30 GB
> index size which is being used in 80% searches - 5 GB
> number of fields - 40
> most of the fields being numeric fields.
> one big "contents" field with 500 - 1000 words.
> 3500 queries / second mostly on
> on an average a query uses 7 fields (1 big 6 small) with 25-30 tokens
>
> Are there any benchmarks from which I can compare the performance of my
> application? Or any approximate formula which can guide me
> calculating (using system parameters and index/search stats) the "best"
> expected search time?
>
> Thanks in advance
>
> On Tue, May 10, 2011 at 9:59 AM, Ganesh wrote:
>
>> We are using similar technique as yours. We keep smaller indexes and use
>> ParallelMultiSearcher to search across the index. Keeping smaller indexes is
>> good as index and index optimization would be faster. There will be small
>> delay while searching across the indexes.
>>
>> 1. What is your search time?
>> 2. Is your index optimized?
>>
>> I have a doubt, If we keep the indexes size to 30 GB then each file size
>> (fdt, fdx etc) would be in GBs. Small addition or deletion to the file will
>> not cause more IO as it has to skip those bytes and write it at the end of
>> file.
>>
>> Regards
>> Ganesh
>>
>>
>>
>> ----- Original Message -----
>> From: "Samarendra Pratap"
>> To:
>> Sent: Monday, May 09, 2011 5:26 PM
>> Subject: Sharding Techniques
>>
>>
>>> Hi list,
>>> We have an index directory of 30 GB which is divided into 3 subdirectories
>>> (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
>>> (idx1-1, idx1-2, ...., idx2-1, ...., idx3-1, ...., idx3-21).
>>>
>>> We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
>>> soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
>>>
>>> We have almost 40 fields in each index (is it bad to have so many
>>> fields?). most of them are id based fields.
>>> We are using 8 servers for search, and each of which receives approximately
>>> 3000/hour queries in peak hour and search time of more than 1 second is
>>> considered bad (is it really bad?) as per the business requirement.
>>>
>>> Since past few months we are experiencing issues (load and search time) on
>>> our search servers, due to which I am looking for sharding techniques. Can
>>> someone guide or give me pointers where i can read more and test?
>>>
>>> Keeping parts of indexes on different servers search on all of them and then
>>> merging the results - what could be the best approach?
>>>
>>> Let me tell you that most queries use only 6-7 indexes and 4 - 5 fields (to
>>> search for) but some queries (searching all the data) require all the
>>> indexes and are primary cause of the performance degradation.
>>>
>>> Any suggestions/ideas are greatly appreciated. And further more will
>>> sharding (or similar thing) really reduce search time? (load is a less
>>> severe issue when compared to search time)
>>>
>>>
>>> --
>>> Regards,
>>> Samar
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> --
> Regards,
> Samar
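
On the question of aggregating results from several servers: one common
approach is to ask each shard for its own top k hits and merge those lists
into a global top k on the machine that received the query. The sketch below
illustrates only that merge step in plain Java. ShardMerge, Hit and mergeTopK
are hypothetical names for illustration, not Lucene or Katta APIs, and the
sketch assumes that scores from different shards are comparable, which
generally requires the shards to score with the same term statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper (not a Lucene class): merges per-shard hits into a global top k. */
final class ShardMerge {

    /** One hit as reported by a shard: shard name, shard-local doc id, score. */
    static final class Hit {
        final String shard;
        final int doc;
        final float score;

        Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return shard + ":" + doc + " (" + score + ")";
        }
    }

    /** Orders hits worst first: lower score first; ties broken by shard, then doc id. */
    private static final Comparator<Hit> WORST_FIRST = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            int c = Float.compare(a.score, b.score);
            if (c != 0) return c;
            c = b.shard.compareTo(a.shard);
            if (c != 0) return c;
            return Integer.compare(b.doc, a.doc);
        }
    };

    /** Returns the k best hits across all shards, best first. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        if (k <= 0) {
            return new ArrayList<Hit>();
        }
        // Min-heap of size <= k: the head is always the worst hit kept so far.
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(k, WORST_FIRST);
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                if (heap.size() < k) {
                    heap.add(h);
                } else if (WORST_FIRST.compare(h, heap.peek()) > 0) {
                    heap.poll(); // evict the current worst
                    heap.add(h);
                }
            }
        }
        List<Hit> merged = new ArrayList<Hit>(heap);
        Collections.sort(merged, Collections.reverseOrder(WORST_FIRST));
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample hits standing in for the responses of three shards.
        List<Hit> shardA = Arrays.asList(new Hit("A", 12, 3.2f), new Hit("A", 7, 1.1f));
        List<Hit> shardB = Arrays.asList(new Hit("B", 4, 2.5f), new Hit("B", 9, 0.4f));
        List<Hit> shardC = Arrays.asList(new Hit("C", 1, 2.9f));
        for (Hit h : mergeTopK(Arrays.asList(shardA, shardB, shardC), 3)) {
            System.out.println(h);
        }
    }
}
```

Each shard only needs to return k hits, so the merge cost depends on k and
the number of shards, not on the index size. How the per-shard requests are
sent (RMI, HTTP, or a framework such as Katta) is a separate choice that this
sketch leaves open.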