From: "Murali Krishna. P"
To: user@hbase.apache.org
Date: Mon, 6 Sep 2010 10:32:05 +0530 (IST)
Subject: Re: HBase secondary index performance

Hi,
    My row size is around 300 bytes, with 20 columns in total. I tried the
custom indexing without the write to the WAL. Currently I have only 2 tables:
one for the main table and another for all 20 indexes. My key to the index
table is columnValue+columnName+rowKey.
I am getting around 500 inserts/second now (i.e., a total of ~10K puts/second,
counting the 20 index puts per row). This is probably comparable with your
numbers, given the data size.
    I have some doubts about the HBase write implementation.
* Is this the max that we can achieve with any number of region servers? Why
does adding region servers not improve the write performance?
Is it because, when the data doesn't yet exist in the table, it always writes
to one region?

* Probably writing to an existing, well-distributed table would give better
performance, since the writes will be spread across machines? In that case, if
we have multiple tables (one per index), will it be better during the initial
write itself (since each table has a different region)?

Thanks,
Murali Krishna


________________________________
From: Andrey Stepachev
To: user@hbase.apache.org
Sent: Sun, 5 September, 2010 11:54:45 PM
Subject: Re: HBase secondary index performance

2010/9/5 Murali Krishna. P:
> Hi,
>    Thanks for the detailed explanation. I liked the idea of the timestamp
> check; this will be good enough for us, and I can put in a periodic MR
> cleaner. However, I need some help in understanding the 30K number that
> was claimed.

The real insert rate will depend on the size of the row, the size of the write
buffer, etc.
In the case of a simple row with one long per row, I got 30k requests/second
(as shown in HBase).
With JSON-serialised objects of 100-700 bytes each, with validation, I can
insert 2-6k objects (JSON) per second.

> With the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20
> index columns).
> I understood that there are additional reads that IndexedTable does, but
> the 25x improvement that you got is very impressive. Can you please help
> me to understand this gain? (My hardware is 8GB/7.2k rpm/2core-2GHz)

Did you try to insert data into a non-indexed region (with the
indexedtables extension disabled)?
What numbers did you get?

>
> Thanks,
> Murali Krishna
>
>
> ________________________________
> From: Andrey Stepachev
> To: user@hbase.apache.org
> Sent: Sun, 5 September, 2010 3:53:26 AM
> Subject: Re: HBase secondary index performance
>
> 2010/9/3 Murali Krishna. P:
>
>> * Custom indexing is good, but our data keeps changing every day. So
>> IndexedTable is probably the best option for us.
>
> In the case of custom indexing, you can use timestamps to check that the
> index record is still valid
> (or even simply recheck the existence of the value).
> You also need regular index cleanup (an MR job or some custom application).
>
> To index some row identified by 'key' having 'value', we can create an
> index table where the key is [value:key], and insert a row into it every
> time we insert our values. We will get 30k rows/s/node.
> When we want to find all rows with 'value', we scan
> [value:0000, value:9999] and find all keys that point to rows containing
> the value.
> We scan the index, random-get the rows, recheck that the index entry is
> still valid (check the value or the timestamp; the index timestamp should
> be >= the value timestamp), and return only valid values (maybe we can even
> delete on the fly when we get a negative result, to automatically clean up
> stale data).
>
>
>> * Just added one more regionserver and it did not help. Actually it went
>> back to 60/s for some strange reason (with one client). The requests in
>> the HBase UI are not uniform across the 2 region servers. One server is
>> doing around 2000 and the other 500. Probably once the region gets split
>> and we have lots of data, writes will improve? (Now it is just writing to
>> one region for the main table)
>
> Looks like all the data goes to one region server. Try to make the writes
> more random (maybe you should make the key a random UUID, or use another
> key randomization technique).
>
>> * Is there some way to bulk load the IndexedTable?
>> Earlier I used the bulk loader tool (a MapReduce job which creates the
>> regions offline), but I am not sure whether it works with an indexed
>> table.
>
> Not sure, but you can look at the source code and try to emulate the
> indexing operations in your code after regular bulk loading.
>
>>
>>
>> Thanks,
>> Murali Krishna
>>
>>
>
> Andrey.
>
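
The custom index scheme discussed in this thread (an index key of columnValue+columnName+rowKey, a prefix scan to find candidates, and a recheck of value and timestamp to skip stale entries) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses a small in-memory sorted table instead of HBase, and `SortedTable`, `index_key`, `put_row`, `find_rows`, and the `\x00` separator are hypothetical names chosen for the example, not HBase API.

```python
import bisect

SEP = "\x00"  # assumed separator; must not occur in values, column names, or row keys


class SortedTable:
    """Tiny in-memory stand-in for a table sorted by row key (not HBase)."""

    def __init__(self):
        self._keys = []  # row keys, kept sorted
        self._rows = {}  # row key -> (value, timestamp)

    def put(self, key, value, ts):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = (value, ts)

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)

    def scan_prefix(self, prefix):
        """Return [(key, (value, ts)), ...] for all keys starting with prefix."""
        out = []
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append((self._keys[i], self._rows[self._keys[i]]))
            i += 1
        return out


def index_key(column_value, column_name, row_key):
    # Index key layout from the thread: columnValue + columnName + rowKey.
    return SEP.join((column_value, column_name, row_key))


def put_row(main, index, row_key, columns, ts):
    # One put to the main table and one to the index table per column.
    for name, value in columns.items():
        main.put(row_key + SEP + name, value, ts)
        index.put(index_key(value, name, row_key), "", ts)


def find_rows(main, index, column_name, column_value, cleanup=False):
    # Scan the index by prefix, then recheck each hit against the main table.
    prefix = column_value + SEP + column_name + SEP
    hits = []
    for ikey, (_, index_ts) in index.scan_prefix(prefix):
        row_key = ikey[len(prefix):]
        cell = main.get(row_key + SEP + column_name)
        # Valid only if the value still matches and index ts >= value ts.
        if cell is not None and cell[0] == column_value and index_ts >= cell[1]:
            hits.append(row_key)
        elif cleanup:
            index.delete(ikey)  # optional on-the-fly removal of stale entries
    return hits
```

For example, after writing `row1` and `row2` with `color=red` at timestamp 1 and then overwriting `row1` with `color=blue` at timestamp 2, `find_rows(main, index, "color", "red")` skips the stale `row1` entry. A periodic cleanup job would still be needed for entries that are never scanned.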