From: "Ramkrishna.S.Vasudevan"
Reply-To: user@hbase.apache.org
Subject: RE: Bulk Loads and Updates
Date: Thu, 4 Oct 2012 09:53:09 +0530

Which version of HBase are you using?

As part of HBASE-5564, a feature was introduced to handle duplicate records
in bulk loads by letting the timestamp also be specified in the file, the
same way we specify the column family and table name.
If you can backport it to your version, I hope it will be helpful.

Regards
Ram

> -----Original Message-----
> From: Eugeny Morozov [mailto:emorozov@griddynamics.com]
> Sent: Thursday, October 04, 2012 2:01 AM
> To: user@hbase.apache.org
> Subject: Re: Bulk Loads and Updates
>
> Hi!
>
> Sure, you can, but don't forget to sort all KV pairs before putting them
> into the context, or else you'd get an "unsorted" exception.
>
> If they are completely identical and you need to reduce the number of
> identical lines, you could use a Combiner, but its behavior is not
> deterministic, so basically there is no guarantee that it will run, or
> how many times.
>
>
> On Thu, Oct 4, 2012 at 12:22 AM, gordoslocos wrote:
>
> > Thank you Paul.
> >
> > I was just thinking that I could add a reducer to the step that
> > prepares the data to build custom logic around having multiple entries
> > which produce the same rowkey. What do you think?
> >
> > Sent from my iPhone
> >
> > On 03/10/2012, at 17:12, Paul Mackles wrote:
> >
> > > Keys in HBase are a combination of rowkey/column/timestamp.
> > >
> > > Two records with the same rowkey but different column will result in two
> > > different cells with the same rowkey, which is probably what you expect.
> > >
> > > For two records with the same rowkey and same column, the timestamp will
> > > normally differentiate them, but in the case of a bulk load the timestamp
> > > could be the same, so it may actually be a tie and both will be stored.
> > > There are no updates in bulk loads.
> > >
> > > All 20 versions will get loaded, but the 10 oldest will be deleted during
> > > the next major compaction.
> > >
> > > I would definitely recommend setting up small-scale tests for all of the
> > > above scenarios to confirm.
> > >
> > > On 10/3/12 3:35 PM, "Juan P." wrote:
> > >
> > >> Hi guys,
> > >> I've been reading up on bulk load using MapReduce jobs and I wanted to
> > >> validate something.
> > >>
> > >> If the input I wanted to load into HBase produced the same key for
> > >> several lines, how will HBase handle that?
> > >>
> > >> I understand the MapReduce job will create StoreFiles which the region
> > >> servers just pick up and make available to the users. But is there a
> > >> validation to treat the first as insert and the rest as updates?
> > >>
> > >> What about the limit on the number of versions of a key HBase can have?
> > >> If I want to have 10 versions, but the bulk load has 20 values for the
> > >> same key, will it only keep the last 10?
> > >>
> > >> Thanks,
> > >> Juan
> > >
> > >
>
>
> --
> Evgeny Morozov
> Developer Grid Dynamics
> Skype: morozov.evgeny
> www.griddynamics.com
> emorozov@griddynamics.com
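
The reducer-side idea discussed in this thread (collapse entries that share a rowkey and column, cap the number of versions, and emit everything in sorted order) can be sketched roughly as below. This is a plain-Python illustration only, not the HBase MapReduce API: the function name prepare_cells, the tuple layout, and the "last record wins" tie-break for identical timestamps are all assumptions made for the example. It also sorts Python strings, whereas HBase compares raw bytes, so treat the ordering as approximate.

```python
from collections import defaultdict


def prepare_cells(records, max_versions=10):
    """Hypothetical helper: dedupe and order records before a bulk load.

    records: iterable of (rowkey, column, timestamp, value) tuples.
    Returns a list sorted by rowkey, then column, then timestamp
    descending (newest version first), with at most max_versions
    entries per (rowkey, column).
    """
    by_cell = defaultdict(dict)
    for rowkey, column, ts, value in records:
        # Same rowkey/column/timestamp is a tie. Here the later record in
        # the input simply overwrites the earlier one; this tie-break is an
        # illustrative assumption, not something HBase does for you.
        by_cell[(rowkey, column)][ts] = value

    out = []
    for rowkey, column in sorted(by_cell):
        versions = by_cell[(rowkey, column)]
        # Keep only the newest max_versions timestamps for this cell.
        for ts in sorted(versions, reverse=True)[:max_versions]:
            out.append((rowkey, column, ts, versions[ts]))
    return out


# 20 versions of one cell, one exact-timestamp duplicate, one other row.
records = [("row1", "cf:a", ts, "v%d" % ts) for ts in range(20)]
records.append(("row1", "cf:a", 19, "dup"))
records.append(("row0", "cf:b", 5, "x"))
cells = prepare_cells(records)
```

Doing the trimming up front means the files never carry the extra versions that would otherwise sit there until the next major compaction. As suggested earlier in the thread, it is still worth confirming the behavior with a small-scale test against a real cluster.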