Subject: Re: HBase ImportTsv performance (slow import)
From: Anoop John <anoop.hbase@gmail.com>
To: user@hbase.apache.org
Date: Wed, 24 Oct 2012 12:01:02 +0530

I think, as per your explanation of the need for the unique id, it is
okay. No need to worry about data loss. As long as you can make sure you
generate a unique id, things are fine. MR will make sure the job runs
over the whole data and that the output is persisted in files. Yes,
those files are HFiles. Then, finally, the HBase cluster is used for
loading the HFiles into the region stores. Bulk loading huge data this
way will be much, much faster than normal put()s.

-Anoop-
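For illustration, the final loading step might look roughly like this
against the 0.94-era client API. This is only a sketch: the class name,
table name, and HFile directory are placeholders, not taken from this
thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table
    // Move the MR job's HFiles (one subdirectory per column family)
    // into the table's region stores. No WAL is written in this step;
    // the files are adopted by the RegionServers as-is.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/hadoop/bulk-out"), table);
    table.close();
  }
}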
On Wed, Oct 24, 2012 at 11:44 AM, anil gupta wrote:

> Anoop: The only thing is that some mappers may crash, so the MR
> framework will run that mapper again on the same data set. Then will
> the unique id be different?
>
> Anil: Yes, for the same data set the unique id will also be different.
> The unique id does not depend on the data.
>
> Thanks,
> Anil Gupta
>
> On Tue, Oct 23, 2012 at 11:07 PM, Anoop John wrote:
>
> > > Is there a way that I can explicitly turn on WAL for bulk loading?
> >
> > No. How do you generate the unique id? Remember that the initial
> > steps won't need the HBase cluster at all. MR generates the HFiles,
> > and the output will be in files only; the mappers also write their
> > output to files. The only thing is that some mappers may crash, so
> > the MR framework will run that mapper again on the same data set.
> > Then will the unique id be different? I think you need not worry
> > about data loss from the HBase side, so the WAL is not required.
> >
> > -Anoop-
> >
> > On Wed, Oct 24, 2012 at 10:58 AM, anil gupta wrote:
> >
> > > That's a very interesting fact. You made it clear, but my custom
> > > bulk loader generates a unique id for every row in the map phase.
> > > So, not all of my data is in the csv or text file. Is there a way
> > > that I can explicitly turn on WAL for bulk loading?
> > >
> > > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John wrote:
> > >
> > > > Hi Anil,
> > > > In case of bulk loading, it is not as if data is put into HBase
> > > > row by row. The MR job will create output in the form of HFiles:
> > > > it will create the KVs and write them to files in order, exactly
> > > > as an HFile is laid out. Then the files are loaded into HBase at
> > > > the end. Only for this final step is the HBase RS used, so there
> > > > is no point in a WAL there. Am I making it clear for you? The
> > > > data is already present in raw form in some txt or csv file. :)
> > > >
> > > > -Anoop-
> > > >
> > > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John wrote:
> > > >
> > > > > Hi Anil
> > > > >
> > > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta
> > > > > <anilgupta84@gmail.com> wrote:
> > > > >
> > > > > > Hi Anoop,
> > > > > >
> > > > > > As per your last email, did you mean that the WAL is not used
> > > > > > while using the HBase bulk loader? If yes, then how do we
> > > > > > ensure "no data loss" in case of a RegionServer failure?
> > > > > >
> > > > > > Thanks,
> > > > > > Anil Gupta
> > > > > >
> > > > > > On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan
> > > > > > <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > >
> > > > > > > As Kevin suggested, we can make use of a load that goes
> > > > > > > through the WAL and the memstore. Or the second option is
> > > > > > > to use the output of the mappers to create HFiles directly.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Ram
> > > > > > >
> > > > > > > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John
> > > > > > > <anoop.hbase@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > Using the ImportTsv tool, you are trying to bulk load
> > > > > > > > your data. Can you check how many mappers and reducers
> > > > > > > > there were, and, out of the total time, how much was
> > > > > > > > taken by the mapper phase and by the reducer phase? This
> > > > > > > > seems like an MR-related issue (maybe some conf issue).
> > > > > > > > In this bulk load case, most of the work is done by the
> > > > > > > > MR job. It will read the raw data, convert it into Puts,
> > > > > > > > and write HFiles; the MR output is the HFiles themselves.
> > > > > > > > The next part of ImportTsv will just put the HFiles under
> > > > > > > > the table region stores. There won't be WAL usage in this
> > > > > > > > bulk load.
> > > > > > > >
> > > > > > > > -Anoop-
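To make the map phase concrete, here is a sketch of the kind of custom
mapper being discussed, against 0.94-era APIs. The class, family, and
qualifier names are hypothetical; the random UUID illustrates why a
re-executed map task would emit different row keys.

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UniqueIdMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("f"); // hypothetical

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Non-deterministic id: a re-run of a failed task emits different
    // row keys. As discussed above, that is harmless for correctness,
    // because a failed task's output is discarded and regenerated.
    byte[] row = Bytes.toBytes(UUID.randomUUID().toString());
    Put put = new Put(row);
    put.add(FAMILY, Bytes.toBytes("raw"), Bytes.toBytes(line.toString()));
    context.write(new ImmutableBytesWritable(row), put);
  }
}

If identical row keys across re-runs ever mattered, the id could instead
be derived deterministically, for example from the input file name plus
the byte offset of each line.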
> > > > > > > > On Tue, Oct 23, 2012 at 9:18 PM, Nick maillard
> > > > > > > > <nicolas.maillard@fifty-five.com> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I'm starting with HBase and testing for our needs. I
> > > > > > > > > have set up a Hadoop cluster of three machines, and an
> > > > > > > > > HBase cluster on top of the same three machines: one
> > > > > > > > > master, two slaves.
> > > > > > > > >
> > > > > > > > > I am testing the import of a 5 GB csv file with the
> > > > > > > > > importTsv tool. I put the file in HDFS and use the
> > > > > > > > > importTsv tool to import it into HBase.
> > > > > > > > >
> > > > > > > > > Right now it takes a little over an hour to complete.
> > > > > > > > > It creates around 2 million entries in one table with a
> > > > > > > > > single family. If I use bulk uploading, it goes down to
> > > > > > > > > 20 minutes.
> > > > > > > > >
> > > > > > > > > My Hadoop job has 21 map tasks, but they all seem to be
> > > > > > > > > taking a very long time to finish; many tasks end up
> > > > > > > > > timing out.
> > > > > > > > >
> > > > > > > > > I am wondering what I have missed in my configuration.
> > > > > > > > > I have followed the different prerequisites in the
> > > > > > > > > documentation, but I am really unsure as to what is
> > > > > > > > > causing this slowdown. If I apply the wordcount example
> > > > > > > > > to the same file, it takes only minutes to complete, so
> > > > > > > > > I am guessing the issue lies in my HBase configuration.
> > > > > > > > >
> > > > > > > > > Any help or pointers would be appreciated.
>
> --
> Thanks & Regards,
> Anil Gupta
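For completeness, the HFile-producing job described in this thread might
be wired up roughly as below, again against 0.94-era APIs. It reuses the
UniqueIdMapper sketched earlier; the job name, table name, and paths are
placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvToHFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "csv-to-hfiles");
    job.setJarByClass(CsvToHFilesDriver.class);
    job.setMapperClass(UniqueIdMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/data/input.csv"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/bulk-out"));
    HTable table = new HTable(conf, "mytable"); // hypothetical table
    // Installs the HFile output format, a partitioner aligned with the
    // table's region boundaries, and a sorting reducer, so the job
    // emits region-aligned HFiles instead of doing live put()s.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}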