Subject: Re: Hbase import Tsv performance (slow import)
From: ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com>
To: user@hbase.apache.org
Date: Wed, 24 Oct 2012 11:22:06 +0530

Anil,

When you do ImportTsv, only the data present in the TSV file is parsed and
loaded into HBase. How are you planning to generate the unique ID? It sounds
like your data is in a TSV file but the unique ID you need is not part of it,
and you need the rows to reach HBase through the WAL.

I would suggest first loading the existing TSV file into one HTable. Then,
from that table, you can load into another table using your custom mapper;
there you can apply the logic of generating a unique ID for every row read
from the loaded table, and insert the data into the new table through normal
Puts, which go through the WAL and the MemStore.
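Something along these lines, as a minimal, untested sketch against the
0.94-era API. The table names, the scan settings, and the UUID-based row key
are placeholders; substitute your own schema and ID logic:

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class UniqueIdCopyJob {

  // Reads every row of the staging table, builds a new row key from the
  // generated unique id, and emits a normal Put for the target table.
  // TableOutputFormat sends these as ordinary client puts, so they pass
  // through the WAL and the MemStore like any other write.
  static class UniqueIdMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Placeholder id logic: swap in however you want to generate the id.
      byte[] newRow = Bytes.toBytes(UUID.randomUUID().toString());
      Put put = new Put(newRow);
      for (KeyValue kv : value.raw()) {
        put.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
      }
      context.write(new ImmutableBytesWritable(newRow), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "unique-id-copy");
    job.setJarByClass(UniqueIdCopyJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches during the scan
    scan.setCacheBlocks(false);  // keep the MR scan out of the block cache

    TableMapReduceUtil.initTableMapperJob("staging_table", scan,
        UniqueIdMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // Null reducer: the mapper's Puts go straight to the target table.
    TableMapReduceUtil.initTableReducerJob("target_table", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}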
Regards
Ram

On Wed, Oct 24, 2012 at 10:58 AM, anil gupta wrote:

> That's a very interesting fact. You made it clear, but my custom bulk
> loader generates a unique ID for every row in the map phase, so not all of
> my data is in the CSV or text file. Is there a way that I can explicitly
> turn on the WAL for bulk loading?
>
> On Tue, Oct 23, 2012 at 10:14 PM, Anoop John wrote:
>
> > Hi Anil
> >           In case of bulk loading, it is not like the data is put into
> > HBase row by row. The MR job creates output in the HFile format: it
> > writes the KVs to files ordered exactly as an HFile would be, and then
> > the files are loaded into HBase. Only this final step involves the HBase
> > region server, so there is no point at which the WAL applies. Am I
> > making it clear for you? The data is already present in the form of raw
> > data in some txt or csv file :)
> >
> > -Anoop-
> >
> > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John wrote:
> >
> > > Hi Anil
> > >
> > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta wrote:
> > >
> > >> Hi Anoop,
> > >>
> > >> As per your last email, did you mean that the WAL is not used while
> > >> using the HBase bulk loader? If yes, then how do we ensure no data
> > >> loss in case of a RegionServer failure?
> > >>
> > >> Thanks,
> > >> Anil Gupta
> > >>
> > >> On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan <
> > >> ramkrishna.s.vasudevan@gmail.com> wrote:
> > >>
> > >> > As Kevin suggested, we can make use of the load path that goes
> > >> > through the WAL and MemStore. Or, as a second option, use the
> > >> > output of the mappers to create HFiles directly.
> > >> >
> > >> > Regards
> > >> > Ram
> > >> >
> > >> > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John wrote:
> > >> >
> > >> > > Hi
> > >> > >      Using the ImportTsv tool you are bulk loading your data.
> > >> > > Can you see and tell how many mappers and reducers there were?
> > >> > > Out of the total time, how much is taken by the mapper phase and
> > >> > > how much by the reducer phase? This seems like an MR-related
> > >> > > issue (maybe a configuration issue). In this bulk-load case most
> > >> > > of the work is done by the MR job: it reads the raw data,
> > >> > > converts it into Puts, and writes HFiles; the MR output is the
> > >> > > HFiles themselves. The next part of ImportTsv just places the
> > >> > > HFiles under the table's region store. There won't be any WAL
> > >> > > usage in this bulk load.
> > >> > >
> > >> > > -Anoop-
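As a reference point, the two steps described above (an MR job that writes
HFiles, then a step that moves them under the regions) look roughly like the
following untested 0.94-era sketch. This is not what ImportTsv literally
does; the table name, the paths, the column layout, and the toy TSV mapper
are all made-up placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TsvBulkLoad {

  // Toy mapper: treats the first TSV column as the row key and stores the
  // second column under family "f", qualifier "c1".
  static class TsvMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = line.toString().split("\t");
      byte[] row = Bytes.toBytes(cols[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("c1"), Bytes.toBytes(cols[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // placeholder table name

    Job job = new Job(conf, "tsv-bulk-load");
    job.setJarByClass(TsvBulkLoad.class);
    job.setMapperClass(TsvMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/input/data.tsv"));
    Path hfiles = new Path("/tmp/hfiles");
    FileOutputFormat.setOutputPath(job, hfiles);

    // Wires in the partitioner, a sorting reducer, and HFileOutputFormat,
    // so the job's output is region-aligned HFiles rather than per-row
    // puts. No region server is touched during the job itself.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Moves the finished HFiles under the table's region directories.
      // This is the only step that involves the region servers, and it
      // never writes to the WAL or the MemStore.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
    }
  }
}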
> > >> > > On Tue, Oct 23, 2012 at 9:18 PM, Nick maillard <
> > >> > > nicolas.maillard@fifty-five.com> wrote:
> > >> > >
> > >> > > > Hi everyone
> > >> > > >
> > >> > > > I'm starting with HBase and testing it for our needs. I have
> > >> > > > set up a Hadoop cluster of three machines, and an HBase
> > >> > > > cluster on top of the same three machines: one master, two
> > >> > > > slaves.
> > >> > > >
> > >> > > > I am testing the import of a 5 GB CSV file with the importTsv
> > >> > > > tool. I put the file into HDFS and use the importTsv tool to
> > >> > > > import it into HBase.
> > >> > > >
> > >> > > > Right now it takes a little over an hour to complete and
> > >> > > > creates around 2 million entries in one table with a single
> > >> > > > family. If I use bulk uploading, it goes down to 20 minutes.
> > >> > > >
> > >> > > > My Hadoop has 21 map tasks, but they all seem to take a very
> > >> > > > long time to finish, and many tasks end up timing out.
> > >> > > >
> > >> > > > I am wondering what I have missed in my configuration. I have
> > >> > > > followed the different prerequisites in the documentation, but
> > >> > > > I am really unsure what is causing this slowdown. If I apply
> > >> > > > the wordcount example to the same file, it takes only minutes
> > >> > > > to complete, so I am guessing the issue lies in my HBase
> > >> > > > configuration.
> > >> > > >
> > >> > > > Any help or pointers would be appreciated.
> > >>
> > >> --
> > >> Thanks & Regards,
> > >> Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta