Subject: Re: HBase ImportTsv performance (slow import)
From: Anoop John <anoop.hbase@gmail.com>
To: user@hbase.apache.org
Date: Wed, 24 Oct 2012 12:01:02 +0530

I think, as per your explanation of the need for the unique id, it is
okay. No need to worry about data loss. As long as you can make sure you
generate a unique id, things are fine. MR will make sure the job runs
over the whole data and that the output is persisted in files. Yes,
those files are HFiles. Then, finally, the HBase cluster is used for
loading the HFiles into the region stores. Bulk loading huge data this
way will be much, much faster than normal put()s.

-Anoop-
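For illustration, the final loading step might look roughly like this
against the 0.94-era client API. This is only a sketch: the class name,
table name, and HFile directory are placeholders, not taken from this
thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // hypothetical table
    // Move the MR job's HFiles (one subdirectory per column family)
    // into the table's region stores. No WAL is written in this step;
    // the files are adopted by the RegionServers as-is.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/hadoop/bulk-out"), table);
    table.close();
  }
}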
On Wed, Oct 24, 2012 at 11:44 AM, anil gupta wrote:

> Anoop: The only thing is that some mappers may crash, so the MR
> framework will run that mapper again on the same data set. Then will
> the unique id be different?
>
> Anil: Yes, for the same data set the unique id will also be different.
> The unique id does not depend on the data.
>
> Thanks,
> Anil Gupta
>
> On Tue, Oct 23, 2012 at 11:07 PM, Anoop John wrote:
>
> > > Is there a way that I can explicitly turn on WAL for bulk loading?
> >
> > No. How do you generate the unique id? Remember that the initial
> > steps won't need the HBase cluster at all. MR generates the HFiles,
> > and the output will be in files only; the mappers also write their
> > output to files. The only thing is that some mappers may crash, so
> > the MR framework will run that mapper again on the same data set.
> > Then will the unique id be different? I think you need not worry
> > about data loss from the HBase side, so the WAL is not required.
> >
> > -Anoop-
> >
> > On Wed, Oct 24, 2012 at 10:58 AM, anil gupta wrote:
> >
> > > That's a very interesting fact. You made it clear, but my custom
> > > bulk loader generates a unique id for every row in the map phase.
> > > So, not all of my data is in the csv or text file. Is there a way
> > > that I can explicitly turn on WAL for bulk loading?
> > >
> > > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John wrote:
> > >
> > > > Hi Anil,
> > > > In case of bulk loading, it is not as if data is put into HBase
> > > > row by row. The MR job will create output in the form of HFiles:
> > > > it will create the KVs and write them to files in order, exactly
> > > > as an HFile is laid out. Then the files are loaded into HBase at
> > > > the end. Only for this final step is the HBase RS used, so there
> > > > is no point in a WAL there. Am I making it clear for you? The
> > > > data is already present in raw form in some txt or csv file. :)
> > > >
> > > > -Anoop-
> > > >
> > > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John wrote:
> > > >
> > > > > Hi Anil
> > > > >
> > > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta
> > > > > <anilgupta84@gmail.com> wrote:
> > > > >
> > > > > > Hi Anoop,
> > > > > >
> > > > > > As per your last email, did you mean that the WAL is not used
> > > > > > while using the HBase bulk loader? If yes, then how do we
> > > > > > ensure "no data loss" in case of a RegionServer failure?
> > > > > >
> > > > > > Thanks,
> > > > > > Anil Gupta
> > > > > >
> > > > > > On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan
> > > > > > <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > >
> > > > > > > As Kevin suggested, we can make use of a load that goes
> > > > > > > through the WAL and the memstore. Or the second option is
> > > > > > > to use the output of the mappers to create HFiles directly.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Ram
> > > > > > >
> > > > > > > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John
> > > > > > > <anoop.hbase@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > Using the ImportTsv tool, you are trying to bulk load
> > > > > > > > your data. Can you check how many mappers and reducers
> > > > > > > > there were, and, out of the total time, how much was
> > > > > > > > taken by the mapper phase and by the reducer phase? This
> > > > > > > > seems like an MR-related issue (maybe some conf issue).
> > > > > > > > In this bulk load case, most of the work is done by the
> > > > > > > > MR job. It will read the raw data, convert it into Puts,
> > > > > > > > and write HFiles; the MR output is the HFiles themselves.
> > > > > > > > The next part of ImportTsv will just put the HFiles under
> > > > > > > > the table region stores. There won't be WAL usage in this
> > > > > > > > bulk load.
> > > > > > > >
> > > > > > > > -Anoop-
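To make the map phase concrete, here is a sketch of the kind of custom
mapper being discussed, against 0.94-era APIs. The class, family, and
qualifier names are hypothetical; the random UUID illustrates why a
re-executed map task would emit different row keys.

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UniqueIdMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("f"); // hypothetical

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Non-deterministic id: a re-run of a failed task emits different
    // row keys. As discussed above, that is harmless for correctness,
    // because a failed task's output is discarded and regenerated.
    byte[] row = Bytes.toBytes(UUID.randomUUID().toString());
    Put put = new Put(row);
    put.add(FAMILY, Bytes.toBytes("raw"), Bytes.toBytes(line.toString()));
    context.write(new ImmutableBytesWritable(row), put);
  }
}

If identical row keys across re-runs ever mattered, the id could instead
be derived deterministically, for example from the input file name plus
the byte offset of each line.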
> > > > > > > > On Tue, Oct 23, 2012 at 9:18 PM, Nick maillard
> > > > > > > > <nicolas.maillard@fifty-five.com> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I'm starting with HBase and testing for our needs. I
> > > > > > > > > have set up a Hadoop cluster of three machines, and an
> > > > > > > > > HBase cluster on top of the same three machines: one
> > > > > > > > > master, two slaves.
> > > > > > > > >
> > > > > > > > > I am testing the import of a 5 GB csv file with the
> > > > > > > > > importTsv tool. I put the file in HDFS and use the
> > > > > > > > > importTsv tool to import it into HBase.
> > > > > > > > >
> > > > > > > > > Right now it takes a little over an hour to complete.
> > > > > > > > > It creates around 2 million entries in one table with a
> > > > > > > > > single family. If I use bulk uploading, it goes down to
> > > > > > > > > 20 minutes.
> > > > > > > > >
> > > > > > > > > My Hadoop job has 21 map tasks, but they all seem to be
> > > > > > > > > taking a very long time to finish; many tasks end up
> > > > > > > > > timing out.
> > > > > > > > >
> > > > > > > > > I am wondering what I have missed in my configuration.
> > > > > > > > > I have followed the different prerequisites in the
> > > > > > > > > documentation, but I am really unsure as to what is
> > > > > > > > > causing this slowdown. If I apply the wordcount example
> > > > > > > > > to the same file, it takes only minutes to complete, so
> > > > > > > > > I am guessing the issue lies in my HBase configuration.
> > > > > > > > >
> > > > > > > > > Any help or pointers would be appreciated.
>
> --
> Thanks & Regards,
> Anil Gupta
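For completeness, the HFile-producing job described in this thread might
be wired up roughly as below, again against 0.94-era APIs. It reuses the
UniqueIdMapper sketched earlier; the job name, table name, and paths are
placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvToHFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "csv-to-hfiles");
    job.setJarByClass(CsvToHFilesDriver.class);
    job.setMapperClass(UniqueIdMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/data/input.csv"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/bulk-out"));
    HTable table = new HTable(conf, "mytable"); // hypothetical table
    // Installs the HFile output format, a partitioner aligned with the
    // table's region boundaries, and a sorting reducer, so the job
    // emits region-aligned HFiles instead of doing live put()s.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}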