Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of anoop.hbase@gmail.com
 designates 209.85.128.50 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFHL1WbOdw3QuRyqQH32mQT=4kUD5O=QDtP=D1u4zhxKcm2iwA@mail.gmail.com>
References: 
 <CAFHL1WbW9gdoJJSGMh3LU-zG9xqvmZeJ4c4CR2_60sC4Q8Ng1w@mail.gmail.com>
	<1B331809-0487-403C-AAE1-7A635DECB230@gmail.com>
	<CAFHL1WZpmXyjcziTp0f8Bz0Sa078=Nad=CGhGDgKzCqyEV2uag@mail.gmail.com>
	<CALte62x6VE9xZA6Z1JPWNHh1NNBWDMnsijMboDd+U3xoZ-grxQ@mail.gmail.com>
	<CAFHL1WbOdw3QuRyqQH32mQT=4kUD5O=QDtP=D1u4zhxKcm2iwA@mail.gmail.com>
Date: Sat, 24 Aug 2013 10:25:29 +0530
Message-ID: 
 <CAOtJ30rkY_n2G14j5aa2MX8BC6UK2m9ca7Pqpp_32YeRVzhPoQ@mail.gmail.com>
Subject: Re: best approach for write and immediate read use case
From: Anoop John <anoop.hbase@gmail.com>
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=047d7b621fb211636f04e4aa5532

--047d7b621fb211636f04e4aa5532
Content-Type: text/plain; charset=ISO-8859-1

> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?

With MR you have two options for insertion. The first creates the HFiles
directly as the job output (using HFileOutputFormat). The memstore does not
come into the picture here; the generated files then have to be bulk loaded
into the table, as in your LoadIncrementalHFiles snippet. The second makes
calls to HTable#put() from the mappers (these are map-only jobs), so the
writes go through the memstore. When you run the ImportTsv tool with
"importtsv.bulk.output" set, it takes the first route. Have a look at the
ImportTsv tool documentation for details.
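
For the put() route, a quick sanity check on whether a batch fits in the
memstore is to redo Ted's estimate from the quoted mail below. This is only a
rough sketch: the class and method names are made up for illustration, "3000M"
is taken as 3000 * 10^6 bytes, and the 60 bytes per record ignores the
per-KeyValue overhead HBase adds, so the real capacity will be lower.

```java
public class MemstoreEstimate {

    /**
     * Rough number of records that fit in the global memstore budget of one
     * region server: heap * fraction / record size, rounded to the nearest
     * whole record. Hypothetical helper, not part of any HBase API.
     */
    public static long recordsPerRegionServer(long heapBytes,
                                              double memstoreFraction,
                                              int recordBytes) {
        return Math.round(heapBytes * memstoreFraction / recordBytes);
    }

    public static void main(String[] args) {
        long heapBytes = 3000L * 1000 * 1000; // "3000M" from the thread (assumed decimal MB)
        double lowerLimit = 0.35;             // hbase.regionserver.global.memstore.lowerLimit in 0.94
        int recordBytes = 60;                 // 20-byte key + 40-byte value, overhead ignored

        // Prints the estimated record count per region server.
        System.out.println(recordsPerRegionServer(heapBytes, lowerLimit, recordBytes));
    }
}
```

With the figures from the thread this comes out to about 17.5 million records
per region server, which is the same number Ted arrived at.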

-Anoop-

On Sat, Aug 24, 2013 at 4:10 AM, Gautam Borah <gautam.borah@gmail.com> wrote:

> Thanks Ted for your response, and clarifying the behavior for using HTable
> interface.
>
> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?
>
> Thanks,
> Gautam
>
>
> On Fri, Aug 23, 2013 at 2:43 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Assuming you are using 0.94, the default value
> > for hbase.regionserver.global.memstore.lowerLimit is 0.35
> >
> > Meaning, memstore on each region server would be able to hold 3000M * 0.35
> > / 60 = 17.5 mil records (roughly).
> >
> > bq. If I use HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > Yes.
> >
> > Cheers
> >
> >
> > On Fri, Aug 23, 2013 at 12:01 PM, Gautam Borah <gautam.borah@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Average size of my records is 60 bytes - 20 bytes Key and 40 bytes value,
> > > table has one column family.
> > >
> > > I have setup a cluster for testing - 1 master and 3 region servers. Each
> > > have a heap size of 3 GB, single cpu.
> > >
> > > I have pre-split the table into 30 regions. I do not have to keep data
> > > forever, I could purge older records periodically.
> > >
> > > Thanks,
> > >
> > > Gautam
> > >
> > >
> > >
> > > On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > Can you tell us the average size of your records and how much heap is
> > > > given to the region servers ?
> > > >
> > > > Thanks
> > > >
> > > > On Aug 23, 2013, at 12:11 AM, Gautam Borah <gautam.borah@gmail.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I have an use case where I need to write 1 million to 10 million records
> > > > > periodically (with intervals of 1 minutes to 10 minutes), into an HBase
> > > > > table.
> > > > >
> > > > > Once the insert is completed, these records are queried immediately from
> > > > > another program - multiple reads.
> > > > >
> > > > > So, this is one massive write followed by many reads.
> > > > >
> > > > > I have two approaches to insert these records into the HBase table -
> > > > >
> > > > > Use HTable or HTableMultiplexer to stream the data to HBase table.
> > > > >
> > > > > or
> > > > >
> > > > > Write the data to HDFS store as a sequence file (avro in my case) - run
> > > > > map reduce job using HFileOutputFormat and then load the output files into
> > > > > HBase cluster.
> > > > > Something like,
> > > > >
> > > > >  LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > > > >  loader.doBulkLoad(new Path(outputDir), hTable);
> > > > >
> > > > >
> > > > > In my use case which approach would be better?
> > > > >
> > > > > If I use HTable interface, would the inserted data be in the HBase cache,
> > > > > before flushing to the files, for immediate read queries?
> > > > >
> > > > > If I use map reduce job to insert, would the data be loaded into the HBase
> > > > > cache immediately? or only the output files would be copied to respective
> > > > > hbase table specific directories?
> > > > >
> > > > > So, which approach is better for write and then immediate multiple read
> > > > > operations?
> > > > >
> > > > > Thanks,
> > > > > Gautam
> > > >
> > >
> >
>

--047d7b621fb211636f04e4aa5532--