MIME-Version: 1.0
References: <1B331809-0487-403C-AAE1-7A635DECB230@gmail.com>
Date: Sat, 24 Aug 2013 10:25:29 +0530
Subject: Re: best approach for write and immediate read use case
From: Anoop John
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=047d7b621fb211636f04e4aa5532

--047d7b621fb211636f04e4aa5532
Content-Type: text/plain; charset=ISO-8859-1

> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?

With MapReduce you have two options for insertion.

The first writes HFiles directly as the job output (using HFileOutputFormat). The memstore does not come into the picture here; the records become readable once the generated HFiles have been bulk loaded into the table.

The second calls HTable#put() from the mappers, so the writes go through the memstore. (These are map-only jobs.)

For reference, when you run the ImportTsv tool with "importtsv.bulk.output" set, it takes the first route. Have a look at the ImportTsv tool documentation.

-Anoop-

On Sat, Aug 24, 2013 at 4:10 AM, Gautam Borah wrote:

> Thanks Ted for your response, and clarifying the behavior for using HTable
> interface.
>
> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?
>
> Thanks,
> Gautam
>
>
> On Fri, Aug 23, 2013 at 2:43 PM, Ted Yu wrote:
>
> > Assuming you are using 0.94, the default value
> > for hbase.regionserver.global.memstore.lowerLimit is 0.35
> >
> > Meaning, memstore on each region server would be able to hold
> > 3000M * 0.35 / 60 = 17.5 mil records (roughly).
> >
> > bq.
> > If I use HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > Yes.
> >
> > Cheers
> >
> >
> > On Fri, Aug 23, 2013 at 12:01 PM, Gautam Borah wrote:
> >
> > > Hi,
> > >
> > > Average size of my records is 60 bytes - 20 bytes Key and 40 bytes
> > > value, table has one column family.
> > >
> > > I have setup a cluster for testing - 1 master and 3 region servers.
> > > Each have a heap size of 3 GB, single cpu.
> > >
> > > I have pre-split the table into 30 regions. I do not have to keep data
> > > forever, I could purge older records periodically.
> > >
> > > Thanks,
> > >
> > > Gautam
> > >
> > >
> > >
> > > On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu wrote:
> > >
> > > > Can you tell us the average size of your records and how much heap is
> > > > given to the region servers ?
> > > >
> > > > Thanks
> > > >
> > > > On Aug 23, 2013, at 12:11 AM, Gautam Borah wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I have an use case where I need to write 1 million to 10 million
> > > > > records periodically (with intervals of 1 minutes to 10 minutes),
> > > > > into an HBase table.
> > > > >
> > > > > Once the insert is completed, these records are queried immediately
> > > > > from another program - multiple reads.
> > > > >
> > > > > So, this is one massive write followed by many reads.
> > > > >
> > > > > I have two approaches to insert these records into the HBase table -
> > > > >
> > > > > Use HTable or HTableMultiplexer to stream the data to HBase table.
> > > > >
> > > > > or
> > > > >
> > > > > Write the data to HDFS store as a sequence file (avro in my case) -
> > > > > run map reduce job using HFileOutputFormat and then load the output
> > > > > files into HBase cluster.
> > > > > Something like,
> > > > >
> > > > > LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > > > > loader.doBulkLoad(new Path(outputDir), hTable);
> > > > >
> > > > >
> > > > > In my use case which approach would be better?
> > > > >
> > > > > If I use HTable interface, would the inserted data be in the HBase
> > > > > cache, before flushing to the files, for immediate read queries?
> > > > >
> > > > > If I use map reduce job to insert, would the data be loaded into the
> > > > > HBase cache immediately? or only the output files would be copied to
> > > > > respective hbase table specific directories?
> > > > >
> > > > > So, which approach is better for write and then immediate multiple
> > > > > read operations?
> > > > >
> > > > > Thanks,
> > > > > Gautam
> > > > >

--047d7b621fb211636f04e4aa5532--
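
The memstore estimate quoted in the thread (3000M * 0.35 / 60 = 17.5 mil records) can be reproduced with a few lines of plain Java. This is only a rough sketch under stated assumptions: the class and method names are made up for illustration and are not part of HBase, "3000M" is read as 3000 * 10^6 bytes, and the per-KeyValue overhead that HBase adds in memory is ignored, so the real number of records that fit is likely lower.

```java
// Back-of-the-envelope check of the memstore estimate from the thread.
// Assumptions: names below are hypothetical (not HBase API), "3000M" means
// 3000 * 10^6 bytes, and in-memory KeyValue overhead is ignored.
final class MemstoreEstimate {

    // Heap bytes available to all memstores on one region server,
    // given the global memstore lowerLimit fraction (0.35 in the thread).
    static long memstoreBudgetBytes(long heapBytes, double lowerLimit) {
        // Round so that the binary representation of 0.35 does not cost a byte.
        return Math.round(heapBytes * lowerLimit);
    }

    // Number of fixed-size records that fit in the given budget.
    static long recordsThatFit(long budgetBytes, int recordBytes) {
        return budgetBytes / recordBytes;
    }

    // True if a batch of records fits entirely within the given budget.
    static boolean batchFits(long records, int recordBytes, long budgetBytes) {
        return records * recordBytes <= budgetBytes;
    }

    public static void main(String[] args) {
        long heapBytes = 3000L * 1_000_000L; // "3000M" heap per region server
        int recordBytes = 60;                // 20-byte key + 40-byte value

        long budget = memstoreBudgetBytes(heapBytes, 0.35);
        System.out.println("memstore budget (bytes): " + budget);
        System.out.println("records that fit: " + recordsThatFit(budget, recordBytes));

        // Largest batch mentioned in the thread: 10 million records.
        System.out.println("10M-record batch fits: "
                + batchFits(10_000_000L, recordBytes, budget));
    }
}
```

The last check compares the whole 10 million record batch against a single region server's budget, which is conservative because the table in the thread is pre-split across three region servers.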