Subject: Re: How to prevent major compaction when doing bulk load provisioning?
From: Amit Sela
To: user@hbase.apache.org
Date: Thu, 21 Mar 2013 18:47:33 +0200

Did you try pre-splitting your table before bulk loading?

On Thu, Mar 21, 2013 at 3:29 PM, Nicolas Seyvet wrote:
> Hi,
>
> We are using code similar to
> https://github.com/jrkinley/hbase-bulk-import-example/ to
> benchmark our HBase cluster. We are running a CDH4 installation, and HBase
> is version 0.92.1-cdh4.1.1. The cluster is composed of 12 slaves, 1 master,
> and 1 secondary master.
>
> During the bulk load insert, roughly within 3 hours of the start (~200 GB),
> we notice a large drop in the insert rate. At the same time, there is a
> spike in IO and CPU usage. Connecting to a Region Server (RS), the
> Monitored Tasks section shows that a compaction has started.
>
> I have set hbase.hregion.max.filesize to 107374182400 (100 GB) and disabled
> automatic major compaction (hbase.hregion.majorcompaction is set to 0).
>
> What we are doing is that we have 1000 files of synthetic data (CSV), where
> each row in a file is one row to insert into HBase; each file contains 600K
> rows (or 600K events). Our loader works in the following way:
> 1. Look for a file.
> 2. When a file is found, prepare a job for that file.
> 3. Launch the job.
> 4. Wait for completion.
> 5. Compute the insert rate (number of rows / time).
> 6. Repeat from 1 until there are no more files.
>
> What I understand of the bulk load M/R job is that it produces one HFile
> for each Region.
>
> Questions:
> - How is HStoreFileSize calculated?
> - What do HStoreFileSize, storeFileSize, and hbase.hregion.max.filesize
> have in common?
> - Can the number of HFiles trigger a major compaction?
>
> Thanks for the help. I hope my questions make sense.
>
> /Nicolas
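On the pre-splitting suggestion: splits are passed at table creation time (e.g. HBaseAdmin.createTable(descriptor, splitKeys)). A minimal sketch of computing evenly spaced split keys over a one-byte row-key prefix, assuming uniformly distributed keys; the choice of 12 regions (one per slave) is an illustration, and real split points should follow your actual row-key distribution:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: compute numRegions - 1 evenly spaced split keys over the
// one-byte prefix space 0x00..0xFF. These byte[] keys could be passed
// as the splitKeys argument when creating the table, so bulk-loaded
// HFiles land in pre-existing regions instead of forcing splits.
public class SplitKeys {
    static List<byte[]> evenSplits(int numRegions) {
        List<byte[]> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Boundary at fraction i/numRegions of the 0..255 prefix space.
            int boundary = (i * 256) / numRegions;
            splits.add(new byte[] { (byte) boundary });
        }
        return splits;
    }

    public static void main(String[] args) {
        // Print the split boundaries in hex, one per line.
        for (byte[] k : evenSplits(12)) {
            System.out.printf("%02x%n", k[0]);
        }
    }
}
```

With keys that are not uniformly distributed (e.g. timestamps), evenly spaced prefixes would leave most regions empty, so sampling the input data for split points is the safer route.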
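For reference, the two settings described in the message would look like this in hbase-site.xml (values taken from the message; hbase.hregion.majorcompaction is a time interval in milliseconds, and 0 disables time-based major compactions):

```xml
<!-- hbase-site.xml: settings as described in the message above -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value> <!-- 100 GB: a region splits only past this size -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- disable time-triggered major compactions -->
</property>
```

Note that this only disables the periodic trigger: a minor compaction that ends up selecting all of a store's files is promoted to a major compaction, so accumulating many bulk-loaded HFiles per region can still trigger one.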