Subject: How to prevent major compaction when doing bulk load provisioning?
From: Nicolas Seyvet <nicolas.seyvet@gmail.com>
To: user@hbase.apache.org
Date: Thu, 21 Mar 2013 14:29:47 +0100

Hi,

We are using code similar to https://github.com/jrkinley/hbase-bulk-import-example/ to benchmark our HBase cluster. We are running a CDH4 installation; HBase is version 0.92.1-cdh4.1.1. The cluster is composed of 12 slaves, 1 master, and 1 secondary master.

During the bulk load, roughly 3 hours after the start (~200 GB loaded), we notice a large drop in the insert rate. At the same time there is a spike in I/O and CPU usage. Connecting to a Region Server (RS), the Monitored Tasks section shows that a compaction has started.

I have set hbase.hregion.max.filesize to 107374182400 (100 GB) and disabled automatic major compactions (hbase.hregion.majorcompaction is set to 0).

What we are doing: we have 1000 files of synthetic data (CSV), where each row in a file is one row to insert into HBase, and each file contains 600K rows (600K events). Our loader works in the following way (a condensed sketch of this loop is included at the end of this mail):

1. Look for a file.
2. When a file is found, prepare a job for that file.
3. Launch the job.
4. Wait for completion.
5. Compute the insert rate (number of rows / time).
6. Repeat from 1 until there are no more files.

My understanding is that the bulk load M/R job produces one HFile for each region.

Questions:
- How is HStoreFileSize calculated?
- How do HStoreFileSize, storeFileSize, and hbase.hregion.max.filesize relate to one another?
- Can the number of HFiles trigger a major compaction?

Thanks for any help.
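For reference, here is a condensed sketch of the kind of per-file driver loop described above, following the same HFileOutputFormat / LoadIncrementalHFiles path as the linked example and the HBase 0.92-era API. It is not our actual code: the table name, paths, column family, and the CsvToPutMapper CSV layout are all illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadLoop {

  /** Turns one CSV line into one Put (hypothetical two-field layout: key,value). */
  static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      byte[] row = Bytes.toBytes(f[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(f[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    HTable table = new HTable(conf, "benchmark"); // table name is illustrative

    // Steps 1-6 from above: one M/R job per input file, then bulk load.
    for (FileStatus csv : fs.listStatus(new Path("/data/in"))) {
      Path hfiles = new Path("/data/hfiles", csv.getPath().getName());

      Job job = new Job(conf, "bulk-load " + csv.getPath());
      job.setJarByClass(BulkLoadLoop.class);
      job.setMapperClass(CsvToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, csv.getPath());
      FileOutputFormat.setOutputPath(job, hfiles);

      // Sets up the TotalOrderPartitioner and one reduce task per current
      // region -- which is where "one HFile per region" comes from.
      HFileOutputFormat.configureIncrementalLoad(job, table);

      long start = System.currentTimeMillis();
      if (!job.waitForCompletion(true)) break;
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);

      // Each of our files holds 600K rows, so the rate is 600000 / elapsed.
      double secs = (System.currentTimeMillis() - start) / 1000.0;
      System.out.printf("%s: %.0f rows/s%n", csv.getPath(), 600000 / secs);
    }
    table.close();
  }
}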
I hope my questions make sense.

/Nicolas