Date: Mon, 11 Feb 2013 10:22:05 -0800
Subject: Loader for small files
From: Something Something
To: user@pig.apache.org, hbase-user@hadoop.apache.org

Hello,

We are running into performance issues with Pig/Hadoop because our input
files are small. Everything goes to only 1 Mapper. To get around this, we
are trying to use our own Loader like this:

1) Extend PigStorage:

    public class SmallFileStorage extends PigStorage {

        public SmallFileStorage(String delimiter) {
            super(delimiter);
        }

        @Override
        public InputFormat getInputFormat() {
            return new NLineInputFormat();
        }
    }

2) Add command line argument to the Pig command as follows:

    -Dmapreduce.input.lineinputformat.linespermap=500000

3) Use SmallFileStorage in the Pig script as follows:

    USING com.xxx.yyy.SmallFileStorage ('\t')

But this doesn't seem to work. We still see that everything is going to
one mapper. Before we spend any more time on this, I am wondering if this
is a good approach – OR – if there's a better approach? Please let me know.

Thanks.
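
For reference, here is a rough sketch of the arithmetic behind the linespermap setting in step 2. It assumes NLineInputFormat cuts each input file into splits of at most N lines, with a final partial split for any remainder. The SplitEstimate class and the line counts are hypothetical, made up for illustration. It is not part of Pig or Hadoop, and it says nothing about whether Pig later merges small splits.

```java
// Hypothetical helper for illustration only; not a Pig or Hadoop class.
public class SplitEstimate {

    // Estimated splits for one file, assuming one split per linesPerMap
    // lines plus one partial split for any remaining lines.
    static long splitsFor(long linesInFile, long linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        if (linesInFile <= 0) {
            return 0; // an empty file contributes no splits
        }
        return (linesInFile + linesPerMap - 1) / linesPerMap; // ceiling division
    }

    // Splits are assumed to be computed per file, so the total is a sum.
    static long totalSplits(long[] lineCounts, long linesPerMap) {
        long total = 0;
        for (long lines : lineCounts) {
            total += splitsFor(lines, linesPerMap);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] lineCounts = {120_000L, 80_000L, 1_200_000L}; // made-up line counts
        System.out.println(totalSplits(lineCounts, 500_000L));
    }
}
```

Under that assumption, any file with fewer than 500000 lines would yield a single split, so a smaller value may be needed before small files fan out across several mappers.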