Subject: Re: Loader for small files
From: David LaBarbera
Date: Mon, 11 Feb 2013 13:29:56 -0500
Cc: user@hadoop.apache.org
To: user@pig.apache.org

You could store your data in smaller block sizes. Do something like

hadoop fs -Ddfs.block.size=1048576 -Dfs.local.block.size=1048576 -cp /org-input /small-block-input

You might only need one of those parameters. You can verify the block size with

hadoop fsck /small-block-input

In your pig script, you'll probably need to set pig.maxCombinedSplitSize to something around the block size.

David

On Feb 11, 2013, at 1:24 PM, Something Something wrote:

> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> HBase. Adding 'hadoop' user group.
>
> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> mailinglists19@gmail.com> wrote:
>
>> Hello,
>>
>> We are running into performance issues with Pig/Hadoop because our input
>> files are small. Everything goes to only 1 Mapper. To get around this, we
>> are trying to use our own Loader like this:
>>
>> 1) Extend PigStorage:
>>
>> public class SmallFileStorage extends PigStorage {
>>
>>     public SmallFileStorage(String delimiter) {
>>         super(delimiter);
>>     }
>>
>>     @Override
>>     public InputFormat getInputFormat() {
>>         return new NLineInputFormat();
>>     }
>> }
>>
>> 2) Add command line argument to the Pig command as follows:
>>
>> -Dmapreduce.input.lineinputformat.linespermap=500000
>>
>> 3) Use SmallFileStorage in the Pig script as follows:
>>
>> USING com.xxx.yyy.SmallFileStorage ('\t')
>>
>> But this doesn't seem to work. We still see that everything is going to
>> one mapper.
>> Before we spend any more time on this, I am wondering if this
>> is a good approach - OR - if there's a better approach? Please let me
>> know. Thanks.
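
As a rough illustration of why the reply pairs a smaller block size with pig.maxCombinedSplitSize, the sketch below estimates how many input splits (and so mappers) a single file might produce. It is a simplified model, not Pig's actual split-combination code: the 20 MB file size, the 64 MB default block size, and the greedy packing rule are all assumptions made for the example, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitEstimate {

    /** Sizes of the blocks a file of fileBytes would occupy at the given block size. */
    public static List<Long> blockSizes(long fileBytes, long blockBytes) {
        List<Long> blocks = new ArrayList<>();
        for (long remaining = fileBytes; remaining > 0; remaining -= blockBytes) {
            blocks.add(Math.min(remaining, blockBytes));
        }
        return blocks;
    }

    /**
     * Simplified model of split combination (an assumption, not Pig's real algorithm):
     * consecutive splits are packed together until adding the next one would
     * push the combined size past maxCombinedBytes.
     */
    public static int combinedSplitCount(List<Long> splits, long maxCombinedBytes) {
        int count = 0;
        long current = 0;
        for (long size : splits) {
            if (current > 0 && current + size > maxCombinedBytes) {
                count++;       // close the current combined split
                current = 0;
            }
            current += size;
        }
        return current > 0 ? count + 1 : count;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileBytes = 20 * mb;   // hypothetical small input file

        // Assumed 64 MB blocks: the whole file is one block, so one split.
        System.out.println(combinedSplitCount(blockSizes(fileBytes, 64 * mb), 64 * mb)); // prints 1

        // 1 MB blocks, but the combine limit left at 64 MB: splits are merged back together.
        System.out.println(combinedSplitCount(blockSizes(fileBytes, mb), 64 * mb));      // prints 1

        // 1 MB blocks and a 1 MB combine limit: one split per block.
        System.out.println(combinedSplitCount(blockSizes(fileBytes, mb), mb));           // prints 20
    }
}
```

Under this model, shrinking the block size alone does not help if the combine limit stays large, which is presumably why the limit should be set to about the block size. In a Pig script that would typically look like `set pig.maxCombinedSplitSize 1048576;`.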