Date: Tue, 31 May 2011 02:55:33 -0700
Subject: question about number of map tasks for small file
From: Junxian Yan
To: user@hive.apache.org

Hi guys,

I use Flume to store log files and Hive to query them.

Flume always stores small files with the suffix .seq, and I now have over 35 thousand seq files. Every time I launch a query script, 35 thousand map tasks are created, and it takes a very long time to complete.

I also tried setting CombineHiveInputFormat, but with this option the task seems to run slowly, since the total size of the data folder is over 700M. In my testing environment I only have 3 data nodes. I also tried adding mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it doesn't seem to work: there is always only one map task when CombineHiveInputFormat is set.

Can you please show me a solution that lets me set the number of map tasks freely?

BTW: the Hadoop version is 20 and the Hive version is 0.5.

Richard
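
A rough sketch of the arithmetic behind the question, not a confirmed fix. It assumes that the combined input format caps each split at `mapred.max.split.size` bytes (the property read by Hadoop's CombineFileInputFormat) and that the Hive 0.5 build in use honours it, which is not verified here. The helper names `split_size_for` and `hive_settings` are made up for illustration. Under those assumptions, the map count follows from the split size rather than from mapred.map.tasks, so about 5 map tasks over 700M of input means a cap of roughly 140M per split:

```python
MB = 1024 * 1024


def split_size_for(total_bytes: int, desired_maps: int) -> int:
    """Return a per-split byte cap so that about `desired_maps`
    combined splits cover `total_bytes` of input (ceiling division)."""
    if total_bytes <= 0 or desired_maps <= 0:
        raise ValueError("total_bytes and desired_maps must be positive")
    return (total_bytes + desired_maps - 1) // desired_maps


def hive_settings(total_bytes: int, desired_maps: int) -> list:
    """Build the Hive session settings for the computed cap.

    Assumption: the split-size property is respected by the installed
    Hive/Hadoop versions; the real map count can still differ because
    splits are grouped per node and per rack.
    """
    size = split_size_for(total_bytes, desired_maps)
    return [
        "set hive.input.format="
        "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;",
        f"set mapred.max.split.size={size};",
    ]


if __name__ == "__main__":
    # 700M of input and a target of 5 map tasks, as in the message above.
    for line in hive_settings(700 * MB, 5):
        print(line)
```

The printed lines would go at the top of the query script. If no maximum split size is set, the combined format may pack everything into a single split, which would match the one-map-task behaviour described above.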