Subject: Re: Huge text file for Hadoop Mapreduce
From: Stanley Shi <sshi@gopivotal.com>
To: user@hadoop.apache.org
Date: Wed, 9 Jul 2014 16:15:48 +0800

You can get the Wikipedia data from its website; it's pretty big.

Regards,
Stanley Shi

On Tue, Jul 8, 2014 at 1:35 PM, Du Lam wrote:
> Configuration conf = getConf();
> conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 10000000);
>
> // You can set this to some small value (in bytes) to ensure your file
> // will split across multiple mappers, provided the format is not an
> // unsplittable format like .snappy.
>
> On Tue, Jul 8, 2014 at 7:32 AM, Adaryl "Bob" Wakefield, MBA
> <adaryl.wakefield@hotmail.com> wrote:
>> http://www.cs.cmu.edu/~./enron/
>>
>> Not sure of the uncompressed size, but I'm pretty sure it's over a gig.
>>
>> B.
>>
>> From: navaz
>> Sent: Monday, July 07, 2014 6:22 PM
>> To: user@hadoop.apache.org
>> Subject: Huge text file for Hadoop Mapreduce
>>
>> Hi,
>>
>> I am running the basic word-count MapReduce code. I have downloaded a
>> file, Gettysburg.txt, which is 1486 bytes. I have 3 datanodes and the
>> replication factor is set to 3. The data is copied to all 3 datanodes,
>> but only one map task runs; all other nodes are idle. I think this is
>> because I have only one block of data, so a single task runs. I would
>> like to download a bigger file, say 1 GB, and test the network shuffle
>> performance. Could you please suggest where I can download a huge text
>> file?
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
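Du Lam's tip works because Hadoop's FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), then carves each splittable file into ceil(fileSize / splitSize) splits, one mapper per split. A minimal sketch of that arithmetic (the function name and the 128 MB default block size are illustrative, not Hadoop API):

```python
import math

def num_splits(file_size, max_split_size,
               block_size=128 * 1024 * 1024, min_split_size=1):
    """Approximate FileInputFormat's split count for one splittable file."""
    split_size = max(min_split_size, min(max_split_size, block_size))
    return math.ceil(file_size / split_size)

# The 1486-byte Gettysburg.txt always yields a single split, hence one mapper:
print(num_splits(1486, 10_000_000))     # 1
# A 1 GiB file with maxsize = 10 MB is carved into ~108 splits:
print(num_splits(1 << 30, 10_000_000))  # 108
```

This is why shrinking `mapreduce.input.fileinputformat.split.maxsize` spreads a single large file across many mappers, while a tiny file can never occupy more than one.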
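If no convenient download is at hand, a ~1 GB test file can also be generated locally by repeatedly doubling a small seed text. A minimal sketch (file names, the seed text, and the demo target size are illustrative; note the doubling step reads the whole file into memory, which is fine up to a few GB):

```python
import os

def grow_file(seed_path, out_path, target_bytes):
    """Concatenate out_path with itself until it reaches target_bytes."""
    with open(seed_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(src.read())
    while os.path.getsize(out_path) < target_bytes:
        with open(out_path, "rb") as f:
            data = f.read()          # loads the current file into memory
        with open(out_path, "ab") as f:
            f.write(data)            # append a copy, doubling the size
    return os.path.getsize(out_path)

with open("seed.txt", "w") as f:
    f.write("Four score and seven years ago our fathers brought forth...\n")

# Tiny demo target; use 1 << 30 (1 GiB) for a real shuffle test, then load
# it into HDFS with something like: hdfs dfs -put big.txt /user/<you>/input/
print(grow_file("seed.txt", "big.txt", 4096))
```

Because the content is highly repetitive, this exercises split and shuffle mechanics well, though word-count output will be far smaller than with natural text such as the Wikipedia or Enron corpora suggested above.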