Subject: Re: Huge text file for Hadoop Mapreduce
From: Stanley Shi <sshi@gopivotal.com>
To: user@hadoop.apache.org
Date: Wed, 9 Jul 2014 16:15:48 +0800

You can get the Wikipedia data from its website; it's pretty big.

Regards,
Stanley Shi

On Tue, Jul 8, 2014 at 1:35 PM, Du Lam wrote:
> Configuration conf = getConf();
> conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 10000000);
>
> // You can set this to some small value (in bytes) to ensure your file
> // will split across multiple mappers, provided the format is not an
> // unsplittable format like .snappy.
>
> On Tue, Jul 8, 2014 at 7:32 AM, Adaryl "Bob" Wakefield, MBA
> <adaryl.wakefield@hotmail.com> wrote:
>> http://www.cs.cmu.edu/~./enron/
>>
>> Not sure of the uncompressed size, but I'm pretty sure it's over a gig.
>>
>> B.
>>
>> From: navaz
>> Sent: Monday, July 07, 2014 6:22 PM
>> To: user@hadoop.apache.org
>> Subject: Huge text file for Hadoop Mapreduce
>>
>> Hi,
>>
>> I am running the basic word-count MapReduce code. I have downloaded a
>> file, Gettysburg.txt, which is 1486 bytes. I have 3 datanodes and the
>> replication factor is set to 3. The data is copied to all 3 datanodes,
>> but only one map task runs; all other nodes are idle. I think this is
>> because I have only one block of data, so a single task runs. I would
>> like to download a bigger file, say 1 GB, and test the network shuffle
>> performance. Could you please suggest where I can download a huge text
>> file?
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
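Du Lam's tip works because Hadoop's FileInputFormat computes the split size as max(minSize, min(maxSize, blockSize)), then carves each splittable file into ceil(fileSize / splitSize) splits, one mapper per split. A minimal sketch of that arithmetic (the function name and the 128 MB default block size are illustrative, not Hadoop API):

```python
import math

def num_splits(file_size, max_split_size,
               block_size=128 * 1024 * 1024, min_split_size=1):
    """Approximate FileInputFormat's split count for one splittable file."""
    split_size = max(min_split_size, min(max_split_size, block_size))
    return math.ceil(file_size / split_size)

# The 1486-byte Gettysburg.txt always yields a single split, hence one mapper:
print(num_splits(1486, 10_000_000))     # 1
# A 1 GiB file with maxsize = 10 MB is carved into ~108 splits:
print(num_splits(1 << 30, 10_000_000))  # 108
```

This is why shrinking `mapreduce.input.fileinputformat.split.maxsize` spreads a single large file across many mappers, while a tiny file can never occupy more than one.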
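If no convenient download is at hand, a ~1 GB test file can also be generated locally by repeatedly doubling a small seed text. A minimal sketch (file names, the seed text, and the demo target size are illustrative; note the doubling step reads the whole file into memory, which is fine up to a few GB):

```python
import os

def grow_file(seed_path, out_path, target_bytes):
    """Concatenate out_path with itself until it reaches target_bytes."""
    with open(seed_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(src.read())
    while os.path.getsize(out_path) < target_bytes:
        with open(out_path, "rb") as f:
            data = f.read()          # loads the current file into memory
        with open(out_path, "ab") as f:
            f.write(data)            # append a copy, doubling the size
    return os.path.getsize(out_path)

with open("seed.txt", "w") as f:
    f.write("Four score and seven years ago our fathers brought forth...\n")

# Tiny demo target; use 1 << 30 (1 GiB) for a real shuffle test, then load
# it into HDFS with something like: hdfs dfs -put big.txt /user/<you>/input/
print(grow_file("seed.txt", "big.txt", 4096))
```

Because the content is highly repetitive, this exercises split and shuffle mechanics well, though word-count output will be far smaller than with natural text such as the Wikipedia or Enron corpora suggested above.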