From: "David Parks" <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Subject: RE: Large input files via HTTP
Date: Wed, 24 Oct 2012 11:06:31 +0700

I might very well be overthinking this. But I have a cluster I’m firing up on EC2 that I want to keep utilized. I have some other unrelated jobs that don’t need to wait for the downloads, so I don’t want to fill all the map slots with long downloads. I’d rather the other jobs run in parallel while the downloads are happening.

 

 

From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 1:10 PM
To: user@hadoop.apache.org
Subject: Re: Large input files via HTTP

 

Well, it depends. :-) If the XML cannot be split, then you'd end up with only one map task for the entire set of files. I think it'd make sense to have multiple splits so you can get an even spread of the copies across maps, retry only the failed copies, and not have to manage the scheduling of the downloads yourself.

 

Look at DistCp for some intelligent splitting.

 

What are the constraints that you are working with?

On Mon, Oct 22, 2012 at 5:59 PM, David Parks <davidparks21@yahoo.com> wrote:

Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download), and have that one map task kick off the downloads in multiple threads? That way I can easily manage the most efficient download pattern within the map task, and my output is emitted as key/value pairs straight to the reduce step.
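A rough sketch of that kind of threaded download mapper, assuming the mapper receives the whole XML manifest as its value (e.g. via a whole-file input format); parseUrls() and the /data/incoming destination are hypothetical placeholders, not anything from the thread:

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ManifestDownloadMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text manifest, Context context)
      throws IOException, InterruptedException {
    final Configuration conf = context.getConfiguration();
    List<String> urls = parseUrls(manifest.toString());        // hypothetical XML parsing
    ExecutorService pool = Executors.newFixedThreadPool(5);    // per-host connection limits would go here
    List<Future<String>> results = new ArrayList<Future<String>>();
    for (final String url : urls) {
      results.add(pool.submit(new Callable<String>() {
        public String call() throws IOException {
          // stream the HTTP response straight into HDFS; closes both streams when done
          Path dest = new Path("/data/incoming", new Path(new URL(url).getPath()).getName());
          FileSystem fs = FileSystem.get(conf);
          IOUtils.copyBytes(new URL(url).openStream(), fs.create(dest), 64 * 1024, true);
          return dest.toString();
        }
      }));
    }
    pool.shutdown();
    // emit (source URL, HDFS path) as each download finishes; writes stay on the map thread
    for (int i = 0; i < urls.size(); i++) {
      try {
        context.write(new Text(urls.get(i)), new Text(results.get(i).get()));
      } catch (ExecutionException e) {
        throw new IOException("Download failed: " + urls.get(i), e);
      }
    }
  }

  private List<String> parseUrls(String xml) {
    // placeholder: extract the download URLs from the manifest however your schema requires
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}

One thing to watch with a single long-running mapper like this: it needs to report progress (context.progress()) periodically, or the framework will kill the task once the task timeout expires.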

 

 

From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 7:28 AM
To: user@hadoop.apache.org
Subject: Re: Large input files via HTTP


One possible way is to first create a list of files with <host:port, filePath> tuples. Then use a map-only job to pull each file using NLineInputFormat.
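A minimal sketch of how that might be wired up, assuming a manifest with one "host:port<TAB>filePath" entry per line and a Hadoop version that ships the new-API NLineInputFormat; the class names and the /data/incoming destination are made up for the example:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HttpFetchJob {

  // Each map task receives one manifest line, e.g. "files.example.com:80\t/exports/feed1.xml".
  public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String url = "http://" + parts[0] + parts[1];
      Path dest = new Path("/data/incoming", new Path(parts[1]).getName());
      FileSystem fs = FileSystem.get(context.getConfiguration());
      InputStream in = new URL(url).openStream();
      IOUtils.copyBytes(in, fs.create(dest), 64 * 1024, true);   // copy and close both streams
      context.write(new Text(url), new Text(dest.toString()));   // record where the file landed
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);             // one manifest line per map task
    NLineInputFormat.addInputPath(job, new Path(args[0]));    // the manifest file
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                                 // map-only copy job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A failed download then only re-runs that one split, and the scheduler spreads the copies across the cluster rather than piling them into a single task.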


Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job.
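An untested skeleton of that idea, with one split per URL so each map task streams one remote file; the class names and the httpinput.urls configuration key are invented for the example:

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpInputFormat extends InputFormat<LongWritable, Text> {

  public static final String URLS_KEY = "httpinput.urls";    // comma-separated URLs (invented key)

  // One split per URL, so each map task streams exactly one remote file.
  public static class HttpSplit extends InputSplit implements Writable {
    private String url;
    public HttpSplit() {}                                     // required for deserialization
    public HttpSplit(String url) { this.url = url; }
    public String getUrl() { return url; }
    @Override public long getLength() { return 0; }           // size unknown up front
    @Override public String[] getLocations() { return new String[0]; }  // no data locality
    @Override public void write(DataOutput out) throws IOException { out.writeUTF(url); }
    @Override public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String url : context.getConfiguration().get(URLS_KEY).split(",")) {
      splits.add(new HttpSplit(url.trim()));
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new HttpRecordReader();
  }

  // Streams the URL and hands each line to the mapper as it arrives.
  public static class HttpRecordReader extends RecordReader<LongWritable, Text> {
    private BufferedReader reader;
    private long lineNo = 0;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      URL url = new URL(((HttpSplit) split).getUrl());
      reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      key.set(lineNo++);
      value.set(line);
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; }     // total size unknown over HTTP
    @Override public void close() throws IOException { if (reader != null) reader.close(); }
  }
}

Job setup would then be roughly: set the input format to HttpInputFormat, put the URL list into httpinput.urls, and run with zero reducers if it stays a pure copy job.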

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <davidparks21@yahoo.com> wrote:

I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly.

Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours, and I'd hope I can optimize the number of connections (for example: I'm limited to 5 connections to one host, whereas another host has a 3-connection limit, so I want to maximize throughput as much as possible). Also, the set of files to download will change a little over time, so the input list should be easily configurable (in a config file or equivalent).

 - Is it normal to perform batch downloads like this *before* running the MapReduce job?
 - Or is it OK to include such steps in the job itself?
 - It seems handy to keep the whole process as one neat package in Hadoop if possible.
 - What class should I implement if I wanted to manage this myself? Would I just extend TextInputFormat, for example, and do the HTTP processing there? Or would I be creating a FileSystem?

Thanks,
David




-- 
Regards,
Venkatesh


“Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.”

- Antoine de Saint-Exupéry


 
