From: "David Parks" <davidparks21@yahoo.com>
To: user@hadoop.apache.org
Subject: RE: Large input files via HTTP
Date: Wed, 24 Oct 2012 11:06:31 +0700

I might very well be overthinking this. But I have a cluster I’m firing up on EC2 that I want to keep utilized. I have some other unrelated jobs that don’t need to wait for the downloads, so I don’t want to fill all the map slots with long downloads. I’d rather the other jobs run in parallel while the downloads are happening.

 

 

From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 1:10 PM
To: user@hadoop.apache.org
Subject: Re: Large input files via HTTP

 

Well, it depends. :-) If the XML cannot be split, then you'd end up with only one map task for the entire set of files. I think it'd make sense to have multiple splits so you can get an even spread of the copies across maps, retry only the failed copies, and not have to manage the scheduling of the downloads yourself.

 

Look at DistCp for some intelligent splitting.

 

What are the constraints that you are working with?

On Mon, Oct 22, 2012 at 5:59 PM, David Parks <davidparks21@yahoo.com> wrote:

Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download), and have that one map task kick off the downloads in multiple threads? That way I can easily manage the most efficient download pattern within the map task, and my output is emitted as key/value pairs straight to the reduce step.
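A rough sketch of that kind of threaded download mapper, assuming the mapper receives the whole XML manifest as its value (e.g. via a whole-file input format); parseUrls() and the /data/incoming destination are hypothetical placeholders, not anything from the thread:

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ManifestDownloadMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text manifest, Context context)
      throws IOException, InterruptedException {
    final Configuration conf = context.getConfiguration();
    List<String> urls = parseUrls(manifest.toString());        // hypothetical XML parsing
    ExecutorService pool = Executors.newFixedThreadPool(5);    // per-host connection limits would go here
    List<Future<String>> results = new ArrayList<Future<String>>();
    for (final String url : urls) {
      results.add(pool.submit(new Callable<String>() {
        public String call() throws IOException {
          // stream the HTTP response straight into HDFS; closes both streams when done
          Path dest = new Path("/data/incoming", new Path(new URL(url).getPath()).getName());
          FileSystem fs = FileSystem.get(conf);
          IOUtils.copyBytes(new URL(url).openStream(), fs.create(dest), 64 * 1024, true);
          return dest.toString();
        }
      }));
    }
    pool.shutdown();
    // emit (source URL, HDFS path) as each download finishes; writes stay on the map thread
    for (int i = 0; i < urls.size(); i++) {
      try {
        context.write(new Text(urls.get(i)), new Text(results.get(i).get()));
      } catch (ExecutionException e) {
        throw new IOException("Download failed: " + urls.get(i), e);
      }
    }
  }

  private List<String> parseUrls(String xml) {
    // placeholder: extract the download URLs from the manifest however your schema requires
    throw new UnsupportedOperationException("not implemented in this sketch");
  }
}

One thing to watch with a single long-running mapper like this: it needs to report progress (context.progress()) periodically, or the framework will kill the task once the task timeout expires.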

 

 

From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
Sent: Tuesday, October 23, 2012 7:28 AM
To: user@hadoop.apache.org
Subject: Re: Large input files via HTTP


One possible way is to first create a list of files with <host:port, filePath> tuples. Then use a map-only job to pull each file using NLineInputFormat.
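A minimal sketch of how that might be wired up, assuming a manifest with one "host:port<TAB>filePath" entry per line and a Hadoop version that ships the new-API NLineInputFormat; the class names and the /data/incoming destination are made up for the example:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HttpFetchJob {

  // Each map task receives one manifest line, e.g. "files.example.com:80\t/exports/feed1.xml".
  public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String url = "http://" + parts[0] + parts[1];
      Path dest = new Path("/data/incoming", new Path(parts[1]).getName());
      FileSystem fs = FileSystem.get(context.getConfiguration());
      InputStream in = new URL(url).openStream();
      IOUtils.copyBytes(in, fs.create(dest), 64 * 1024, true);   // copy and close both streams
      context.write(new Text(url), new Text(dest.toString()));   // record where the file landed
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);             // one manifest line per map task
    NLineInputFormat.addInputPath(job, new Path(args[0]));    // the manifest file
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                                 // map-only copy job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A failed download then only re-runs that one split, and the scheduler spreads the copies across the cluster rather than piling them into a single task.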


Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job.
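An untested skeleton of that idea, with one split per URL so each map task streams one remote file; the class names and the httpinput.urls configuration key are invented for the example:

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpInputFormat extends InputFormat<LongWritable, Text> {

  public static final String URLS_KEY = "httpinput.urls";    // comma-separated URLs (invented key)

  // One split per URL, so each map task streams exactly one remote file.
  public static class HttpSplit extends InputSplit implements Writable {
    private String url;
    public HttpSplit() {}                                     // required for deserialization
    public HttpSplit(String url) { this.url = url; }
    public String getUrl() { return url; }
    @Override public long getLength() { return 0; }           // size unknown up front
    @Override public String[] getLocations() { return new String[0]; }  // no data locality
    @Override public void write(DataOutput out) throws IOException { out.writeUTF(url); }
    @Override public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String url : context.getConfiguration().get(URLS_KEY).split(",")) {
      splits.add(new HttpSplit(url.trim()));
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new HttpRecordReader();
  }

  // Streams the URL and hands each line to the mapper as it arrives.
  public static class HttpRecordReader extends RecordReader<LongWritable, Text> {
    private BufferedReader reader;
    private long lineNo = 0;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      URL url = new URL(((HttpSplit) split).getUrl());
      reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      key.set(lineNo++);
      value.set(line);
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; }     // total size unknown over HTTP
    @Override public void close() throws IOException { if (reader != null) reader.close(); }
  }
}

Job setup would then be roughly: set the input format to HttpInputFormat, put the URL list into httpinput.urls, and run with zero reducers if it stays a pure copy job.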

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <davidparks21@yahoo.com> wrote:

I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly.

Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours, and I'd hope I can optimize the number of connections (for example: I'm limited to 5 connections to one host, whereas another host has a 3-connection limit, so I want to maximize throughput as much as possible). Also, the set of files to download will change a little over time, so the input list should be easily configurable (in a config file or equivalent).

 - Is it normal to perform batch downloads like this *before* running the MapReduce job?
 - Or is it OK to include such steps in the job itself?
 - It seems handy to keep the whole process as one neat package in Hadoop if possible.
 - What class should I implement if I wanted to manage this myself? Would I just extend TextInputFormat, for example, and do the HTTP processing there? Or would I be creating a FileSystem?

Thanks,
David




-- 
Regards,
Venkatesh


“Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away.”

- Antoine de Saint-Exupéry


 
