hadoop-user mailing list archives

From Seetharam Venkatesh <venkat...@innerzeal.com>
Subject Re: Large input files via HTTP
Date Wed, 24 Oct 2012 16:25:06 GMT
That should work as well.

On Tue, Oct 23, 2012 at 9:06 PM, David Parks <davidparks21@yahoo.com> wrote:

> I might very well be overthinking this. But I have a cluster I’m firing up
> on EC2 that I want to keep utilized. I have some other unrelated jobs that
> don’t need to wait for the downloads, so I don’t want to fill all the map
> slots with long downloads. I’d rather the other jobs run in parallel while
> the downloads are happening.
>
> From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
> Sent: Tuesday, October 23, 2012 1:10 PM
> To: user@hadoop.apache.org
> Subject: Re: Large input files via HTTP
>
>
> Well, it depends. :-)  If the XML cannot be split, then you'd end up with
> only one map task for the entire set of files. I think it'd make sense to
> have multiple splits so you can get an even spread of copies across maps,
> retry only the failed copies, and avoid managing the scheduling of the
> downloads yourself.
>
> Look at DistCp for some intelligent splitting.
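>
> For reference, a typical DistCp invocation copies a whole tree between
> filesystems in a single map-only job (the paths below are placeholders):
>
>   hadoop distcp hdfs://nn1:8020/src hdfs://nn2:8020/dst
>
> The part worth borrowing for your case is how it turns a file listing into
> evenly sized map tasks and retries failed copies per task.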
>
> What are the constraints that you are working with?
>
> On Mon, Oct 22, 2012 at 5:59 PM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> Would it make sense to write a map job that takes an unsplittable XML file
> (which defines all of the files I need to download), and have that one map
> job kick off the downloads in multiple threads? This way I can easily
> manage the most efficient download pattern within the map job, and my
> output is emitted as key/value pairs straight to the reducer step.
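>
> Roughly what I'm picturing is the sketch below. It is only a sketch: the
> manifest format, parseManifest(), and fetch() are placeholders I'd flesh
> out, and the pool size would be tuned per host.
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.URL;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.Callable;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> // Sketch only: the single input value is the XML manifest; downloads fan
> // out to a small pool and one record is emitted per URL.
> public class ManifestDownloadMapper extends Mapper<Object, Text, Text, Text> {
>
>   @Override
>   protected void map(Object key, Text manifest, Context context)
>       throws IOException, InterruptedException {
>     List<String> urls = parseManifest(manifest.toString());
>     ExecutorService pool = Executors.newFixedThreadPool(5); // tune per host
>     List<Future<String>> results = new ArrayList<Future<String>>();
>     for (final String u : urls) {
>       results.add(pool.submit(new Callable<String>() {
>         public String call() throws Exception { return fetch(u); }
>       }));
>     }
>     // Emit from the mapper thread only; Context is not safe to share.
>     for (int i = 0; i < urls.size(); i++) {
>       try {
>         context.write(new Text(urls.get(i)), new Text(results.get(i).get()));
>       } catch (ExecutionException e) {
>         throw new IOException("download failed: " + urls.get(i), e);
>       }
>     }
>     pool.shutdown();
>   }
>
>   // Placeholder parser: assumes one <file>URL</file> element per manifest.
>   private List<String> parseManifest(String xml) {
>     List<String> urls = new ArrayList<String>();
>     java.util.regex.Matcher m =
>         java.util.regex.Pattern.compile("<file>(.*?)</file>").matcher(xml);
>     while (m.find()) urls.add(m.group(1));
>     return urls;
>   }
>
>   // Placeholder fetch: drains the stream and reports the byte count.
>   private String fetch(String url) throws IOException {
>     InputStream in = new URL(url).openStream();
>     try {
>       byte[] buf = new byte[8192];
>       long total = 0;
>       for (int n; (n = in.read(buf)) != -1; ) total += n;
>       return "bytes=" + total;
>     } finally {
>       in.close();
>     }
>   }
> }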
>
> From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
> Sent: Tuesday, October 23, 2012 7:28 AM
> To: user@hadoop.apache.org
> Subject: Re: Large input files via HTTP
>
> One possible way is to first create a list of files as tuples of
> <host:port, filePath>, then use a map-only job with NLineInputFormat to
> pull each file.
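>
> A minimal sketch of that mapper (the line format, output directory, and
> class name here are assumptions):
>
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.URL;
>
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> // Sketch: each input line is "host:port<TAB>filePath"; NLineInputFormat
> // hands each mapper N such lines (N=1 gives one download per task).
> public class HttpPullMapper
>     extends Mapper<LongWritable, Text, Text, NullWritable> {
>
>   @Override
>   protected void map(LongWritable offset, Text line, Context context)
>       throws IOException, InterruptedException {
>     String[] parts = line.toString().split("\t");
>     URL url = new URL("http://" + parts[0] + parts[1]);
>     // "/downloads" is an arbitrary target directory for this sketch.
>     Path out = new Path("/downloads", new Path(parts[1]).getName());
>     FileSystem fs = FileSystem.get(context.getConfiguration());
>     InputStream in = url.openStream();
>     FSDataOutputStream dst = fs.create(out);
>     IOUtils.copyBytes(in, dst, 4096, true); // closes both streams
>     context.write(new Text(out.toString()), NullWritable.get());
>   }
> }
>
> In the driver, setting the input format to the new-API NLineInputFormat and
> calling NLineInputFormat.setNumLinesPerSplit(job, 1) gives one download per
> map task, so a failed download is retried in isolation.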
>
> Another way is to write an HttpInputFormat and HttpRecordReader and stream
> the data in a map-only job.
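>
> A skeleton of the record reader could look like the following; the
> companion HttpInputFormat (one split per URL) and the assumption that a
> split encodes its URL in toString() are placeholders:
>
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.net.URL;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
>
> public class HttpRecordReader extends RecordReader<LongWritable, Text> {
>   private BufferedReader reader;
>   private long lineNo = 0;
>   private final LongWritable key = new LongWritable();
>   private final Text value = new Text();
>
>   @Override
>   public void initialize(InputSplit split, TaskAttemptContext context)
>       throws IOException {
>     // Assumption: the custom split's toString() returns its URL.
>     URL url = new URL(split.toString());
>     reader = new BufferedReader(new InputStreamReader(url.openStream()));
>   }
>
>   @Override
>   public boolean nextKeyValue() throws IOException {
>     String line = reader.readLine();
>     if (line == null) return false;
>     key.set(lineNo++);
>     value.set(line);
>     return true;
>   }
>
>   @Override public LongWritable getCurrentKey() { return key; }
>   @Override public Text getCurrentValue() { return value; }
>   @Override public float getProgress() { return 0; } // length unknown over HTTP
>   @Override public void close() throws IOException { reader.close(); }
> }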
>
> On Mon, Oct 22, 2012 at 1:54 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> I want to create a MapReduce job which reads many multi-gigabyte input
> files from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to acquire the files in the Hadoop job
> itself? I expect the initial downloads to take many hours, and I'd hope I
> can optimize the number of connections (for example, I'm limited to 5
> connections to one host, whereas another host has a 3-connection limit, so
> I want to maximize as much as possible). Also, the set of files to download
> will change a little over time, so the input list should be easily
> configurable (in a config file or equivalent).
>
>  - Is it normal to perform batch downloads like this *before* running the
> MapReduce job?
>  - Or is it OK to include such steps in the job?
>  - It seems handy to keep the whole process as one neat package in Hadoop
> if possible.
>  - What class should I implement if I wanted to manage this myself? Would I
> just extend TextInputFormat, for example, and do the HTTP processing there?
> Or would I be creating a FileSystem?
>
> Thanks,
> David
>
> --
> Regards,
> Venkatesh
>
> “Perfection (in design) is achieved not when there is nothing more to add,
> but rather when there is nothing more to take away.”
>
> - Antoine de Saint-Exupéry
>



-- 
Regards,
Venkatesh

Phone: (408) 658-8368
EMail: Venkatesh@InnerZeal.com

http://in.linkedin.com/in/seetharamvenkatesh
http://about.me/SeetharamVenkatesh

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”
- Antoine de Saint-Exupéry
