hadoop-user mailing list archives

From: Seetharam Venkatesh <venkat...@innerzeal.com>
Subject: Re: Large input files via HTTP
Date: Tue, 23 Oct 2012 06:09:36 GMT
Well, it depends. :-)  If the XML cannot be split, then you'd end up with
only one map task for the entire set of files. I think it'd make sense to
have multiple splits so you get an even spread of copies across maps, can
retry only the failed copies, and don't have to manage the scheduling of the
downloads yourself.

Look at DistCp for some intelligent splitting.
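Roughly, the splitting trick there is to spread the file listing across maps so
that each map ends up copying about the same number of bytes. Purely as an
illustration (not DistCp's actual code), here is a sketch of that idea in plain
Java, assuming you can learn each file's size up front, say with an HTTP HEAD
request:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch only: bucket (url -> size-in-bytes) entries into numMaps groups of
// roughly equal total size, largest files first. Each group would then become
// the download list for one map task.
public class SizeBalancedSplitter {
  public static List<List<String>> split(Map<String, Long> fileSizes, int numMaps) {
    List<List<String>> buckets = new ArrayList<List<String>>();
    long[] bucketBytes = new long[numMaps];
    for (int i = 0; i < numMaps; i++) {
      buckets.add(new ArrayList<String>());
    }
    List<Map.Entry<String, Long>> entries =
        new ArrayList<Map.Entry<String, Long>>(fileSizes.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
      public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
        return b.getValue().compareTo(a.getValue());   // descending by size
      }
    });
    for (Map.Entry<String, Long> e : entries) {
      int lightest = 0;                                // always fill the emptiest bucket
      for (int i = 1; i < numMaps; i++) {
        if (bucketBytes[i] < bucketBytes[lightest]) lightest = i;
      }
      buckets.get(lightest).add(e.getKey());
      bucketBytes[lightest] += e.getValue();
    }
    return buckets;
  }
}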

What are the constraints that you are working with?

On Mon, Oct 22, 2012 at 5:59 PM, David Parks <davidparks21@yahoo.com> wrote:

> Would it make sense to write a map job that takes an unsplittable XML file
> (which defines all of the files I need to download) and have that one map job
> kick off the downloads in multiple threads? This way I can easily manage
> the most efficient download pattern within the map job, and my output is
> emitted as key/value pairs straight to the reducer step.
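For illustration only, a rough sketch of that single-map idea follows.
ManifestDownloadMapper and parseUrls are invented names, and it assumes the job
is configured with a whole-file (non-splittable) input format so one map() call
sees the entire manifest:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes a non-splittable, whole-file input format hands this map() call the
// entire XML manifest as its value.
public class ManifestDownloadMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text manifest, final Context ctx)
      throws IOException, InterruptedException {
    List<String> urls = parseUrls(manifest.toString());      // your XML parsing goes here
    ExecutorService pool = Executors.newFixedThreadPool(5);  // enforce per-host limits yourself
    List<Future<Void>> results = new ArrayList<Future<Void>>();
    for (final String url : urls) {
      results.add(pool.submit(new Callable<Void>() {
        public Void call() throws Exception {
          BufferedReader in = new BufferedReader(
              new InputStreamReader(new URL(url).openStream(), "UTF-8"));
          try {
            String line;
            while ((line = in.readLine()) != null) {
              synchronized (ctx) {                            // Context is not thread-safe
                ctx.write(new Text(url), new Text(line));
              }
            }
          } finally {
            in.close();
          }
          return null;
        }
      }));
    }
    pool.shutdown();
    for (Future<Void> f : results) {
      try {
        f.get();                                              // surface any download failure
      } catch (ExecutionException e) {
        throw new IOException("download failed", e.getCause());
      }
    }
  }

  private List<String> parseUrls(String xml) {
    return new ArrayList<String>();                           // placeholder: parse your manifest
  }
}

Also worth a look: org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper runs
an ordinary mapper with a pool of threads, which may save you most of the
hand-rolled threading above.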
>
> From: vseetharam@gmail.com [mailto:vseetharam@gmail.com] On Behalf Of Seetharam Venkatesh
> Sent: Tuesday, October 23, 2012 7:28 AM
> To: user@hadoop.apache.org
> Subject: Re: Large input files via HTTP
>
> One possible way is to first create a list of files as <host:port, filePath>
> tuples. Then use a map-only job to pull each file using NLineInputFormat.
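A minimal sketch of that list-file approach, written against the Hadoop 2
mapreduce API. HttpFetchJob and UrlFetchMapper are invented names, and the line
format of URL, a tab, then an HDFS destination path is just one possible choice:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HttpFetchJob {

  // Each input line is "http://host:port/path<TAB>/hdfs/destination/file".
  public static class UrlFetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      String url = parts[0];
      Path dest = new Path(parts[1]);
      FileSystem fs = FileSystem.get(ctx.getConfiguration());
      InputStream in = new URL(url).openStream();
      OutputStream out = fs.create(dest, true);
      try {
        IOUtils.copyBytes(in, out, 64 * 1024);                 // stream straight into HDFS
      } finally {
        in.close();
        out.close();
      }
      ctx.write(new Text(url), new Text("fetched to " + dest));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setInputFormatClass(NLineInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));      // the URL list file
    NLineInputFormat.setNumLinesPerSplit(job, 1);              // one URL per map task
    job.setMapperClass(UrlFetchMapper.class);
    job.setNumReduceTasks(0);                                  // map-only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // small "what was fetched" report
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With one line per split, a failed download only re-runs that single map task,
and the framework does the scheduling and retries for you.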
>
> Another way is to write an HttpInputFormat and HttpRecordReader and stream
> the data in a map-only job.
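A very rough skeleton of that idea follows. HttpInputFormat, HttpSplit and
HttpRecordReader are invented names, the URLs come from a comma-separated job
property purely to keep the sketch short, and retries, timeouts and per-host
connection limits are all left out:

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpInputFormat extends InputFormat<Text, Text> {

  // One split (and therefore one map task) per URL.
  @Override
  public List<InputSplit> getSplits(JobContext ctx) {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String url : ctx.getConfiguration().get("http.fetch.urls").split(",")) {
      splits.add(new HttpSplit(url.trim()));
    }
    return splits;
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new HttpRecordReader();
  }

  // A split that just carries the URL to fetch.
  public static class HttpSplit extends InputSplit implements Writable {
    private String url;
    public HttpSplit() { }                                     // needed for deserialization
    HttpSplit(String url) { this.url = url; }
    public long getLength() { return 0; }                      // size unknown before the request
    public String[] getLocations() { return new String[0]; }   // no data locality over HTTP
    public void write(DataOutput out) throws IOException { out.writeUTF(url); }
    public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
    String getUrl() { return url; }
  }

  // Streams the HTTP response and presents each line as one (url, line) record.
  public static class HttpRecordReader extends RecordReader<Text, Text> {
    private BufferedReader reader;
    private final Text key = new Text();
    private final Text value = new Text();

    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      String url = ((HttpSplit) split).getUrl();
      key.set(url);
      reader = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
    }
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      value.set(line);
      return true;
    }
    public Text getCurrentKey() { return key; }
    public Text getCurrentValue() { return value; }
    public float getProgress() { return 0.0f; }                // content length unknown
    public void close() throws IOException { if (reader != null) reader.close(); }
  }
}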
>
> On Mon, Oct 22, 2012 at 1:54 AM, David Parks <davidparks21@yahoo.com>
> wrote:
>
> I want to create a MapReduce job which reads many multi-gigabyte input files
> from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to acquire the files in the Hadoop job
> itself? I expect the initial downloads to take many hours and I'd hope I
> can optimize the number of connections (for example, I'm limited to 5 connections
> to one host, whereas another host has a 3-connection limit, so I'd like to use as
> many connections as each host allows).  Also the set of files to download will change a little over
> time so the input list should be easily configurable (in a config file or
> equivalent).
>
>  - Is it normal to perform batch downloads like this *before* running the
> mapreduce job?
>  - Or is it ok to include such steps in with the job?
>  - It seems handy to keep the whole process as one neat package in Hadoop
> if possible.
>  - What class should I implement if I wanted to manage this myself? Would I
> just extend TextInputFormat for example, and do the HTTP processing there?
> Or would I be creating a FileSystem?
>
> Thanks,
> David
>
> --
> Regards,
> Venkatesh
>
> “Perfection (in design) is achieved not when there is nothing more to add,
> but rather when there is nothing more to take away.”
>
> - Antoine de Saint-Exupéry
