hadoop-common-user mailing list archives

From Tom White <...@cloudera.com>
Subject Re: Multiple Input Paths
Date Mon, 09 Nov 2009 04:39:39 GMT
MultipleInputs is available from Hadoop 0.19 onwards (in
org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input
for the new API in later versions).


On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant
<mark.vigeant@riskmetrics.com> wrote:
> Amogh,
> That sounds so awesome! Yeah, I wish I had that class now. Do you have any tips on how
> to create such a delegating class? The best I can come up with is to just submit both files
> to the mapper using multiple input paths and then have an if statement at the beginning of
> the map that checks which file it's dealing with, but I'm skeptical that I can even make that
> work... Is there a way you know of that I could submit 2 mapper classes to the job?
> -----Original Message-----
> From: Amogh Vasekar [mailto:amogh@yahoo-inc.com]
> Sent: Wednesday, November 04, 2009 1:50 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Multiple Input Paths
> Hi Mark,
> A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs.
> This would allow you to have a different InputFormat and mapper depending on the path you are
> getting the split from. It uses special Delegating[mapper/input] classes to resolve this.
> I understand backporting this is more or less out of the question, but the ideas there might provide
> pointers to help you solve your current problem.
> Just a thought :)
> Amogh
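
A minimal sketch of the delegation idea described above, in plain Java with no Hadoop dependency. It is not Hadoop's implementation: the class PathDelegatingMapper, its methods, and the example paths are hypothetical names chosen for illustration. Each input path prefix is registered with its own mapper function, and every record is handed to the mapper whose prefix matches the path the record came from, which generalizes the "if statement at the top of map" approach.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of per-path delegation; not a Hadoop class. */
class PathDelegatingMapper {
    // Prefixes are checked in registration order; the first match wins.
    private final Map<String, Function<String, String>> mappersByPrefix =
            new LinkedHashMap<>();

    /** Associates a mapper function with every input under pathPrefix. */
    void addInputPath(String pathPrefix, Function<String, String> mapper) {
        mappersByPrefix.put(pathPrefix, mapper);
    }

    /** Routes one record to the mapper registered for its input path. */
    String map(String inputPath, String record) {
        for (Map.Entry<String, Function<String, String>> e : mappersByPrefix.entrySet()) {
            if (inputPath.startsWith(e.getKey())) {
                return e.getValue().apply(record);
            }
        }
        throw new IllegalArgumentException("No mapper registered for " + inputPath);
    }
}
```

Registering "/data/trades/" and "/data/quotes/" (made-up paths) with two different functions would let a single job-level entry point serve both inputs, so the per-file logic stays in separate mappers instead of in one branching map method.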
> On 11/3/09 8:44 PM, "Mark Vigeant" <mark.vigeant@riskmetrics.com> wrote:
> Hey Vipul
> No, I haven't concatenated my files yet, and I was just thinking over how to approach
> the issue of multiple input paths.
> I actually did what Amandeep hinted at: we wrote our own XMLInputFormat and
> XMLRecordReader. When configuring the job in my driver I set job.setInputFormatClass(XMLFileInputFormat.class),
> and what it does is send chunks of XML to the mapper as opposed to lines of text or whole
> files. So I specified the Line Delimiter in the XMLRecordReader (i.e. <startTag>), and
> everything in between the tags <startTag> and </startTag> is sent to the mapper.
> Inside the map function is where I parse the data and write it to the table.
> What I have to do now is just figure out how to set the Line Delimiter to be something
> common in both XML files I'm reading. Currently I have 2 mapper classes and thus 2 submitted
> jobs, which is really inefficient and time consuming.
> Make sense at all? Sorry if it doesn't, feel free to ask more questions.
> Mark
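
A rough sketch of the chunking step described above, written as plain Java over an in-memory string rather than as a real Hadoop RecordReader. The class XmlChunker and its chunks method are hypothetical, and the sketch assumes the start tag carries no attributes and that records with the same tag are not nested. It also assumes each chunk should include the enclosing tags, so the mapper receives a parseable fragment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits XML text into records delimited by one tag. */
class XmlChunker {
    /** Returns every <tagName>...</tagName> span, tags included, in document order. */
    static List<String> chunks(String xml, String tagName) {
        String start = "<" + tagName + ">";
        String end = "</" + tagName + ">";
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int s = xml.indexOf(start, from);
            if (s < 0) {
                break; // no further records
            }
            int e = xml.indexOf(end, s + start.length());
            if (e < 0) {
                break; // unterminated record: skipped in this sketch
            }
            out.add(xml.substring(s, e + end.length()));
            from = e + end.length();
        }
        return out;
    }
}
```

Because the tag name is a parameter, the same routine could be called with a different delimiter for each of the two XML files, which is one way around needing a delimiter common to both.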
> -----Original Message-----
> From: Vipul Sharma [mailto:sharmavipul@gmail.com]
> Sent: Monday, November 02, 2009 7:48 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Multiple Input Paths
> Mark,
> Were you able to concatenate both the XML files together? What did you do to
> keep the resulting XML well formed?
> Regards,
> Vipul Sharma,
> Cell: 281-217-0761
