hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Multiple Input Paths
Date Wed, 04 Nov 2009 06:50:16 GMT
Hi Mark,
A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. This
would allow you to use a different InputFormat and Mapper depending on the path the split
comes from. It uses special delegating classes (DelegatingMapper and DelegatingInputFormat)
to resolve this. I understand backporting this is more or less out of the question, but the
ideas there might provide pointers to help you solve your current problem.
Just a thought :)
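
To make the delegation idea concrete, here is a minimal plain-Java sketch. It is not Hadoop
code: the class name PathDelegatingDispatcher, its methods, and the /data/... paths in the
usage note are hypothetical, and a simple prefix match stands in for whatever lookup the real
delegating classes perform. In releases that ship MultipleInputs, the equivalent registration
is done with MultipleInputs.addInputPath(job, path, inputFormatClass, mapperClass).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: routes each record to a handler chosen by its input path. */
final class PathDelegatingDispatcher {
    // Input path prefix -> handler for records read from that path (hypothetical design).
    private final Map<String, Function<String, String>> handlersByPrefix = new LinkedHashMap<>();

    /** Registers a handler for every split whose path starts with the given prefix. */
    void addInputPath(String pathPrefix, Function<String, String> handler) {
        handlersByPrefix.put(pathPrefix, handler);
    }

    /** Finds the first registered prefix matching the split path and delegates to its handler. */
    String map(String splitPath, String record) {
        for (Map.Entry<String, Function<String, String>> entry : handlersByPrefix.entrySet()) {
            if (splitPath.startsWith(entry.getKey())) {
                return entry.getValue().apply(record);
            }
        }
        throw new IllegalArgumentException("No handler registered for " + splitPath);
    }
}
```

Registering one handler for "/data/feedA/" and another for "/data/feedB/" lets a single job
process both inputs, which is the effect of having one mapper class per path without
submitting two jobs.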


On 11/3/09 8:44 PM, "Mark Vigeant" <mark.vigeant@riskmetrics.com> wrote:

Hey Vipul

No, I haven't concatenated my files yet, and I was just thinking over how to approach the issue
of multiple input paths.

I actually did what Amandeep hinted at, which was to write our own XMLInputFormat and XMLRecordReader.
When configuring the job in my driver I set job.setInputFormatClass(XMLFileInputFormat.class),
and what it does is send chunks of XML to the mapper as opposed to lines of text or whole
files. So I specified the line delimiter in the XMLRecordReader (i.e. <startTag>), and
everything between the tags <startTag> and </startTag> is sent to the mapper.
The map function is where I parse the data and write it to the table.
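
The chunking that the record reader performs can be sketched in plain Java as below. This is
only an illustration of the technique, not the XMLRecordReader from this thread: the class
XmlChunkExtractor and its extract method are hypothetical, it works on an in-memory string
rather than on an input split, and it assumes the start tag has no attributes and that
records with the same tag are not nested.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: cuts text into chunks delimited by a start tag and an end tag. */
final class XmlChunkExtractor {
    /** Returns every chunk from startTag through endTag (both included), in document order. */
    static List<String> extract(String text, String startTag, String endTag) {
        List<String> chunks = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record at the end of the text: skip it
            }
            end += endTag.length();
            chunks.add(text.substring(start, end));
            from = end; // continue scanning after this record
        }
        return chunks;
    }
}
```

Each returned chunk corresponds to one value handed to the map function; text outside the
tags is ignored.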

What I have to do now is just figure out how to set the line delimiter to be something common
to both XML files I'm reading. Currently I have 2 mapper classes and thus 2 submitted jobs,
which is really inefficient and time consuming.

Make sense at all? Sorry if it doesn't; feel free to ask more questions.


-----Original Message-----
From: Vipul Sharma [mailto:sharmavipul@gmail.com]
Sent: Monday, November 02, 2009 7:48 PM
To: common-user@hadoop.apache.org
Subject: RE: Multiple Input Paths


Were you able to concatenate both the XML files together? What did you do to
keep the resulting XML well formed?

Vipul Sharma,
Cell: 281-217-0761
