hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Hadoop with Multiple Inputs and Outputs
Date Thu, 03 Dec 2009 09:35:46 GMT
Hi,
Please try removing the combiner and running the job again.
As far as I know, <k,v> pairs written through multiple outputs from within a mapper do
not take part in the sort and shuffle phase. Your combiner is the same class as your reducer,
which uses mos, and that might be a problem on the map side. If I were to guess, mos writes to
a different file from the default map output, and the default key format is LongWritable. If
nothing is written to the default output, maybe that default isn't changed? Just a thought.
To check which input file the current map task is consuming, you can read "map.input.file"
from the job conf instead of working it out from the split name.

Amogh


On 12/3/09 12:17 PM, "James R. Leek" <leek2@llnl.gov> wrote:

I've been trying to figure out how to do a set difference in Hadoop.  I
would like to take two files and remove the values they have in common.
Let's say I have two bags, 'students' and 'employees'.  I
want to find which students are just students, and which employees are
just employees.  So, an example:

Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also
employees, or: (Dave).

However, what I want is the values unique to each bag:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)


I was able to do this in Pig, but I think I should be able to do it in
one MapReduce pass.  (With Hadoop 0.20.1.)  I read from two files and
attach the file name to each record as its value.  (Students and Employees,
in this example.  My actual problem is on DNA: bacteria and viruses.)
Then I output from the reducer only if I get exactly one value for a
given key.  However, I've had some real trouble figuring out
MultipleOutputs and the multiple inputs.  I've attached my code.  I'm
getting this error, which is a total mystery to me:

09/12/02 22:33:52 INFO mapred.JobClient: Task Id : attempt_200911301448_0019_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


Thanks,
Jim
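
For reference, the single-pass idea described in the message above (tag each record with the name of the file it came from, then emit a key from the reducer only when a single source contributed it) can be sketched in plain Java without any Hadoop classes. This is only an illustration of the logic, not the attached code: the class name SetDifferenceSketch, the method names, and the "Only_" output prefix are all hypothetical, and an in-memory map stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical, Hadoop-free sketch of the tag-and-count set difference.
public class SetDifferenceSketch {

    // Stand-in for map + shuffle: each value becomes a key, tagged with the
    // name of the source it was read from; tags are grouped per key.
    public static Map<String, Set<String>> mapAndShuffle(Map<String, List<String>> sources) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> source : sources.entrySet()) {
            for (String value : source.getValue()) {
                grouped.computeIfAbsent(value, k -> new TreeSet<>()).add(source.getKey());
            }
        }
        return grouped;
    }

    // Stand-in for reduce: keep a key only if exactly one source contributed
    // it, and route it to an output named after that source.
    public static Map<String, List<String>> reduce(Map<String, Set<String>> grouped) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            if (entry.getValue().size() == 1) {
                String source = entry.getValue().iterator().next();
                outputs.computeIfAbsent("Only_" + source, k -> new ArrayList<>())
                       .add(entry.getKey());
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new TreeMap<>();
        sources.put("Student", Arrays.asList("Jane", "John", "Dave"));
        sources.put("Employee", Arrays.asList("Dave", "Sue", "Anne"));
        // Prints {Only_Employee=[Anne, Sue], Only_Student=[Jane, John]}
        System.out.println(reduce(mapAndShuffle(sources)));
    }
}
```

The sketch collects source tags in a set rather than counting values, so a name that appears twice in the same file is still treated as belonging to that file only. Counting raw values, as described in the message, would drop such a name; which behaviour is wanted depends on the data.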

