hadoop-common-user mailing list archives

From "James R. Leek" <le...@llnl.gov>
Subject Hadoop with Multiple Inputs and Outputs
Date Thu, 03 Dec 2009 06:47:05 GMT
I've been trying to figure out how to do a set difference in Hadoop.  I 
would like to take two files and remove the values they have in 
common.  Let's say I have two bags, 'students' and 'employees'.  I 
want to find which students are just students, and which employees are 
just employees.  So, an example:

Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also 
employees, or: (Dave).

However, what I want is the distinct values:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)
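
For reference, the expected result for this example can be reproduced in plain Java, outside Hadoop. This is only a minimal sketch under stated assumptions: each name is tagged with the source it came from, and a name is kept only when it carries exactly one distinct tag. The class and method names (SetDifferenceSketch, tag, onlyIn, example) are hypothetical and chosen for illustration; no Hadoop API is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Hadoop.
final class SetDifferenceSketch {

    // "Map" side: record which source each name was seen in.
    static void tag(Map<String, Set<String>> sourcesByName,
                    List<String> names, String source) {
        for (String name : names) {
            sourcesByName
                .computeIfAbsent(name, k -> new LinkedHashSet<>())
                .add(source);
        }
    }

    // "Reduce" side: keep names whose only source is the given one.
    static List<String> onlyIn(Map<String, Set<String>> sourcesByName,
                               String source) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : sourcesByName.entrySet()) {
            Set<String> sources = entry.getValue();
            if (sources.size() == 1 && sources.contains(source)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Builds the Students/Employees example from the message.
    static Map<String, Set<String>> example() {
        Map<String, Set<String>> sourcesByName = new LinkedHashMap<>();
        tag(sourcesByName, Arrays.asList("Jane", "John", "Dave"), "Students");
        tag(sourcesByName, Arrays.asList("Dave", "Sue", "Anne"), "Employees");
        return sourcesByName;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> sourcesByName = example();
        System.out.println("Only_Student: " + onlyIn(sourcesByName, "Students"));
        System.out.println("Only_Employee: " + onlyIn(sourcesByName, "Employees"));
    }
}
```

Collecting the tags in a set, rather than counting values, means a name repeated within one file is still treated as belonging to that file alone.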


I was able to do this in Pig, but I think I should be able to do it in 
one MapReduce pass.  (With Hadoop 0.20.1.)  I read from two files and 
attach the file name to each record as its value.  (Students and 
Employees in this example; my actual problem involves DNA, bacteria and 
viruses.)  The reducer then outputs a key only if it received exactly 
one value for that key.  However, I've had some real trouble figuring 
out MultipleOutputs and the multiple inputs.  I've attached my code.  
I'm getting this error, which is a total mystery to me: 

09/12/02 22:33:52 INFO mapred.JobClient: Task Id : attempt_200911301448_0019_m_000000_2, Status : FAILED
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
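
One possible reading of the trace above: the frame org.apache.hadoop.mapreduce.Mapper.map suggests that the base class's default map ran rather than a user-defined one, which would pass the input key (a LongWritable) through unchanged. In Java this can happen when a subclass method's parameter list does not match the base method exactly, so it overloads instead of overriding. The sketch below illustrates only that language behaviour. BaseMapper, OverloadingMapper, OverridingMapper and OverrideSketch are hypothetical stand-ins, not Hadoop classes, and the attached code may be failing for a different reason.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a framework base class; NOT Hadoop's Mapper.
class BaseMapper<KI, VI> {
    // Default behaviour: emit the input key unchanged.
    protected void map(KI key, VI value, List<Object> out) {
        out.add(key);
    }

    // The "framework" always dispatches to the three-argument map above.
    final void run(KI key, VI value, List<Object> out) {
        map(key, value, out);
    }
}

// The third parameter type differs (ArrayList, not List), so this method
// is an overload. run() never calls it and the default map is used.
class OverloadingMapper extends BaseMapper<Long, String> {
    protected void map(Long key, String value, ArrayList<Object> out) {
        out.add(value);
    }
}

// Same parameter list as the base method; @Override makes the compiler
// reject the method if the signatures ever drift apart.
class OverridingMapper extends BaseMapper<Long, String> {
    @Override
    protected void map(Long key, String value, List<Object> out) {
        out.add(value);
    }
}

final class OverrideSketch {
    // Runs one record through the mapper and returns what it emitted.
    static Object firstEmitted(BaseMapper<Long, String> mapper) {
        List<Object> out = new ArrayList<>();
        mapper.run(0L, "Dave", out);
        return out.get(0);
    }

    public static void main(String[] args) {
        // Emits the Long key: the subclass method was never invoked.
        System.out.println(firstEmitted(new OverloadingMapper()).getClass().getSimpleName());
        // Emits the String value: the subclass method replaced the default.
        System.out.println(firstEmitted(new OverridingMapper()).getClass().getSimpleName());
    }
}
```

Annotating the intended map method with @Override is a cheap way to test this hypothesis, since a signature mismatch then becomes a compile error.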


Thanks,
Jim
