hadoop-mapreduce-user mailing list archives

From "Ghigliotti, Matthew" <Matthew.Ghiglio...@garmin.com>
Subject RE: Self-joins with
Date Thu, 10 Feb 2011 22:22:34 GMT
I'm inclined to say that this is (another) bug in MultipleInputs. The source code shows that
the problem comes from the getMapperTypeMap() method, which builds a mapping from input paths
to mapper classes. Since you're reusing the same path for multiple mappers, the earlier entry
in the Configuration object is overwritten.
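
To make the overwrite concrete, here is a rough sketch in plain Java. It is a simplified model,
not the Hadoop source: it assumes the (Path, Mapper) pairs are stored as one comma-separated
string of "path;mapper" entries and later read back into a map keyed by path. The class name,
the method bodies, and the path /data/input.xml are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MapperTypeMapSketch {
    // Assumed model of addInputPath: append "path;mapper" to a comma-separated string.
    static String addInputPath(String conf, String path, String mapper) {
        String entry = path + ";" + mapper;
        return conf.isEmpty() ? entry : conf + "," + entry;
    }

    // Assumed model of getMapperTypeMap: rebuild a path -> mapper map from that string.
    static Map<String, String> getMapperTypeMap(String conf) {
        Map<String, String> mapperMap = new HashMap<>();
        for (String pair : conf.split(",")) {
            String[] parts = pair.split(";");
            // A second entry with the same path replaces the first one here.
            mapperMap.put(parts[0], parts[1]);
        }
        return mapperMap;
    }

    public static void main(String[] args) {
        String conf = "";
        conf = addInputPath(conf, "/data/input.xml", "JoinMapperLeft");
        conf = addInputPath(conf, "/data/input.xml", "JoinMapperRight");
        // Both entries are in the string, but only one survives in the map.
        System.out.println(conf);
        System.out.println(getMapperTypeMap(conf));
    }
}
```

In this model both entries are still present in the stored string, but the map ends up with a
single key, so only the mapper registered last is ever applied to the file.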

A while back, I pointed out a different bug in MultipleInputs, where you could not use input
paths which utilized globs with commas (such as "/data/{January,February,March}.txt"). Since
commas are used as delimiters to separate (Path, Mapper) and (Path, InputFormat) pairs within
the Configuration object, such paths-with-comma-globs explode horribly.
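
The comma problem can be sketched the same way. Again this is a simplified model rather than the
actual Hadoop code: it assumes the entries are recovered by splitting the stored string on
commas. The class name and the mapper name MonthMapper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CommaGlobSketch {
    // Assumed model: entries are recovered by splitting the stored string on commas.
    static List<String> splitEntries(String conf) {
        return new ArrayList<>(Arrays.asList(conf.split(",")));
    }

    public static void main(String[] args) {
        // One intended entry: a glob path containing commas, paired with a mapper.
        String conf = "/data/{January,February,March}.txt;MonthMapper";
        for (String fragment : splitEntries(conf)) {
            // The glob is cut at each comma, so the fragments are no longer
            // valid "path;mapper" pairs.
            System.out.println(fragment);
        }
    }
}
```

The single intended entry comes back as three fragments, and the first two contain no ";" at
all, so any code that expects a "path;mapper" pair in each fragment fails on them.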

*Matthew Ghigliotti*

From: Leonidas Fegaras [mailto:leofeg@hotmail.com]
Sent: Thursday, February 10, 2011 12:21 PM
To: mapreduce-user@hadoop.apache.org
Subject: Self-joins with

I am trying to do a self-join on a file using MultipleInputs on Hadoop 0.21.0. A self-join is
when you join a file with itself (for example, if you want to dereference the idrefs in an XML
document). I use the following:
MultipleInputs.addInputPath(job,new Path(file1),TextInputFormat.class,JoinMapperLeft.class);
MultipleInputs.addInputPath(job,new Path(file2),TextInputFormat.class,JoinMapperRight.class);

It works fine for two different files, file1 and file2. It also works if I copy file1 to file2
or if I create a symbolic link from file2 to file1. It does not work if the file1 path is
exactly the same as the file2 path (it parses file2 completely and applies the map function to
the file2 content, but it thinks that file1 is empty). Is this a bug or is it done
intentionally? Is there any way to do a self-join other than copying the file or creating a
symbolic link?
Thank you
Leonidas Fegaras
