hadoop-mapreduce-user mailing list archives

From Leonidas Fegaras <leo...@hotmail.com>
Subject Self-joins with
Date Thu, 10 Feb 2011 19:21:22 GMT

Hi,

I am trying to do a self-join on a file using MultipleInputs on Hadoop 0.21.0. A self-join is
when you join a file with itself (for example, if you want to dereference the idrefs in an
XML document). I use the following code:

    MultipleInputs.addInputPath(job, new Path(file1), TextInputFormat.class, JoinMapperLeft.class);
    MultipleInputs.addInputPath(job, new Path(file2), TextInputFormat.class, JoinMapperRight.class);

It works fine for two different files, file1 and file2. It also works if I copy file1 to file2
or if I create a symbolic link from file2 to file1. It does not work if the file1 path is exactly
the same as the file2 path: it parses file2 completely and applies the map function to the
file2 content, but it treats file1 as empty.

Is this a bug or is it intentional? Is there any way to do a self-join other than copying the
file or creating a symbolic link?

Thank you,
Leonidas Fegaras
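
For illustration only, the kind of self-join described above (dereferencing idrefs) can be
sketched in plain Java on an in-memory list. This is not MapReduce code and does not use
MultipleInputs; the class SelfJoinSketch, the record type Rec, and its fields id and idref are
hypothetical names chosen for the example. The same records play both the left and the right
side of the join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SelfJoinSketch {
    /** Hypothetical record: an element id plus the id it refers to (null if none). */
    static final class Rec {
        final String id;
        final String idref;
        Rec(String id, String idref) { this.id = id; this.idref = idref; }
    }

    /**
     * Joins the list with itself on left.idref == right.id and
     * returns one "leftId->rightId" string per match, in input order.
     */
    static List<String> selfJoin(List<Rec> records) {
        // "Right" side of the join: index the records by id.
        Map<String, Rec> byId = new HashMap<>();
        for (Rec r : records) {
            byId.put(r.id, r);
        }
        // "Left" side of the join: scan the same records and resolve each idref.
        List<String> out = new ArrayList<>();
        for (Rec left : records) {
            Rec right = (left.idref == null) ? null : byId.get(left.idref);
            if (right != null) {
                out.add(left.id + "->" + right.id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec("a", "b"));   // a refers to b
        recs.add(new Rec("b", null));  // b refers to nothing
        recs.add(new Rec("c", "a"));   // c refers to a
        System.out.println(selfJoin(recs));
    }
}
```

In the MapReduce setting the two roles would be played by JoinMapperLeft and JoinMapperRight
reading the same input, which is the case that fails above when both paths are identical.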