hadoop-common-user mailing list archives

From Matt Pouttu-Clarke <Matt.Pouttu-Cla...@icrossing.com>
Subject Re: Compare two huge files
Date Mon, 25 Oct 2010 17:30:34 GMT
@Shi Yu:
Yes, there are built-in functions to get the input file Path in the Mapper
(you can use these for counters by putting the file name in the counter
name); however, there are some issues if you use MultipleInputs in your job.
Here's some sample code I wrote to work around the issue (execute in a
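    // Needs: java.io.IOException, java.lang.reflect.Method,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.FileSplit
    Path filePath = null;
    Object obj = reporter.getInputSplit();
    if (!(obj instanceof FileSplit)) {
        // With MultipleInputs the split is a wrapper, not a FileSplit;
        // look up the wrapper's getInputSplit() method via reflection.
        Class<?> clazz = obj.getClass();
        try {
            Method inputSplitMethod = clazz.getDeclaredMethod(
                "getInputSplit", new Class[0]);
            // The wrapper class may not be public, so allow the call.
            inputSplitMethod.setAccessible(true);
            Object inputSplit = inputSplitMethod.invoke(obj, new Object[0]);
            if (inputSplit instanceof FileSplit) {
                filePath = ((FileSplit) inputSplit).getPath();
            }
        } catch (Exception e) {
            throw new IOException(
                "Could not find input FileSplit in Mapper", e);
        }
    } else {
        // Plain FileInputFormat job: the split is already a FileSplit.
        FileSplit fs = (FileSplit) obj;
        filePath = fs.getPath();
    }
    if (filePath == null) {
        throw new IOException(
            "Could not find input FileSplit in Mapper");
    }
    if (LOG.isDebugEnabled()) {
        LOG.debug("filePath: " + filePath);
    }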
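
For the counter idea mentioned above (putting the file name in the counter name), here is a minimal sketch in plain Java. It has no Hadoop dependency so it can be run on its own. The class PerFileCounters, the "records." prefix and the "InputFiles" group name are all hypothetical names chosen for illustration; in a real old-API mapper the in-memory map would be replaced by a call to reporter.incrCounter(group, counterName, 1).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: tallies records per input file.
final class PerFileCounters {
    private final Map<String, Long> counts = new TreeMap<String, Long>();

    // Derive a counter name from the last path component,
    // e.g. "hdfs://nn/data/a.txt" becomes "records.a.txt".
    static String counterName(String filePath) {
        int slash = filePath.lastIndexOf('/');
        String fileName = slash >= 0 ? filePath.substring(slash + 1) : filePath;
        return "records." + fileName;
    }

    // Stand-in for what a mapper would do once per record, roughly:
    //   reporter.incrCounter("InputFiles", counterName(filePath.toString()), 1);
    void increment(String filePath) {
        String name = counterName(filePath);
        Long old = counts.get(name);
        counts.put(name, old == null ? 1L : old + 1L);
    }

    // Current tally for the file, or 0 if it was never seen.
    long get(String filePath) {
        Long value = counts.get(counterName(filePath));
        return value == null ? 0L : value;
    }
}
```

Using only the last path component keeps counter names short, but two files with the same name in different directories would share a counter; use the full path string if that matters for your job.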

Using Cloudera Hadoop 0.20.1+169.113
Subversion  -r 6c765a47a9291470d3d8814c98155115d109d71

I also logged this with Cloudera, please vote for it if you want this fixed:


On 10/22/10 6:01 PM, "Shi Yu" <shiyu@uchicago.edu> wrote:

> My belated thanks for the nice advice. I have tried this, and it works. However,
> to produce the line numbers I had to rescan the files, add the line
> numbers, and then save them as new files. It took a long time because
> they are very big. Are there any built-in functions that could
> automatically provide the current file name (if there are multiple files)
> and the line numbers in Map/Reduce?
> Shi
> On 2010-10-20 21:16, Hieu Khac Le wrote:
>> How about using the line number as the key and the string at that line as
>> value.
>> On Oct 20, 2010, at 9:07 PM, Shi Yu<shiyu@uchicago.edu>  wrote:
>>> Hi,
>>> I have a problem comparing two huge files (100G each) that consist of string
>>> sequences. It is much like the text file comparison problem. I would like to
>>> find out how many strings differ between these two files in their
>>> natural order. Can this task be modeled as a map/reduce job? Currently I
>>> have no idea how to control the split of the maps and make sure the two input
>>> threads in one map task are pointing to the same positions in the files.
>>> Shi
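
The suggestion quoted above (line number as the key, the string at that line as the value) can be sketched in plain Java without Hadoop. This is only an illustration of the grouping logic, assuming each line already carries its line number; Hadoop's TextInputFormat supplies byte offsets as keys rather than line numbers, which is why the files had to be renumbered first. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a map/reduce job keyed by line number.
final class LineNumberCompare {

    // Counts the line numbers at which the two inputs differ.
    static long countDifferences(List<String> left, List<String> right) {
        // Plays the role of the shuffle: line number -> {left line, right line}.
        Map<Long, String[]> grouped = new TreeMap<Long, String[]>();
        emit(grouped, left, 0);
        emit(grouped, right, 1);

        // "Reduce" step: each key holds at most one line from each file.
        long differences = 0;
        for (String[] pair : grouped.values()) {
            // A line present in only one file also counts as a difference.
            if (pair[0] == null || !pair[0].equals(pair[1])) {
                differences++;
            }
        }
        return differences;
    }

    // "Map" step: emit (lineNumber, line) for one input, tagged by side.
    private static void emit(Map<Long, String[]> grouped,
                             List<String> lines, int side) {
        long lineNumber = 0;
        for (String line : lines) {
            String[] pair = grouped.get(lineNumber);
            if (pair == null) {
                pair = new String[2];
                grouped.put(lineNumber, pair);
            }
            pair[side] = line;
            lineNumber++;
        }
    }
}
```

In a real job the side tag would come from knowing which input file a record belongs to, which is where the file path lookup from the reply above comes in.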
