hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Re: Compare two huge files
Date Sat, 23 Oct 2010 01:01:25 GMT
Belated thanks for the good advice. I tried it and it works. However, 
to produce the line numbers I had to scan the files again, add the line 
numbers, and save the results as new files. That took a long time because 
the files are very big. Are there any built-in functions in Map/Reduce 
that automatically provide the current file name (when there are multiple 
input files) and the line number?
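
For reference, the extra numbering pass described above can be sketched as follows. This is only a minimal local illustration in plain Python, not Hadoop code: the function name tag_lines, the file name "a.txt", and the sample strings are all made up for the example. The pass is needed because, with Hadoop's TextInputFormat, the key handed to the mapper is the byte offset of the line within the file rather than its line number.

```python
import io
from typing import Iterator, TextIO, Tuple


def tag_lines(name: str, stream: TextIO) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, file_name, text) for each line, numbering from 1.

    The stream is read lazily, so memory use stays constant even for
    very large inputs.
    """
    for line_number, line in enumerate(stream, start=1):
        yield line_number, name, line.rstrip("\n")


# In-memory stand-in for one of the large files (hypothetical contents).
file_a = io.StringIO("ACGT\nTTGA\nCCAT\n")
records = list(tag_lines("a.txt", file_a))

for record in records:
    print(record)
```

Each record carries both the line number and the file it came from, which is the information a later job needs in order to pair up lines from the two inputs.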


On 2010-10-20 21:16, Hieu Khac Le wrote:
> How about using the line number as the key and the string at that line as the value?
> -------
> Please excuse typos and brief nature of this email sent from my mobile device
> On Oct 20, 2010, at 9:07 PM, Shi Yu <shiyu@uchicago.edu> wrote:
>> Hi,
>> I have a problem comparing two huge files (100 GB each), each consisting of a sequence of strings.
>> It is much like a text file comparison problem. I would like to find out how many strings
>> differ between the two files, in their natural order. Can this task be modeled as a map/reduce
>> job? Currently I have no idea how to control the map splits and make sure the two input
>> threads in one map task are pointing to the same positions in the files.
>> Shi
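
The idea discussed in this thread (line number as key, line text as value) can be simulated locally to show how the comparison maps onto map/reduce. The sketch below is plain Python rather than a real Hadoop job; the names map_phase, shuffle, reduce_phase, and count_differences, the labels "a" and "b", and the sample data are assumptions made for illustration. A line present in only one file is counted as a difference, which is one possible convention.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, Tuple[str, str]]  # (line_number, (file_label, text))


def map_phase(label: str, lines: Iterable[str]) -> Iterator[Record]:
    """Emit one record per line, keyed by its 1-based line number."""
    for line_number, text in enumerate(lines, start=1):
        yield line_number, (label, text)


def shuffle(records: Iterable[Record]) -> Dict[int, Dict[str, str]]:
    """Group records by key, as the framework would do between map and reduce."""
    grouped: Dict[int, Dict[str, str]] = defaultdict(dict)
    for line_number, (label, text) in records:
        grouped[line_number][label] = text
    return grouped


def reduce_phase(grouped: Dict[int, Dict[str, str]]) -> int:
    """Count line numbers whose text differs, or is missing, between 'a' and 'b'."""
    return sum(1 for by_file in grouped.values()
               if by_file.get("a") != by_file.get("b"))


def count_differences(lines_a: List[str], lines_b: List[str]) -> int:
    records = list(map_phase("a", lines_a)) + list(map_phase("b", lines_b))
    return reduce_phase(shuffle(records))


# Hypothetical sample data: line 2 differs and line 4 exists only in the second file.
sample_a = ["ACGT", "TTGA", "CCAT"]
sample_b = ["ACGT", "TTGC", "CCAT", "GGGG"]
differences = count_differences(sample_a, sample_b)
print(differences)
```

Because all records for a given line number end up in the same group, the two files never need to be split at matching positions; the key alone brings the corresponding lines together.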

Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799
