hadoop-common-user mailing list archives

From Shi Yu <sh...@uchicago.edu>
Subject Compare two huge files
Date Thu, 21 Oct 2010 02:07:01 GMT

I need to compare two huge files (100 GB each), each consisting of a
sequence of strings. It is essentially a text file comparison problem: I
would like to find out how many strings differ between the two files
when they are compared position by position, in their natural order. Can
this task be modeled as a map/reduce job? Currently I have no idea how
to control the map input splits so that the two input streams in one map
task point to the same positions in both files.
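
One possible way to frame this, sketched here as a small local Python
simulation rather than real Hadoop code: instead of forcing both files
into the same map split, each map task tags every string with its
position and its source file, and the reduce step compares the two
strings that share a position. The function names and the "A"/"B" tags
are hypothetical. The sketch also assumes a line number is available as
the key; with Hadoop's TextInputFormat the key is a byte offset, so a
real job would need fixed-length records or a separate numbering pass.

```python
from collections import defaultdict
from itertools import chain


def map_lines(tag, lines):
    """Map phase: emit (line_number, (tag, text)) for every line of one file.

    Assumption: the line number is known to the mapper. In a real Hadoop job
    this would have to come from fixed-length records or a numbering pass.
    """
    for line_number, text in enumerate(lines):
        yield line_number, (tag, text)


def shuffle(pairs):
    """Shuffle phase: group the tagged lines by their line number."""
    groups = defaultdict(dict)
    for line_number, (tag, text) in pairs:
        groups[line_number][tag] = text
    return groups


def reduce_count_differences(groups):
    """Reduce phase: count positions where the two files disagree.

    A position present in only one file counts as a difference.
    """
    differences = 0
    for record in groups.values():
        if record.get("A") != record.get("B"):
            differences += 1
    return differences


def count_differences(lines_a, lines_b):
    """Run the three phases locally on two in-memory sequences of strings."""
    pairs = chain(map_lines("A", lines_a), map_lines("B", lines_b))
    return reduce_count_differences(shuffle(pairs))
```

With this framing the two files never have to be split at matching
positions, at the cost of shuffling every string across the network
once.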

