hadoop-common-user mailing list archives

From Prashant <prashan...@imaginea.com>
Subject Re: Compare effectively TerraBytesofRecords with another Using Hadoop-(MapReduce)?
Date Mon, 26 Sep 2011 06:48:20 GMT
On 09/26/2011 11:58 AM, Sharan34140 wrote:
> I have had this doubt for quite a long time. It may even be absurd, but I
> need a solution.
> How do we efficiently compare 2 files, each containing terabytes of
> records?
> This could be related to external sorting as well,
> but I couldn't find an efficient solution to it.
> Can somebody please help me understand how to proceed?
Before proceeding, can you give us more details? For example, should the 
comparison go line by line through the files and display the diff, or 
should it work record by record? In either case one might have to 
override FileInputFormat so that it accepts the two files in question and 
processes them line by line or record by record. Then, in the map step, 
we can emit the diff with the record number as key and the diff as value. 
I have not tried this; it would be interesting if someone with experience 
could shed some light.
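
To make the idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The names paired_records, diff_map and compare are hypothetical: paired_records stands in for the custom input format that would hand the map step one record from each file, and diff_map stands in for the map function. It also assumes the two files line up record for record, which real input splits would not guarantee without extra work.

```python
from itertools import zip_longest


def paired_records(left_lines, right_lines):
    """Stand-in for a custom input format (hypothetical).

    Yields (record_number, left_record, right_record), numbering from 1.
    If one input is shorter, its missing records are reported as None.
    """
    pairs = zip_longest(left_lines, right_lines)
    for number, (left, right) in enumerate(pairs, start=1):
        yield number, left, right


def diff_map(number, left, right):
    """Map step: emit (record_number, diff) only when the records differ."""
    if left != right:
        yield number, (left, right)


def compare(left_lines, right_lines):
    """Drive the map step over every record pair and collect its output."""
    output = []
    for number, left, right in paired_records(left_lines, right_lines):
        output.extend(diff_map(number, left, right))
    return output


if __name__ == "__main__":
    for key, value in compare(["a", "b", "c"], ["a", "x", "c", "d"]):
        print(key, value)
```

Here the key is the record number and the value is the pair of differing records, so identical records produce no output at all.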

Thanks
Prashant
