hadoop-common-user mailing list archives

From Dieter Plaetinck <die...@plaetinck.be>
Subject Re: how to implements the 'diff' cmd in hadoop
Date Tue, 20 Mar 2012 11:33:29 GMT
The "diff command on linux" (i.e. GNU diffutils) is far more involved than this.
It can match up sections that sit at different line numbers. For example, if you copy a text
file, then delete or add some lines in arbitrary places and compare the two, diff will
detect exactly those changes, whereas this crude logic will give a lot of false positives.
The diff logic is hard to map onto (and hence IMHO doesn't fit) the M/R paradigm.
But what's the bigger picture here? Usually you would run diff on files created by humans
(source code, notes, etc.), i.e. files that can easily be diffed on a single machine.
If your files are that huge, they are probably generated by software, which means you
can do more appropriate things than diffing output files.

Dieter
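
The false-positive point above can be illustrated with a small sketch in Python. It is not part of the original message: `positional_mismatches` and `diff_changes` are hypothetical helper names, and the standard-library `difflib` module stands in for GNU diff (it is a different implementation, but it also aligns lines that have shifted position).

```python
import difflib


def positional_mismatches(a, b):
    """Compare line i of a with line i of b; return the 1-based line numbers that differ."""
    mismatched = []
    for i in range(max(len(a), len(b))):
        left = a[i] if i < len(a) else None   # None marks a missing line
        right = b[i] if i < len(b) else None
        if left != right:
            mismatched.append(i + 1)
    return mismatched


def diff_changes(a, b):
    """Return only the removed ("- ") and added ("+ ") lines reported by difflib."""
    return [line for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ "))]


original = ["alpha", "beta", "gamma", "delta", "epsilon"]
edited = ["alpha", "NEW", "beta", "gamma", "delta", "epsilon"]  # one line inserted

print(positional_mismatches(original, edited))  # [2, 3, 4, 5, 6]
print(diff_changes(original, edited))           # ['+ NEW']
```

A single inserted line shifts everything after it, so the per-line-number comparison flags five lines while the alignment-based diff reports one change.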


On Tue, 20 Mar 2012 16:43:06 +0530
Bejoy Ks <bejoy.hadoop@gmail.com> wrote:

> Yes, if you are comparing more than two files, the mapper needs to emit
> the file name/id. If there are just two files and you only want to know
> which lines do not match, the line number alone is enough. But if you want
> more granular info, such as the exact changes and which files they are in,
> then prefix the value from the mapper with something like the file name.
> 
> Regards
> Bejoy KS
> 
> 2012/3/20 botma lin <linjfly@gmail.com>
> 
> > Thanks Bejoy, that makes sense.
> >
> > If I want to know which file a differing record came from, I need to
> > put an extra file id into the mapper's output value, then read it in the
> > reducer.
> >
> > Do you have any other ideas?
> >
> > Thanks!
> >
> >
> > On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
> >
> > > Hi Lin
> > > In your mapper, make the line number the key and the line contents
> > > the value. In your reducer, check whether the two values for a key
> > > match, i.e. if you are comparing two files there will be two values
> > > for each line number. If they do not match, increment a counter to
> > > track the number of mismatches and write the mismatching lines to the
> > > output file. If the values match for a key, do nothing; there is no
> > > need even to write to the output dir.
> > >
> > > Regards
> > > Bejoy KS
> > >
> > > On Tue, Mar 20, 2012 at 2:01 PM, botma lin <linjfly@gmail.com> wrote:
> > >
> > > > Hi, all
> > > >
> > > > I'm a newbie to Hadoop.
> > > >
> > > > I'm trying to compare two large files and get the differences between
> > > > them, like the diff cmd in Linux.
> > > > However, the mapred API can only get one record at a time, so how can I
> > > > get the related records in two files and compare them using the mapred API?
> > > >
> > > > Thanks!
> > > >
> > >
> >
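
The line-number scheme described in the quoted messages (line number as key, line contents tagged with a file id as value, reducer emits only mismatches) can be sketched without a cluster. This is a plain-Python simulation of the map, shuffle and reduce phases, not Hadoop API code; the names `map_file`, `shuffle`, `reduce_line` and `run_diff` are hypothetical. It also assumes the mapper knows each line's number, which Hadoop's default TextInputFormat does not provide (its key is a byte offset), so a real job would need to derive line numbers some other way.

```python
from collections import defaultdict


def map_file(file_id, lines):
    """Mapper: key is the 1-based line number, value is (file id, line contents)."""
    for line_no, text in enumerate(lines, start=1):
        yield line_no, (file_id, text)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_line(values, num_files):
    """Reducer: return the tagged values if the files disagree, else None."""
    texts = {text for _, text in values}
    if len(values) == num_files and len(texts) == 1:
        return None  # every file has the same content on this line
    return sorted(values)


def run_diff(files):
    """Run the simulated job over {file_id: list of lines}."""
    pairs = [kv for fid, lines in files.items() for kv in map_file(fid, lines)]
    mismatches = {}
    for line_no, values in sorted(shuffle(pairs).items()):
        result = reduce_line(values, len(files))
        if result is not None:
            mismatches[line_no] = result
    return len(mismatches), mismatches  # the count plays the role of the counter


files = {"a.txt": ["x", "y", "z"], "b.txt": ["x", "Y", "z", "w"]}
count, mismatches = run_diff(files)
print(count)
for line_no, values in mismatches.items():
    print(line_no, values)
```

Tagging each value with its file id is what lets the reducer report where a differing line came from, and it also makes a line that exists in only one file show up as a mismatch.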

