hadoop-common-user mailing list archives

From Kumar Kandasami <kumaravel.kandas...@gmail.com>
Subject Re: Comparing two logs, finding missing records
Date Sun, 26 Jun 2011 05:34:59 GMT
Mark -

  A thought around accomplishing this as a MapReduce job - if you add the
datasource information in the mapper phase, with the record id as the key,
then in the reducer phase you can look for record ids that are missing a
datasource and print those record ids.

Driver Code:

          MultipleInputs.addInputPath(conf, log1path, InputFormat,
          MultipleInputs.addInputPath(conf, log2path, InputFormat,

Mapper Phase -

          Output - Key is the record id; value contains the datasource in
addition to the other values.
          Logic - add the datasource information to the record.

Reduce Phase -

          Output - Print each record id that is missing either the log1 or
the log2 datasource value.
          Logic - add to the output only records that do not have both the
log1 and log2 datasources.
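
The mapper and reducer logic above can be sketched without Hadoop as a plain
Java simulation. This is only an illustration under stated assumptions: the
class name MissingRecords, the method names, the "log1"/"log2" tags, and the
comma-separated layout with the record id in the first field are all
hypothetical and not taken from the actual logs or job. In a real job the
grouping by key would be done by the MapReduce shuffle rather than by an
in-memory map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java simulation of the tag-and-group idea; no Hadoop classes are used.
final class MissingRecords {

    // "Map" step plus a stand-in for the shuffle: for every line, emit
    // (recordId, datasource) and collect the tags seen for each record id.
    // Assumption: the record id is the first comma-separated field.
    static void mapAndGroup(List<String> lines, String datasource,
                            Map<String, Set<String>> grouped) {
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String recordId = line.split(",", 2)[0].trim();
            grouped.computeIfAbsent(recordId, k -> new TreeSet<>()).add(datasource);
        }
    }

    // "Reduce" step: keep only record ids that lack one of the two tags.
    // The value names the log in which the record id was NOT found.
    static Map<String, String> reduce(Map<String, Set<String>> grouped) {
        Map<String, String> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            Set<String> sources = e.getValue();
            if (!sources.contains("log1")) {
                missing.put(e.getKey(), "log1");
            } else if (!sources.contains("log2")) {
                missing.put(e.getKey(), "log2");
            }
        }
        return missing;
    }

    // Runs both steps over the two logs and returns recordId -> missing log.
    static Map<String, String> findMissing(List<String> log1, List<String> log2) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        mapAndGroup(log1, "log1", grouped);
        mapAndGroup(log2, "log2", grouped);
        return reduce(grouped);
    }

    public static void main(String[] args) {
        List<String> log1 = Arrays.asList("r1,a", "r2,b", "r3,c");
        List<String> log2 = Arrays.asList("r1,x", "r3,y");
        System.out.println(findMissing(log1, log2)); // prints {r2=log2}
    }
}
```

Collecting the datasource tags into a set per record id means the reducer
only has to check which tag is absent, so duplicate lines for the same
record id in one log do not affect the result.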

Kumar    _/|\_

On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner <markkerzner@gmail.com> wrote:

> Hi,
> I have two logs which should have all the records for the same record_id;
> in other words, if this record_id is found in the first log, it should also be
> found in the second one. However, I suspect that the second log is filtered
> out, and I need to find the missing records. Anything is allowed: MapReduce
> job, Hive, Pig, and even a NoSQL database.
> Thank you.
> It is also a good time to express my thanks to all the members of the group
> who are always very helpful.
> Sincerely,
> Mark
