hadoop-mapreduce-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: timeseries merge/join question
Date Wed, 04 Nov 2009 17:18:45 GMT
I don't know if it's the best way, but it's *a* way:

1. Map: take dataset A and build a sparse index into it (first
timestamp in every split will do).  Reduce: Write the pairs
(timestamp, offset) into a lookup file.

2. Map: every mapper works on a chunk of B. Load the lookup file from
step one (it may make sense to use something like DistributedCache to
prep your nodes and avoid contention on this small file). For every
timestamp in B, find the largest timestamp in the lookup that is <= the
timestamp from B, and seek to the corresponding offset in A. Scan
forward from that point until you see a timestamp > the one in B; the
previous record has your answer. Get the next record from B and repeat
(you can choose to keep scanning from where you are in A, or use the
index to jump if there is a big gap between B's timestamps).
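
To make the two steps concrete, here is a rough single-machine sketch in
Python. It is not a MapReduce job: "splits" are simulated by indexing
every N-th record, the record format "timestamp,value" with integer
timestamps is an assumption, and the names build_sparse_index and
asof_join are made up for illustration. It follows the "timestamp <="
rule from the question and always jumps via the index rather than
continuing the scan.

```python
import bisect
import io


def build_sparse_index(a_file, every=2):
    """Step 1: collect (timestamp, byte offset) for every `every`-th record of A.

    Assumes A is sorted by time and each record is b"timestamp,value\n".
    """
    a_file.seek(0)
    index = []
    offset = 0
    for n, line in enumerate(iter(a_file.readline, b"")):
        if n % every == 0:
            ts = int(line.split(b",", 1)[0])
            index.append((ts, offset))
        offset += len(line)
    return index


def asof_join(b_records, a_file, index):
    """Step 2: for each (ts_b, val_b) yield (ts_b, val_b, val_a), where val_a
    comes from the last record in A with timestamp <= ts_b (None if no such
    record exists)."""
    index_ts = [ts for ts, _ in index]
    for ts_b, val_b in b_records:
        # Largest indexed timestamp that is <= ts_b.
        i = bisect.bisect_right(index_ts, ts_b) - 1
        if i < 0:
            yield (ts_b, val_b, None)
            continue
        a_file.seek(index[i][1])
        prev = None
        # Scan forward until A's timestamp passes ts_b; keep the previous value.
        for line in iter(a_file.readline, b""):
            ts_a, val_a = line.rstrip(b"\n").split(b",", 1)
            if int(ts_a) > ts_b:
                break
            prev = val_a.decode()
        yield (ts_b, val_b, prev)


if __name__ == "__main__":
    events1 = io.BytesIO(b"1,event1_value1\n3,event1_value2\n")
    events2 = [(2, "event2_value1"), (3, "event2_value2"), (4, "event2_value3")]
    lookup = build_sparse_index(events1, every=1)
    for row in asof_join(events2, events1, lookup):
        print(row)
```

In a real job the index would be built per input split and shipped to
the mappers, and a mapper would probably keep scanning A from its
current position when consecutive B timestamps are close together.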

This is essentially how Pig does its skewed join, with some simplifications.


On Wed, Nov 4, 2009 at 11:53 AM, Calvin <calvin.lists@gmail.com> wrote:
> Hey all,
> I am trying to figure out the best way to approach some joining/merging
> computation in a map-reduce / hbase framework.  I have the following large
> timeseries datasets (key/value pairs keyed and sorted by time):
> Events1:
> t1, event1_value1
> t3, event1_value2
> ...
> Events2:
> t2, event2_value1
> t3, event2_value2
> t4, event2_value3
> ...
> Currently, I am just storing these as flat files in HDFS but I have no
> problems throwing them into HBase tables.  I am trying to do an operation
> like the following: for every event in Events2, find and join with the event
> that immediately precedes (timestamp <=) this event in table Events1.
> This operation would result in something like:
> JoinedEvents:
> t2, event2_value1, event1_value1
> t3, event2_value2, event1_value2
> t4, event2_value3, event1_value2
> etc.
> What is the best way to go about this in Hadoop?
> Thanks in advance for the help.
