hadoop-mapreduce-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: timeseries merge/join question
Date Thu, 05 Nov 2009 17:51:58 GMT
Jason,

I am not sure that the join package would work in this particular case.

The javadoc says "Given a set of sorted datasets keyed with the same
class and yielding equal partitions, it is possible to effect a join
of those datasets prior to the map" -- Calvin's datasets are not
necessarily equally partitioned.

Moreover, mapred.join assumes equijoins (equality can be modified by
providing a key comparator).  Calvin's use case is "join to the record
before the equal record" which requires more context than a Comparator
sees. Are you suggesting that one could use a custom JoinRecordReader?
I suspect that even so, there is an edge problem -- if the first records
on a given partition have equal timestamps, the record you want is the
last record of the previous partition.
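
To make the edge problem concrete, here is a rough sketch in Python rather than against the Hadoop API. It assumes each partition is a list of (timestamp, value) pairs sorted by timestamp; the function name and the "carry" argument (the last record of the previous partition) are made up for illustration and are not part of mapred.join.

```python
from bisect import bisect_left

def join_to_previous(left, right, carry=None):
    """Join each left record to the right record strictly before its timestamp.

    left, right: lists of (timestamp, value), sorted by timestamp.
    carry: hypothetical hint holding the last right-side record of the
           previous partition, or None if it is not available.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, value in left:
        # Index of the first right record with timestamp >= ts.
        i = bisect_left(right_ts, ts)
        # The record before it is the match; at index 0 there is none
        # inside this partition, so fall back to the carried record.
        prev = right[i - 1] if i > 0 else carry
        joined.append((ts, value, prev))
    return joined
```

Without the carried record, the first left record of a partition whose timestamp equals the first right timestamp gets no match, even though the match exists at the end of the previous partition.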

-Dmitriy

On Thu, Nov 5, 2009 at 6:20 AM, Jason Venner <jason.hadoop@gmail.com> wrote:
> org.apache.hadoop.mapred.join in Hadoop 0.19
>
> On Wed, Nov 4, 2009 at 10:20 AM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>>
>> > This is essentially how Pig does its skewed join, with some
>> > simplifications.
>>
>> Um, I meant merge join :-).
>
>
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
>
