hadoop-mapreduce-dev mailing list archives

From "Paul Burkhardt (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-2070) Cartesian product file split
Date Wed, 15 Sep 2010 23:04:35 GMT
Cartesian product file split

                 Key: MAPREDUCE-2070
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
    Affects Versions: 0.22.0
            Reporter: Paul Burkhardt
            Priority: Minor

Generates a Cartesian product of file pairs from two directory inputs and enables a RecordReader
to optimally read the split in tuple order, eliminating extraneous read operations.

The new InputFormat generates a split composed of file combinations as tuples. The size of
the split is configurable. A RecordReader employs the convenience class CartesianProductFileSplitReader
to generate file pairs in tuple order. The actual read operations are delegated to the
RecordReader, which must implement the CartesianProductTupleReader interface. An implementor
of a RecordReader can perform file manipulations without restriction and still benefit from
the optimization of tuple ordering.
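
A minimal sketch of the tuple-ordering idea, written without any Hadoop dependency. Every name
in it (TupleOrderSketch, FilePair, makeSplits, leftOpens) is hypothetical and is not taken from
the proposed patch; it only illustrates how enumerating pairs in tuple order lets a reader
reopen the left-hand file only when it changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the API proposed in MAPREDUCE-2070.
final class TupleOrderSketch {

    /** One (left, right) file pair from the Cartesian product. */
    static final class FilePair {
        final String left;
        final String right;

        FilePair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Enumerates left x right in tuple order and cuts it into splits of at most pairsPerSplit pairs. */
    static List<List<FilePair>> makeSplits(List<String> left, List<String> right, int pairsPerSplit) {
        if (pairsPerSplit < 1) {
            throw new IllegalArgumentException("pairsPerSplit must be >= 1");
        }
        List<List<FilePair>> splits = new ArrayList<>();
        List<FilePair> current = new ArrayList<>();
        for (String l : left) {          // outer loop: each left file in turn
            for (String r : right) {     // inner loop: all of its partners, back to back
                current.add(new FilePair(l, r));
                if (current.size() == pairsPerSplit) {
                    splits.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    /** Counts left-file opens for one split when the reader reopens only on a change of left file. */
    static int leftOpens(List<FilePair> split) {
        int opens = 0;
        String open = null;
        for (FilePair p : split) {
            if (!p.left.equals(open)) {
                open = p.left;
                opens++;
            }
        }
        return opens;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a0", "a1");
        List<String> right = Arrays.asList("b0", "b1", "b2", "b3");
        for (List<FilePair> split : makeSplits(left, right, 2)) {
            System.out.println(split.size() + " pairs, " + leftOpens(split) + " left-file open(s)");
        }
    }
}
```

In this sketch a split whose pairs all share one left file needs a single open of that file,
however many right-hand files it is paired with.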

In the Cartesian product of two sets with cardinalities X and Y, reading in tuple order means each
element x of the first set need only be referenced once rather than once per pair, saving X(Y-1)
references. If the Cartesian product is split into subsets of size N (taking N to divide Y), there
are X(Y/N) references instead of XY, a difference of XY(N-1)/N. If every x has the same size s,
this saves reading sXY(N-1)/N bytes.
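
The arithmetic above can be checked with a few lines of code. This is only a sketch: the class
and method names are hypothetical, the sample values in main are made up for illustration, and it
assumes that N divides Y so that Y/N is a whole number.

```java
// Hypothetical helper for the reference-count arithmetic; not part of any Hadoop API.
final class ReferenceSavings {

    /** References to left-set elements when every pair re-reads its x: XY. */
    static long naiveReferences(long x, long y) {
        return x * y;
    }

    /** References when splits of n pairs are read in tuple order: X(Y/N). Assumes n divides y. */
    static long splitReferences(long x, long y, long n) {
        if (n < 1 || y % n != 0) {
            throw new IllegalArgumentException("this sketch assumes n >= 1 and n divides y");
        }
        return x * (y / n);
    }

    /** Bytes not read when every x has size s bytes: sXY(N-1)/N. */
    static long bytesSaved(long s, long x, long y, long n) {
        return s * (naiveReferences(x, y) - splitReferences(x, y, n));
    }

    public static void main(String[] args) {
        long x = 100, y = 50, n = 10;   // illustrative values only
        long s = 64L * 1024 * 1024;     // illustrative element size in bytes
        System.out.println("naive references: " + naiveReferences(x, y));
        System.out.println("split references: " + splitReferences(x, y, n));
        System.out.println("bytes saved:      " + bytesSaved(s, x, y, n));
    }
}
```

Setting N equal to Y recovers the unsplit case, where each x is referenced exactly once and the
saving is X(Y-1) references.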

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
