hadoop-mapreduce-issues mailing list archives

From "Paul Burkhardt (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-2070) Cartesian product file split
Date Wed, 15 Sep 2010 23:06:34 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Burkhardt updated MAPREDUCE-2070:
--------------------------------------

    Attachment: MAPREDUCE-2070

Patched against the trunk.

> Cartesian product file split
> ----------------------------
>
>                 Key: MAPREDUCE-2070
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2070
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.22.0
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: MAPREDUCE-2070
>
>
> Generates the Cartesian product of file pairs from two input directories and enables a
> RecordReader to read each split in tuple order, eliminating redundant read operations.
> The new InputFormat generates splits composed of file combinations as tuples. The split
> size is configurable. A RecordReader uses the convenience class
> CartesianProductFileSplitReader to generate file pairs in tuple order. The actual read
> operations are delegated to the RecordReader, which must implement the
> CartesianProductTupleReader interface. An implementor of a RecordReader can therefore
> manipulate files without restriction and still benefit from the tuple-order optimization.
> In the Cartesian product of two sets with cardinalities X and Y, each element x of the
> first set need only be referenced once rather than Y times, saving X(Y-1) references. If
> the Cartesian product is split into subsets of size N, there are X(Y/N) references instead
> of XY, a difference of XY(N-1)/N. If every x has the same size s in bytes, this saves
> reading sXY(N-1)/N bytes.
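
The reference counts in the description can be checked with a small stand-alone sketch. It is not taken from the attached patch and does not use the Hadoop API: the class TupleOrderSketch and its methods names, cartesianProduct and countXReads are hypothetical names introduced here for illustration. The sketch assumes that N divides Y, so that every split of N pairs shares a single x, and that a reader keeps the current x cached only within one split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from the MAPREDUCE-2070 patch.
class TupleOrderSketch {

    /** Builds placeholder file names such as x0, x1, ... */
    static List<String> names(String prefix, int count) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(prefix + i);
        }
        return result;
    }

    /** Enumerates all (x, y) pairs in tuple order: x in the outer loop, y in the inner loop. */
    static List<String[]> cartesianProduct(List<String> xs, List<String> ys) {
        List<String[]> pairs = new ArrayList<>();
        for (String x : xs) {
            for (String y : ys) {
                pairs.add(new String[] {x, y});
            }
        }
        return pairs;
    }

    /**
     * Counts how many times an x must be read when the pairs are processed in
     * consecutive splits of n pairs and the current x is cached within a split.
     */
    static int countXReads(List<String[]> pairs, int n) {
        int reads = 0;
        for (int start = 0; start < pairs.size(); start += n) {
            String cached = null; // assumption: nothing is cached across splits
            int end = Math.min(start + n, pairs.size());
            for (int i = start; i < end; i++) {
                String x = pairs.get(i)[0];
                if (!x.equals(cached)) {
                    reads++;      // x changed, so it has to be read again
                    cached = x;
                }
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        int xCount = 4;  // X
        int yCount = 6;  // Y
        int n = 3;       // N, chosen so that N divides Y
        List<String[]> pairs = cartesianProduct(names("x", xCount), names("y", yCount));

        int naive = pairs.size();             // XY references without caching
        int cached = countXReads(pairs, n);   // X(Y/N) references in tuple order
        System.out.println("references without caching: " + naive);
        System.out.println("references in tuple order:  " + cached);
        System.out.println("references saved:           " + (naive - cached));
        System.out.println("XY(N-1)/N from the formula: " + (xCount * yCount * (n - 1) / n));
    }
}
```

Setting n equal to Y in this sketch gives the best case of X references, and n equal to 1 falls back to XY references, matching the two extremes in the description.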

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

