hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
Date Wed, 10 Sep 2008 01:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ]

Chris Douglas commented on HADOOP-4143:

There are several advantages to this.
# Partitioners capable of working with binary data wouldn't need to define ways of extracting
bytes from particular key types
# Where defined, RawComparators can be used without deserializing the key
# There should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as
bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)

And, of course, the disadvantages:
# Results differ if the key contains partition-determining state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to resolve

Ideally, a partitioner could handle either raw or object data depending on its configuration.
Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would
break a *lot* of user code. That said, the interface can be safely deprecated and replaced
with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}
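As a sketch of how a user might extend the proposed class, here is a hypothetical raw partitioner that hashes the serialized key bytes directly; the subclass name and hash function are illustrative only, not part of the proposal:

```java
import java.io.IOException;

// The proposed abstract class, repeated here so the example is self-contained.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical subclass: partitions on a hash of the serialized key bytes,
// never touching the deserialized key object.
class BytesHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public boolean isRaw() { return true; }

  @Override
  public int getPartition(K key, V value, int numPartitions) {
    throw new UnsupportedOperationException("Raw-only partitioner");
  }

  @Override
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions) {
    // Simple polynomial hash over the serialized bytes, masked non-negative.
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + keyBytes[i];
    }
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}
```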

A wrapper would be trivial to write, could be a final class, etc., providing backwards compatibility
for a few releases. To support records larger than the serialization buffer, there are
at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane
set of partitioners, serializers, and value types, this should be compatible with the existing
behavior.
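The backwards-compatibility wrapper mentioned above might look roughly like this. This is only a sketch: {{OldPartitioner}} stands in for the deprecated interface (the real one also extends {{JobConfigurable}}, omitted here), and {{PartitionerWrapper}} is an illustrative name:

```java
import java.io.IOException;

// Condensed copy of the proposed abstract class, for self-containment.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Stand-in for the deprecated Partitioner interface.
interface OldPartitioner<K, V> {
  int getPartition(K key, V value, int numPartitions);
}

// The final wrapper: adapts a deprecated-interface partitioner to the new
// abstract class. isRaw() stays false, so the framework would always hand
// the wrapped partitioner deserialized key/value objects.
final class PartitionerWrapper<K, V> extends Partitioner<K, V> {
  private final OldPartitioner<K, V> delegate;

  PartitionerWrapper(OldPartitioner<K, V> delegate) {
    this.delegate = delegate;
  }

  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return delegate.getPartition(key, value, numPartitions);
  }
}
```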

Modifications to {{MapTask}} would be minimal. A final boolean can determine which overloaded
{{getPartition}} is called, and there would be the backwards-compatible wrapping of the deprecated
{{Partitioner}} interface.
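To make the dispatch concrete, here is a rough sketch of the raw/object decision: the boolean is fixed once at setup, and each record takes a single branch. All class and method names besides the proposed {{Partitioner}} API are illustrative; this is not actual {{MapTask}} code:

```java
import java.io.IOException;

// Condensed copy of the proposed abstract class, for self-containment.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical collect-path helper: decides raw vs. object dispatch once.
class CollectSketch<K, V> {
  private final Partitioner<K, V> partitioner;
  private final boolean raw;          // decided once, in a final field
  private final int numPartitions;

  CollectSketch(Partitioner<K, V> partitioner, int numPartitions) {
    this.partitioner = partitioner;
    this.raw = partitioner.isRaw();
    this.numPartitions = numPartitions;
  }

  // keyBytes is the key as already serialized into the output buffer.
  int partitionFor(K key, V value, byte[] keyBytes) {
    try {
      return raw
          ? partitioner.getPartition(keyBytes, 0, keyBytes.length, numPartitions)
          : partitioner.getPartition(key, value, numPartitions);
    } catch (IOException e) {
      throw new RuntimeException(e);  // sketch only; real code would propagate
    }
  }
}
```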

Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative
to a set of keys, and particularly helpful in keeping the API for binary partitioners sane.

> Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-4143
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4143
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
> For some partitioners (particularly those using comparators to classify keys), it would
> be helpful if one could specify a "raw" partitioner that would receive the serialized version
> of the key rather than the object emitted from the map.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
