hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4143) Support for a "raw" Partitioner that partitions based on the serialized key and not record objects
Date Wed, 10 Sep 2008 01:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677 ]

Chris Douglas commented on HADOOP-4143:
---------------------------------------

There are several advantages to this.
# Partitioners (capable of) working with binary data wouldn't need to define ways of extracting
bytes from particular key types
# Where defined, RawComparators can be used without serializing the key
# Should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or emitted as
bytes from the map can use a partitioner that understands the underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following proposal)

And, of course, the disadvantages:
# Different results if the key's partition depends on state that is not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to resolve

Ideally, a partitioner could handle either raw or object data depending on its configuration.
Unfortunately, {{Partitioner}} is an interface, so adding a {{boolean isRaw()}} method would
break a *lot* of user code. That said, the interface can be safely deprecated and replaced
with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}
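To make the dispatch concrete, here is a hedged sketch of what a raw subclass might look like. {{BinaryHashPartitioner}} and its hash are illustrative names only, not part of the proposal; the stand-in {{Partitioner}} class mirrors the sketch above so the example compiles on its own.

```java
import java.io.IOException;

// Minimal stand-in for the proposed abstract Partitioner (mirrors the proposal).
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical raw partitioner: hashes the serialized key bytes directly,
// never deserializing the key object. Class name and hash are illustrative.
class BinaryHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    // A raw-only partitioner never sees deserialized objects.
    throw new UnsupportedOperationException("Raw-only partitioner");
  }

  @Override
  public boolean isRaw() { return true; }

  @Override
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + keyBytes[i];
    }
    // Mask off the sign bit so the partition index is non-negative.
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}
```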

A wrapper would be trivial to write (it could be a final class, etc.), providing backwards
compatibility for a few releases. To support records larger than the serialization buffer,
there are at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value. For a sane
set of partitioners, serializers, and value types, this should be compatible with the existing
semantics.
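The wrapper mentioned above might look roughly like this. {{OldPartitioner}} and {{DeprecatedPartitionerWrapper}} are illustrative names standing in for the deprecated interface and its adapter; the abstract class is a minimal stand-in for the proposal so the sketch compiles on its own.

```java
// Stand-in for the deprecated Partitioner interface (name illustrative).
interface OldPartitioner<K, V> {
  int getPartition(K key, V value, int numPartitions);
}

// Minimal stand-in for the proposed abstract class.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
}

// Final wrapper delegating to an old-style partitioner. isRaw() stays false,
// since interface-based partitioners always expect deserialized objects.
final class DeprecatedPartitionerWrapper<K, V> extends Partitioner<K, V> {
  private final OldPartitioner<K, V> delegate;

  DeprecatedPartitionerWrapper(OldPartitioner<K, V> delegate) {
    this.delegate = delegate;
  }

  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return delegate.getPartition(key, value, numPartitions);
  }
}
```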

Modifications to {{MapTask}} would be minimal: a final boolean can determine which overloaded
{{getPartition}} is called, and the deprecated Partitioner interface would get the
backwards-compatible wrapper described above.
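The dispatch in {{MapTask}} could be sketched as below. {{CollectorSketch}} and {{partitionFor}} are hypothetical names; the real collector internals differ, and the stand-in {{Partitioner}} again mirrors the proposal so the example stands alone.

```java
import java.io.IOException;

// Minimal stand-in for the proposed abstract Partitioner.
abstract class Partitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public int getPartition(byte[] keyBytes, int offset, int length, int numPartitions)
      throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical sketch of the collector's dispatch. The raw/object decision is
// made once, at construction, via a final field.
class CollectorSketch<K, V> {
  private final Partitioner<K, V> partitioner;
  private final boolean rawPartition;  // final: fixed for the whole task

  CollectorSketch(Partitioner<K, V> partitioner) {
    this.partitioner = partitioner;
    this.rawPartition = partitioner.isRaw();
  }

  int partitionFor(K key, V value, byte[] serializedKey, int numPartitions)
      throws IOException {
    // If raw, partition on the bytes already written to the serialization
    // buffer (after the key, before the value); otherwise use the objects.
    return rawPartition
        ? partitioner.getPartition(serializedKey, 0, serializedKey.length, numPartitions)
        : partitioner.getPartition(key, value, numPartitions);
  }
}
```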

Thoughts? This doesn't solve a general problem, but it's useful for partitioning relative
to a set of keys and particularly helpful in keeping the API to binary partitioners sane.

> Support for a "raw" Partitioner that partitions based on the serialized key and not record
objects
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4143
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4143
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
>
> For some partitioners (particularly those using comparators to classify keys), it would
be helpful if one could specify a "raw" partitioner that would receive the serialized version
of the key rather than the object emitted from the map.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

