hadoop-common-dev mailing list archives

From "Klaas Bosteels (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5979) Streaming partitioner should allow command, not just Java class
Date Tue, 09 Jun 2009 09:19:10 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717622#action_12717622 ]

Klaas Bosteels commented on HADOOP-5979:

bq. But still, the command needs to have an idea of how many partitions there are, isn't it?
Or maybe, you are saying that it's up to the command developer to assume a certain partition
count and implement the command... I agree that it's simple but am not sure whether all use
cases would be covered with this model..

Maybe it doesn't cover every possible use case, but I think it should cover the most common
ones, and in the case of Streaming it might be more important to implement something that's very
simple and easy to use than to make things as general as possible. Personally,
I don't think I have ever implemented a partitioner that couldn't be replaced by a command that
outputs keys which then get hashed to determine the partition number.
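
As a rough sketch of that idea, assuming the command prints one key per record: the class and method names below are made up for illustration, and {{String.hashCode()}} merely stands in for the hash of the actual key type. The arithmetic is the same as in Hadoop's {{HashPartitioner}}.

```java
class HashedKeyPartitioner {

    // Map a key printed by the (hypothetical) partitioner command to a
    // partition. Clearing the sign bit keeps the result non-negative even
    // when hashCode() is negative; the remainder bounds it by numPartitions.
    static int partitionFor(String commandOutputKey, int numPartitions) {
        return (commandOutputKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // Equal keys always land in the same partition.
        for (String key : new String[] {"2009-06", "2009-07", "2009-06"}) {
            System.out.println(key + " -> " + partitionFor(key, numPartitions));
        }
    }
}
```

With this model the command never needs to know the partition count; it only decides which records should end up together by giving them the same key.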

bq. What did you mean by "we wouldn't need any additional reading/writing logic" ? There is
at least that much reading/writing as your code outlined, isn't it?

I meant that {{org.apache.hadoop.streaming.io.InputWriter}} and {{org.apache.hadoop.streaming.io.OutputReader}}
wouldn't have to be extended in any way.

Having said that, extending {{InputWriter}} and {{OutputReader}} is perfectly feasible, so
if you think it's better to work with partition numbers directly we could also implement something
like this:

  public int getPartition(K2 key, V2 value, int numPartitions) {
    if (!ignoreKey) {
      ...
    }
    ...
    return outReader_.readNumber();
  }

This would definitely be more flexible and might also be more efficient in certain cases,
so maybe it is indeed preferable. I guess that a partitioner command would also be a rather
advanced feature anyway, so maybe it's fine to expect a bit more effort from the people who
use it and let it determine the partition number directly.
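
To illustrate what letting the command determine the partition number directly might involve on the reading side, here is a minimal sketch. It assumes the command prints one decimal partition number per line; {{readPartition}} is a hypothetical helper, not an existing {{OutputReader}} method, and a {{StringReader}} stands in for the command's output stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class DirectPartitionReader {

    // Hypothetical helper: read one line of the command's output and
    // interpret it as a partition number, rejecting out-of-range values.
    static int readPartition(BufferedReader commandOutput, int numPartitions)
            throws IOException {
        String line = commandOutput.readLine();
        if (line == null) {
            throw new IOException("partitioner command produced no output");
        }
        int partition = Integer.parseInt(line.trim());
        if (partition < 0 || partition >= numPartitions) {
            throw new IOException("partition " + partition
                    + " is outside [0, " + numPartitions + ")");
        }
        return partition;
    }

    public static void main(String[] args) throws IOException {
        // Simulated command output: one partition number per record.
        BufferedReader out = new BufferedReader(new StringReader("2\n0\n"));
        System.out.println(readPartition(out, 3));
        System.out.println(readPartition(out, 3));
    }
}
```

The range check matters here because, unlike the hashing approach, the command has to know the partition count and nothing else stops it from returning an invalid number.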

> Streaming partitioner should allow command, not just Java class
> ---------------------------------------------------------------
>                 Key: HADOOP-5979
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5979
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Klaas Bosteels
> Since HADOOP-4842 got committed, Streaming allows both commands and Java classes to be
specified as mapper, reducer, and combiner, but the {{-partitioner}} option is still limited
to Java classes only. Allowing commands to be specified as partitioner as well would greatly
improve the flexibility of Streaming programs.

