hadoop-mapreduce-issues mailing list archives

From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1849) Implement a FlumeJava-like library for operations over parallel collections using Hadoop MapReduce
Date Wed, 09 Jun 2010 23:59:17 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877270#action_12877270 ]

Jake Mannix commented on MAPREDUCE-1849:
----------------------------------------

+1 from this casual observer over from Mahout-land (nobody ever seems to believe me that this
would make Hadoop programmers soooooo much more efficient).

I've written a half-baked, bug-ridden, inefficient version of this several times in the past,
and it would be *so* useful to have done right.

An API that essentially wrapped a SequenceFile<K,V> and let you do things like

  Path dataPath = new Path("hdfs://foo/bar");
  PTable<K,V> data = new PTable<K,V>(dataPath);
  LightWeightMap<K,V,KOUT,VOUT> map = new MyMapper();
  PTable<KOUT,VOUT> transformedData = data.parallelDo(map);

etc. would be awesome.
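
To make the shape of that API concrete, here is a minimal in-memory sketch. It is only an illustration: PTable and LightWeightMap are the names used above, but their signatures are assumptions, SwapMapper is a made-up example mapper, and a plain List stands in for the SequenceFile so that nothing here touches Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical per-record function; the name is from the comment, the signature is assumed. */
interface LightWeightMap<K, V, KOUT, VOUT> {
    Map.Entry<KOUT, VOUT> map(K key, V value);
}

/** In-memory stand-in for a table that would really wrap a SequenceFile<K,V>. */
final class PTable<K, V> {
    private final List<Map.Entry<K, V>> records;

    PTable(List<Map.Entry<K, V>> records) {
        this.records = new ArrayList<>(records);
    }

    /** Applies fn to every record and returns the results as a new table. */
    <KOUT, VOUT> PTable<KOUT, VOUT> parallelDo(LightWeightMap<K, V, KOUT, VOUT> fn) {
        List<Map.Entry<KOUT, VOUT>> out = new ArrayList<>();
        for (Map.Entry<K, V> record : records) {
            out.add(fn.map(record.getKey(), record.getValue()));
        }
        return new PTable<>(out);
    }

    List<Map.Entry<K, V>> records() {
        return records;
    }
}

/** Made-up example mapper: swaps key and value. */
final class SwapMapper implements LightWeightMap<String, Integer, Integer, String> {
    @Override
    public Map.Entry<Integer, String> map(String key, Integer value) {
        return new SimpleEntry<>(value, key);
    }
}
```

With this in place, data.parallelDo(new SwapMapper()) on a PTable<String,Integer> yields a PTable<Integer,String>, which mirrors the transformedData line above. A real version would launch a map-only job rather than loop in memory.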

Of course, the real trick is writing a good optimizer which can figure out how to squish together
separate M/R steps into one (for example, parallelDo() returns a PCollection, which you might
then do groupByKey() on, but these could often easily be combined into the Map and Reduce
steps of a single job).
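
A rough sketch of that fusion idea, under heavy assumptions: DeferredPipeline and all of its methods are hypothetical, records are plain Strings, and the "job" is simulated in memory. The point is only that parallelDo() can record its step instead of running it, so that groupByKey() can later run every recorded step inside the Map phase of one job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical deferred pipeline: parallelDo records a step instead of running it. */
final class DeferredPipeline {
    /** Per-record steps recorded so far, in call order. */
    private final List<Function<String, String>> mapSteps = new ArrayList<>();

    DeferredPipeline parallelDo(Function<String, String> step) {
        mapSteps.add(step);
        return this;
    }

    /** Unoptimized plan (assumed): one map-only job per step, plus the grouping job. */
    int naiveJobCount() {
        return mapSteps.size() + 1;
    }

    /** Optimized plan: every recorded step is folded into the grouping job's Map phase. */
    int fusedJobCount() {
        return 1;
    }

    /** Composes all recorded steps into a single mapper function. */
    Function<String, String> fusedMapper() {
        Function<String, String> fused = Function.identity();
        for (Function<String, String> step : mapSteps) {
            fused = fused.andThen(step);
        }
        return fused;
    }

    /** Simulates the single fused job: Map runs the fused steps, Reduce collects by key. */
    Map<String, List<String>> groupByKey(List<String> input, Function<String, String> keyOf) {
        Function<String, String> mapper = fusedMapper();
        Map<String, List<String>> groups = new TreeMap<>();
        for (String record : input) {
            String value = mapper.apply(record);
            groups.computeIfAbsent(keyOf.apply(value), k -> new ArrayList<>()).add(value);
        }
        return groups;
    }
}
```

Chaining two parallelDo() calls and then groupByKey() here reports naiveJobCount() of 3 but fusedJobCount() of 1. A real optimizer would have to work on a graph of operations with several inputs and outputs, not a linear chain like this one.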

> Implement a FlumeJava-like library for operations over parallel collections using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at http://portal.acm.org/citation.cfm?id=1806596.1806638.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

