crunch-dev mailing list archives

From "Kiyan Ahmadizadeh (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-73) Scrunch applications using PipelineApp do not properly serialize closures to MapReduce tasks.
Date Fri, 21 Sep 2012 20:50:07 GMT
Kiyan Ahmadizadeh created CRUNCH-73:
---------------------------------------

             Summary: Scrunch applications using PipelineApp do not properly serialize closures to MapReduce tasks.
                 Key: CRUNCH-73
                 URL: https://issues.apache.org/jira/browse/CRUNCH-73
             Project: Crunch
          Issue Type: Bug
          Components: Scrunch
    Affects Versions: 0.4.0
            Reporter: Kiyan Ahmadizadeh
            Assignee: Kiyan Ahmadizadeh


One of the great potential advantages of using Scala for writing MapReduce pipelines is the
ability to send side data as part of function closures, rather than through Hadoop Configurations
or the Distributed Cache.  As an absurdly simple example, consider the following Scala PipelineApp
that divides all elements of a numeric PCollection by an arbitrary argument:

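object DivideApp extends PipelineApp {
  // Side data: parsed once on the client and captured by the closure below.
  val divisor = Integer.valueOf(args(0))
  val nums = read(From.textFile("numbers.txt"))
  // From.textFile yields each line as a String, so parse it before dividing.
  val dividedNums = nums.map { n => n.toInt / divisor }
  dividedNums.write(To.textFile("dividedNums"))
  run()
}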

Executing this PipelineApp fails.  MapReduce tasks see divisor as null (or as 0 if divisor
is forced to be a primitive numeric type).  This suggests that the serialization of Scala
function closures is losing the values of the free variables they capture, so those variables
take on their default JVM values inside the tasks.
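
For comparison, the behavior the report expects can be sketched without Hadoop or Crunch. The following is a minimal illustration, not Scrunch's actual serialization path: ClosureRoundTrip, toBytes, fromBytes and makeDivider are hypothetical names, and plain Java serialization stands in for whatever mechanism ships closures to tasks. It assumes Scala 2.12 or later, where function literals are serializable. The closure captures a local value, is written to bytes, read back, and still divides by the captured divisor.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical helper object; not part of Crunch or Scrunch.
object ClosureRoundTrip {
  // Serialize an object to a byte array with plain Java serialization.
  def toBytes(obj: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.toByteArray
  }

  // Read an object back from a byte array and cast it to the expected type.
  def fromBytes[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // The divisor is a method parameter, so the closure captures its value
  // directly instead of reading a field of an enclosing object.
  def makeDivider(divisor: Int): Int => Int = n => n / divisor

  def main(args: Array[String]): Unit = {
    val divide = makeDivider(4)
    val restored = fromBytes[Int => Int](toBytes(divide))
    println(restored(20)) // prints 5
  }
}
```

Because the round trip happens inside a single JVM, this sketch cannot reproduce the failure described above. It only shows what a correctly serialized closure should carry with it: the value of divisor at the time the closure was created.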

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
