spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <>
Subject [jira] [Resolved] (SPARK-27851) Allow for custom BroadcastMode#transform return values
Date Wed, 29 May 2019 01:30:00 GMT


Hyukjin Kwon resolved SPARK-27851.
    Resolution: Invalid

> Allow for custom BroadcastMode#transform return values
> ------------------------------------------------------
>                 Key: SPARK-27851
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 2.4.3
>            Reporter: Marc Arndt
>            Priority: Major
> According to the BroadcastMode API the BroadcastMode#transform methods are allows to
return a result object of an arbitrary type:
> {code:scala}
> /**
>  * Marker trait to identify the shape in which tuples are broadcasted. Typical examples
of this are
>  * identity (tuples remain unchanged) or hashed (tuples are converted into some hash
>  */
> trait BroadcastMode {
>   def transform(rows: Array[InternalRow]): Any
>   def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any
>   def canonicalized: BroadcastMode
> }
> {code}
> When looking at the code which later uses the instantiated BroadcastMode objects in BroadcastExchangeExec
it becomes that this is not really the base. 
> The following lines in BroadcastExchangeExec suggest that only objects of type HashedRelation
and Array[InternalRow] are allowed as a result for the BroadcastMode#transform methods:
> {code:scala}
> // Construct the relation.
> val relation = mode.transform(input, Some(numRows))
> val dataSize = relation match {
>     case map: HashedRelation =>
>         map.estimatedSize
>     case arr: Array[InternalRow] =>
>     case _ =>
>         throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type:
" +
>             relation.getClass.getName)
> }
> {code}
> I believe that this is the only occurrence in the code where the result of the BroadcastMode#transform
method must be either of type HashedRelation or Array[InternalRow]. Therefore to allow for
broader custom implementations of the BroadcastMode I believe it would be a good idea to solve
the calculation of the data size of the broadcast value in an independent way of the used
BroadcastMode implemented.
> One way this could be done is by providing an additional BroadcastMode#calculateDataSize
method, which needs to be implemented by the BroadcastMode implementations. This methods could
then return the required number of bytes for a given broadcast value.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message