spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
Date Mon, 15 Jan 2018 18:01:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326479#comment-16326479
] 

Sean Owen commented on SPARK-23074:
-----------------------------------

Hm, rowNumber requires you to sort the input? I didn't think it did, semantically. The numbering
isn't so meaningful unless the input has a defined ordering, sure, but the same is true of
an RDD. Unless you sort it, the indexing could change when it's evaluated again.

You're not really guaranteed what order you see the data, although in practice, like in your
example, you will get data from things like files in the order you expect.

> Dataframe-ified zipwithindex
> ----------------------------
>
>                 Key: SPARK-23074
>                 URL: https://issues.apache.org/jira/browse/SPARK-23074
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Minor
>              Labels: dataframe, rdd
>
> Would be great to have a daraframe-friendly equivalent of rdd.zipWithIndex():
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
> def dfZipWithIndex(
>   df: DataFrame,
>   offset: Int = 1,
>   colName: String = "id",
>   inFront: Boolean = true
> ) : DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>         (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]())

>         ++ df.schema.fields ++ 
>       (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
>     )
>   ) 
> }
> {code}
> credits: [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message