Date: Mon, 15 Jan 2018 18:01:00 +0000 (UTC)
From: "Sean Owen (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.13130939.1515997505000.3436.1516039260098@Atlassian.JIRA>
In-Reply-To: <JIRA.13130939.1515997505000@Atlassian.JIRA>
References: <JIRA.13130939.1515997505000@Atlassian.JIRA> <JIRA.13130939.1515997505414@jira-lw-us.apache.org>
Subject: [jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326479#comment-16326479 ] 

Sean Owen commented on SPARK-23074:
-----------------------------------

Hm, does rowNumber require you to sort the input? I didn't think it did, semantically. The numbering isn't very meaningful unless the input has a defined ordering, sure, but the same is true of an RDD: unless you sort it, the indexing could change when it's evaluated again.

You're not really guaranteed the order in which you see the data, although in practice, as in your example, you will get data from sources like files in the order you expect.
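
For reference, here is a rough sketch of the two DataFrame-side options touched on here, not a proposal for the actual API. It assumes a local Spark 2.x session; the object name, the toy data, and the "value" and "id" column names are made up for illustration. The first variant uses monotonically_increasing_id(), which gives unique, increasing but non-consecutive IDs; the second uses row_number() over a window with an explicit orderBy, so the numbering is tied to a defined ordering rather than to whatever order the rows happen to arrive in.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Hypothetical object name; everything here is illustrative only.
object RowNumberSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("row-number-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy input with a single made-up column.
    val df = Seq("b", "a", "c").toDF("value")

    // Option 1: unique and increasing IDs, but not consecutive,
    // because the partition ID is encoded in the upper bits.
    val withId = df.withColumn("id", monotonically_increasing_id())

    // Option 2: consecutive 1-based numbering over an explicit ordering.
    // With no partitionBy, all rows are moved to a single partition,
    // so this is only a sketch for small inputs.
    val byValue = Window.orderBy($"value")
    val withRowNumber = df.withColumn("id", row_number().over(byValue))

    withId.show()
    withRowNumber.show()

    spark.stop()
  }
}
```

The second variant is stable across re-evaluation only as long as the ordering column has no ties; with duplicate sort keys the relative numbering of the tied rows can still change.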

> Dataframe-ified zipwithindex
> ----------------------------
>
>                 Key: SPARK-23074
>                 URL: https://issues.apache.org/jira/browse/SPARK-23074
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Minor
>              Labels: dataframe, rdd
>
> Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex():
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
> def dfZipWithIndex(
>   df: DataFrame,
>   offset: Int = 1,
>   colName: String = "id",
>   inFront: Boolean = true
> ) : DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>         (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]()) 
>         ++ df.schema.fields ++ 
>       (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
>     )
>   ) 
> }
> {code}
> credits: [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]

