Date: Mon, 15 Jan 2018 18:01:00 +0000 (UTC)
From: "Sean Owen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-23074) Dataframe-ified zipwithindex

    [ https://issues.apache.org/jira/browse/SPARK-23074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16326479#comment-16326479 ]

Sean Owen commented on SPARK-23074:
-----------------------------------

Hm, rowNumber requires you to sort the input? I didn't think it did, semantically. The numbering isn't so meaningful unless the input has a defined ordering, sure, but the same is true of an RDD. Unless you sort it, the indexing could change when it's evaluated again. You're not really guaranteed what order you see the data, although in practice, like in your example, you will get data from things like files in the order you expect.

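
To illustrate the ordering point above, here is a minimal sketch of an index that is tied to an explicit sort rather than to whatever order the rows happen to arrive in. It assumes the data has a column that can be sorted on. The helper name `withStableIndex`, its parameters, and the sample column `value` are hypothetical and not part of Spark or of the issue. It uses `row_number()` from `org.apache.spark.sql.functions`, the current name for what the comment calls rowNumber.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object StableIndexSketch {

  // Hypothetical helper (not a Spark API): appends a 1-based index column
  // whose values follow an explicit sort on `orderCol`.
  def withStableIndex(df: DataFrame, orderCol: String, colName: String = "id"): DataFrame = {
    // A window without partitionBy pulls all rows into a single partition,
    // so this is only reasonable for modestly sized data.
    val byOrder = Window.orderBy(col(orderCol))
    df.withColumn(colName, row_number().over(byOrder))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stable-index-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data is made up for illustration.
    val df = Seq("b", "a", "c").toDF("value")
    withStableIndex(df, "value").show()

    spark.stop()
  }
}
```

If `orderCol` contains duplicate values, the relative numbering of the tied rows is still not guaranteed to be the same across evaluations, so the sort key should be unique when a reproducible index matters.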
> Dataframe-ified zipwithindex
> ----------------------------
>
> Key: SPARK-23074
> URL: https://issues.apache.org/jira/browse/SPARK-23074
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Ruslan Dautkhanov
> Priority: Minor
> Labels: dataframe, rdd
>
> Would be great to have a dataframe-friendly equivalent of rdd.zipWithIndex():
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.types.{LongType, StructField, StructType}
> import org.apache.spark.sql.Row
> def dfZipWithIndex(
>   df: DataFrame,
>   offset: Int = 1,
>   colName: String = "id",
>   inFront: Boolean = true
> ) : DataFrame = {
>   df.sqlContext.createDataFrame(
>     df.rdd.zipWithIndex.map(ln =>
>       Row.fromSeq(
>         (if (inFront) Seq(ln._2 + offset) else Seq())
>           ++ ln._1.toSeq ++
>         (if (inFront) Seq() else Seq(ln._2 + offset))
>       )
>     ),
>     StructType(
>       (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]())
>         ++ df.schema.fields ++
>       (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
>     )
>   )
> }
> {code}
> credits: [https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex]