Date: Mon, 19 Oct 2015 01:11:38 -0700
Subject: best way to generate per key auto increment numerals after sorting
From: fahad shah
To: user@spark.apache.org

Hi,

I wanted to ask what the best way is to generate per-key auto-increment numbers after sorting. For example:

Raw file:

1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0

Expected output (the last column is the position number after grouping on the first three fields and reverse sorting on the last two values):

1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1

I am using a solution based on groupByKey, but it is running into some issues (possibly a bug in PySpark/Spark?), so I am wondering if there is a better way to achieve this.

My solution:

    A = sc.textFile("train.csv").filter(lambda x: not isHeader(x)).map(split).map(parse_train).filter(lambda x: x is not None)
    B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field), k[0:5])).groupByKey()
    B.map(sort_n_set_position).flatMap(lambda line: line)

where sort_n_set_position iterates over the grouped values, sorts them, and appends the position as the last column.
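
For concreteness, here is a minimal pure-Python sketch of what such a per-group function could look like. It is not the actual sort_n_set_position from my job: the body, the helper rank_within_groups, and the parameter n_key_fields are all hypothetical and only illustrate the intended behaviour without needing a Spark cluster. Note that the last two sample rows differ in their third field, so the positions 2 and 1 shown for them correspond to grouping on the first two fields; the sketch therefore takes the number of key fields as a parameter.

```python
from itertools import groupby


def sort_n_set_position(group):
    """Hypothetical per-group step: take a (key, rows) pair and return
    the rows with a 1-based position appended as the last column."""
    _key, rows = group
    # Reverse sort on the last two values, so the largest pair gets position 1.
    ranked = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
    return [tuple(r) + (pos,) for pos, r in enumerate(ranked, start=1)]


def rank_within_groups(rows, n_key_fields):
    """Local stand-in for groupByKey followed by flatMap (illustration only)."""
    def key(r):
        return tuple(r[:n_key_fields])

    out = []
    # groupby needs its input sorted by the grouping key.
    for k, grp in groupby(sorted(rows, key=key), key=key):
        out.extend(sort_n_set_position((k, grp)))
    return out


rows = [
    (1, "a", "b", "c", 1, 1),
    (1, "a", "b", "d", 0, 0),
    (1, "a", "b", "e", 1, 0),
    (2, "a", "e", "c", 0, 0),
    (2, "a", "f", "d", 1, 0),
]

# Grouping on the first two fields reproduces the positions in the sample.
ranked = rank_within_groups(rows, n_key_fields=2)
for r in ranked:
    print(",".join(str(v) for v in r))
```

On the Spark side, a function of this shape returns a list per group, so it could presumably be passed straight to flatMap on the grouped RDD instead of a map followed by flatMap(lambda line: line).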

Best,
fahad