Date: Wed, 16 Sep 2015 06:39:08 -0700 (MST)
From: robineast
To: dev@spark.apache.org
Message-ID: <1442410748610-14148.post@n3.nabble.com>
In-Reply-To: <1442273765353-14116.post@n3.nabble.com>
Subject: Re: RDD API patterns

I'm not sure the problem is quite as bad as you state. Both sampleByKey and sampleByKeyExact are implemented using a function from StratifiedSamplingUtils, which does one of two things depending on whether the exact implementation is needed. The exact version requires double the number of lines of code (17) compared with the non-exact version, and has to make extra passes over the data to get, for example, the counts per key.

As far as I can see, your problem 2 and sampleByKeyExact are very similar and could be solved the same way. It has been decided that sampleByKeyExact is a widely useful function, so it is provided out of the box as part of the PairRDD API. I don't see any reason why your problem 2 couldn't be provided in the same way as part of the API if there were demand for it.

An alternative design would perhaps be something like an extension to PairRDD, let's call it TwoPassPairRDD, where certain information about the key, e.g. the counts for the key, could be provided along with an Iterable. Both sampleByKeyExact and your problem 2 could then be implemented in a few fewer lines of code.
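
To make the two-pass idea concrete, here is a rough sketch in plain Python rather than Spark. It is not how StratifiedSamplingUtils works; the names two_pass_by_key and exact_sampler are made up for illustration, and the data is assumed to be an in-memory list of (key, value) pairs with fractions between 0 and 1. The first pass collects the counts per key; the second hands each key its count together with an iterator over its values, which is enough to draw an exact-size sample in a single sweep using selection sampling.

```python
import math
import random
from collections import Counter, defaultdict


def two_pass_by_key(pairs, per_key):
    """Hypothetical stand-in for the TwoPassPairRDD idea.

    `pairs` must be re-iterable (e.g. a list) because it is scanned twice.
    `per_key(key, count, values)` receives the key's count and an iterator
    over its values.
    """
    # Pass 1: gather per-key information, here just the counts.
    counts = Counter(key for key, _ in pairs)

    # Pass 2: group the values, then hand each key its count and values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {
        key: per_key(key, counts[key], iter(values))
        for key, values in grouped.items()
    }


def exact_sampler(fractions, rng):
    """Build a per-key function that keeps exactly ceil(fraction * count) values."""

    def per_key(key, count, values):
        need = math.ceil(fractions.get(key, 0.0) * count)
        remaining = count
        chosen = []
        for value in values:
            # Selection sampling: keep this value with probability need / remaining.
            if rng.random() * remaining < need:
                chosen.append(value)
                need -= 1
            remaining -= 1
        return chosen

    return per_key


# Usage: each key yields exactly ceil(fraction * count) values.
pairs = [("a", i) for i in range(4)] + [("b", i) for i in range(8)]
fractions = {"a": 0.5, "b": 0.25}
sample = two_pass_by_key(pairs, exact_sampler(fractions, random.Random(0)))
```

In this in-memory sketch the count is of course just the length of each group; the separate counting pass only mirrors the distributed case, where the values for a key arrive as a stream and the count has to be computed up front.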