spark-reviews mailing list archives

From JoshRosen <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...
Date Wed, 08 Apr 2015 19:20:23 GMT
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1977#discussion_r28003603
  
    --- Diff: python/pyspark/shuffle.py ---
    @@ -367,32 +372,13 @@ def iteritems(self):
     
         def _external_items(self):
             """ Return all partitioned items as iterator """
    -        assert not self.data
    --- End diff --
    
    I noticed that you moved a few of these assertions.  I guess the old assumption was
    that once we've spilled, we'll stop using `data` and only aggregate into `pdata`, given
    that we clear `data` in the first branch of `_spill`.  Why has this assumption changed
    here?  It looks like we do end up writing to `data` again inside of this
    `_external_items` method, but then we end up clearing `data` at the end after the
    iterator has been consumed.
    
    Was this change necessary in order to support iterating multiple times over the merged
    result?  Just want to double-check my understanding here.
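
To make the question concrete, here is a toy sketch of the invariant as I read it. This is not the actual `shuffle.py` code: `ToyMerger`, its `spilled` list (an in-memory stand-in for spill files), the `merge` method, and the integer-sum combiner are all hypothetical, and only the names `data`, `pdata`, `_spill`, and `_external_items` come from the diff. It assumes that `data` is used before the first spill, `pdata` afterwards, and that `_external_items` borrows `data` as scratch space and clears it once the iterator is exhausted.

```python
class ToyMerger:
    """Toy model of the spill invariant; NOT the real pyspark.shuffle code."""

    def __init__(self, partitions=2):
        self.partitions = partitions
        self.data = {}     # in-memory map, used only before the first spill
        self.pdata = []    # per-partition maps, used after the first spill
        self.spilled = []  # hypothetical stand-in for on-disk spill files

    def _new_parts(self):
        return [{} for _ in range(self.partitions)]

    def merge(self, items):
        """Sum values per key (assumed combiner, for illustration only)."""
        for k, v in items:
            if self.pdata:
                target = self.pdata[hash(k) % self.partitions]
            else:
                target = self.data
            target[k] = target.get(k, 0) + v

    def _spill(self):
        if not self.pdata:
            # First branch: partition `data`, "write" it out, then clear it.
            parts = self._new_parts()
            for k, v in self.data.items():
                parts[hash(k) % self.partitions][k] = v
            self.spilled.append(parts)
            self.data.clear()
        else:
            # Later spills only ever touch `pdata`.
            self.spilled.append(self.pdata)
        self.pdata = self._new_parts()

    def _external_items(self):
        """Yield merged items one partition at a time."""
        assert not self.data  # the old invariant still holds on entry
        if any(self.pdata):
            self._spill()
        try:
            for i in range(self.partitions):
                # `data` is written to again here, as per-partition scratch.
                for parts in self.spilled:
                    for k, v in parts[i].items():
                        self.data[k] = self.data.get(k, 0) + v
                for item in self.data.items():
                    yield item
                self.data.clear()
        finally:
            # `data` ends up empty once the iterator is consumed or closed.
            self.data.clear()
```

In this toy version the assertion can stay at the top of `_external_items`, because `data` is only dirty while the generator is running. If the real change needs `data` to be non-empty on entry, that is the part I would like to understand.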

