spark-user mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: coalesce on SchemaRDD in pyspark
Date Fri, 12 Sep 2014 16:23:18 GMT
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller <bmiller1@eecs.berkeley.edu> wrote:
> Hi Davies,
>
> Thanks for the quick fix. I'm sorry to send out a bug report on release day
> - 1.1.0 really is a great release.  I've been running the 1.1 branch for a
> while and there's definitely lots of good stuff.
>
> For the workaround, I think you may have meant:
>
> srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Yes, thanks for the correction.
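As for why the trailing `None` is needed at all: most likely Scala's `coalesce(numPartitions, shuffle)(implicit ord)` compiles down to a single three-argument JVM method, so Py4J can only find a match when all three arguments are supplied — which is consistent with the `Method coalesce([class java.lang.Integer, class java.lang.Boolean]) does not exist` error quoted below. A minimal plain-Python sketch of the arity mismatch (the `FakeJavaSchemaRDD` class is purely illustrative, not the real Py4J proxy):

```python
class FakeJavaSchemaRDD:
    """Illustrative stand-in for the Py4J-proxied Scala SchemaRDD.

    If the Scala side compiles coalesce(numPartitions, shuffle)(implicit ord)
    into one three-parameter JVM method, a two-argument call has no matching
    signature to dispatch to.
    """
    def coalesce(self, num_partitions, shuffle, ordering):
        # The third parameter models the implicit Ordering slot;
        # passing None mirrors the explicit None in the workaround.
        return "coalesced to %d partitions (shuffle=%s)" % (num_partitions, shuffle)

rdd = FakeJavaSchemaRDD()

# Two arguments fail to match the three-parameter signature,
# analogous to the Py4JError in the original report:
try:
    rdd.coalesce(1, False)
except TypeError as e:
    print("no matching signature:", e)

# Supplying None for the ordering slot succeeds:
print(rdd.coalesce(1, False, None))
```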

> Note:
> "_schema_rdd" -> "_jschema_rdd"
> "false" -> "False"
>
> That workaround seems to work fine (I've observed the correct number of
> partitions in the web UI, although I haven't tested it beyond that).
>
> Thanks!
> -Brad
>
> On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu <davies@databricks.com> wrote:
>>
>> This is a bug; I have created an issue to track it:
>> https://issues.apache.org/jira/browse/SPARK-3500
>>
>> Also, there is PR to fix this: https://github.com/apache/spark/pull/2369
>>
>> Before next bugfix release, you can workaround this by:
>>
>> srdd = sqlCtx.jsonRDD(rdd)
>> srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)
>>
>>
>> On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller <bmiller1@eecs.berkeley.edu>
>> wrote:
>> > Hi All,
>> >
>> > I'm having some trouble with the coalesce and repartition functions for
>> > SchemaRDD objects in pyspark.  When I run:
>> >
>> > sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
>> > '{"foo":"baz"}'])).coalesce(1)
>> >
>> > I get this error:
>> >
>> > Py4JError: An error occurred while calling o94.coalesce. Trace:
>> > py4j.Py4JException: Method coalesce([class java.lang.Integer, class
>> > java.lang.Boolean]) does not exist
>> >
>> > For context, I have a dataset stored in a parquet file, and I'm using
>> > SQLContext to make several queries against the data.  I then register
>> > the results of these queries as new tables in the SQLContext.  Unfortunately
>> > each new table has the same number of partitions as the original
>> > (despite
>> > being much smaller).  Hence my interest in coalesce and repartition.
>> >
>> > Has anybody else encountered this bug?  Is there an alternate workflow I
>> > should consider?
>> >
>> > I am running the 1.1.0 binaries released today.
>> >
>> > best,
>> > -Brad
>
>


