spark-dev mailing list archives

From Randy Gelhausen <rgel...@gmail.com>
Subject Re: Spark 1.6.1: Unexpected partition behavior?
Date Sun, 26 Jun 2016 22:38:19 GMT
Sorry, please ignore the above.

I now see that I called coalesce on a different reference than the one I
used to register the table.
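
For illustration, here is a minimal sketch of the corrected pattern. It assumes
Spark 1.6 running in local mode; the sample rows, the `coalesced` name, the
temporary output directory, and the `CoalesceReferenceSketch` object are
hypothetical stand-ins, not part of the original job. The point is that
`coalesce` returns a new DataFrame rather than modifying the one it is called
on, so the coalesced reference is the one that has to be written, registered,
and cached.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CoalesceReferenceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-reference-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in rows; the real tables come from the original job.
    Seq(("2016-06-26 22:00:00", "app-1", "10.0.0.1", "GET /"))
      .toDF("datetime", "node", "source_ip", "log")
      .registerTempTable("web_logs")
    Seq(("host-a", "10.0.0.1"))
      .toDF("node", "address")
      .registerTempTable("nodes")

    // Same query as in the quoted message.
    val enriched_web_logs = sqlContext.sql("""
      select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as source_host, log
      from web_logs
      left outer join (select distinct node, address from nodes) b on source_ip = address
    """)

    // coalesce does not change enriched_web_logs; keep the DataFrame it returns.
    val coalesced = enriched_web_logs.coalesce(1)

    // Hypothetical output location standing in for bucket + "derived/enriched_web_logs".
    val outputDir = Files.createTempDirectory("enriched_web_logs").toString
    coalesced.write.format("parquet").mode("overwrite").save(outputDir)

    // Register and cache the coalesced reference, not the original one.
    coalesced.registerTempTable("enriched_web_logs")
    sqlContext.cacheTable("enriched_web_logs")

    // Compare the partition counts of the two references.
    println(s"original partitions:  ${enriched_web_logs.rdd.partitions.length}")
    println(s"coalesced partitions: ${coalesced.rdd.partitions.length}")

    sc.stop()
  }
}
```

The 200 partitions seen on the original reference most likely come from the
default value of spark.sql.shuffle.partitions, which applies to the shuffle
behind the join.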

On Sun, Jun 26, 2016 at 6:34 PM, Randy Gelhausen <rgelhau@gmail.com> wrote:

> <code>
> val enriched_web_logs = sqlContext.sql("""
> select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as source_host, log
> from web_logs
> left outer join (select distinct node, address from nodes) b on source_ip = address
> """)
>
> enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
> enriched_web_logs.registerTempTable("enriched_web_logs")
> sqlContext.cacheTable("enriched_web_logs")
> </code>
>
> There are only 524 records in the resulting table, and I have explicitly
> attempted to coalesce into 1 partition.
>
> Yet my Spark UI shows 200 (mostly empty) partitions:
> RDD Name: In-memory table enriched_web_logs
>   <http://localhost:4040/storage/rdd?id=86>
> Storage Level: Memory Deserialized 1x Replicated
> Cached Partitions: 200
> Fraction Cached: 100%
> Size in Memory: 22.0 KB
> Size in ExternalBlockStore: 0.0 B
> Size on Disk: 0.0 B
>
> Why would there be 200 partitions despite the coalesce call?
>
