spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Spark SQL: The cached columnar table is not columnar?
Date Thu, 08 Jan 2015 10:40:03 GMT
Hey Xuelin, which data item in the Web UI did you check?

On 1/7/15 5:37 PM, Xuelin Cao wrote:
>
> Hi,
>
> Curious and curious. I'm puzzled by the Spark SQL cached table.
>
> Theoretically, the cached table should be a columnar table, and a query 
> should scan only the columns referenced in my SQL.
>
> However, in my test, the whole table always appears to be scanned, even 
> though I only "select" one column in my SQL.
>
>       Here is my code:
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
>
> sqlContext.jsonFile("/data/ad.json").registerTempTable("adTable")
> sqlContext.cacheTable("adTable")  // The table has > 10 columns
>
> // First run, cache the table into memory
> sqlContext.sql("select * from adTable").collect
>
> // Second run, only one column is used. It should only scan a small
> // fraction of data
> sqlContext.sql("select adId from adTable").collect
> sqlContext.sql("select adId from adTable").collect
> sqlContext.sql("select adId from adTable").collect
>
>         What I found is that every time I run the SQL, the Web UI shows 
> the same total amount of input data: the full size of the table.
>
>         Is anything wrong? My expectation is:
>         1. The cached table is stored as a columnar table
>         2. Since I only need one column in my SQL, the total amount of 
> input data shown in the Web UI should be very small
>
>         But what I found is not the case at all. Why?
>
>         Thanks
>
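
The expectation in the quoted message rests on how a columnar layout differs from a row layout. A minimal sketch, assuming a hypothetical three-column `Ad` record and toy `RowStore` / `ColumnStore` classes (none of these are Spark APIs, and this is not how Spark implements its in-memory columnar cache), counts how many field values each layout reads to answer a `select adId` style query:

```scala
// Toy model only: NOT Spark's in-memory columnar implementation.
// All names here (Ad, RowStore, ColumnStore) are hypothetical.
object ColumnarSketch {
  // Stand-in for one row of a table such as adTable.
  final case class Ad(adId: Long, campaign: String, clicks: Int)

  // Row layout: each record keeps all of its fields together,
  // so reading adId means visiting the whole record.
  final class RowStore(rows: Vector[Ad]) {
    var fieldsTouched = 0L
    def selectAdId(): Vector[Long] = rows.map { r =>
      fieldsTouched += r.productArity // every field of the record
      r.adId
    }
  }

  // Columnar layout: one sequence per column,
  // so a single-column query reads only that sequence.
  final class ColumnStore(rows: Vector[Ad]) {
    private val adIds = rows.map(_.adId)
    private val campaigns = rows.map(_.campaign) // never read by selectAdId
    private val clicks = rows.map(_.clicks)      // never read by selectAdId
    var fieldsTouched = 0L
    def selectAdId(): Vector[Long] = {
      fieldsTouched += adIds.length
      adIds
    }
  }

  def main(args: Array[String]): Unit = {
    val ads = Vector.tabulate(1000)(i => Ad(i.toLong, s"campaign-$i", i % 7))
    val rowStore = new RowStore(ads)
    val colStore = new ColumnStore(ads)
    rowStore.selectAdId()
    colStore.selectAdId()
    println(s"row layout touched ${rowStore.fieldsTouched} field values")
    println(s"columnar layout touched ${colStore.fieldsTouched} field values")
  }
}
```

In this toy model the columnar store touches one value per row while the row store touches every field of every row, which is the difference the quoted expectation assumes. Whether the Web UI figure measures that quantity is a separate question.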

