spark-dev mailing list archives

From Patrick Woody <patrick.woo...@gmail.com>
Subject Re: Lazy casting with Catalyst
Date Sat, 28 Mar 2015 16:26:48 GMT
Hey Cheng,

I didn't mean that Catalyst casting was eager, just that my approaches
thus far seem to have been. Maybe I should give a concrete example?

I have columns A, B, C where B is saved as a String, but I'd like all
references to B to go through a Cast to Decimal regardless of the code used
on the SchemaRDD. So if someone does a min(B), it uses Decimal ordering
instead of String ordering.
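To make the ordering issue concrete, here is a small stdlib-only Scala sketch (the values are hypothetical entries for column B, not taken from the thread) showing why a String min and a Decimal min disagree:

```scala
// Hypothetical values of column B, stored as Strings:
val b = Seq("9", "10", "100")

// Lexicographic (String) ordering: "10" < "100" < "9"
val minAsString = b.min          // "10"

// Numeric (Decimal) ordering after casting each value:
val minAsDecimal = b.map(BigDecimal(_)).min  // 9

println(minAsString)   // prints "10"
println(minAsDecimal)  // prints 9
```

This is why every reference to B needs to go through the cast: an aggregate like min over the raw String column silently picks the lexicographically smallest value.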

One approach I had taken was to select every column with casts applied to
certain ones, but when I then ran a count(literal(1)) on top of that RDD, it
still seemed to bring in the whole row.

Thanks!
-Pat

On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <lian.cs.zju@gmail.com> wrote:

> Hi Pat,
>
> I don't understand what "lazy casting" means here. Why do you think current
> Catalyst casting is "eager"? Casting happens at runtime, and doesn't
> disable column pruning.
>
> Cheng
>
>
> On 3/28/15 11:26 PM, Patrick Woody wrote:
>
>> Hi all,
>>
>> In my application, we take input from Parquet files where BigDecimals are
>> written as Strings to maintain arbitrary precision.
>>
>> I was hoping to convert these back over to Decimal with Unlimited
>> precision, but I'd still like to maintain the Parquet column pruning (all
>> my attempts thus far seem to bring in the whole Row). Is it possible to do
>> this lazily through catalyst?
>>
>> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
>> referenced. Any tips on how to approach this would be appreciated.
>>
>> Thanks!
>> -Pat
>>
>>
>
