hive-user mailing list archives

From Roger Marin <ro...@rogersmarin.com>
Subject Re: Analyze table compute statistics on wide table taking too long
Date Wed, 08 Apr 2015 02:58:20 GMT
Hi Gopal,

Thanks for that.

I'm happy to look into improving the Regex SerDe performance. Any tips on
where I should start looking?

Regards,
Roger
On 08/04/2015 11:44 AM, "Gopal Vijayaraghavan" <gopalv@apache.org> wrote:

>
> > The table also has a large Regex serde.
>
> There are no stats fast paths for Regex SerDe.
>
> The statistics computation is lifting each row into memory, parsing it and
> throwing it away.
>
> Most of your time would be spent in GC (check the GC time millis), due to
> the huge expense of the Regex Serde.
>
> For a direct comparison, you could compute stats while converting the table
> to another format:
>
> set hive.stats.autogather=true;
> create table tmp1 stored as orc as select * from oldtable;
>
> Due to the nature of the columnar SerDes, that ETL would happen in
> parallel with the compute stats off the same stream (i.e. autogather).
>
> That said, I have noticed performance issues with the RegexSerde, but
> haven't bothered to fix it yet - maybe you'd want to take a shot at fixing
> it?
>
>
> Cheers,
> Gopal
>
>
>

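If the approach quoted above is tried, one way to confirm what autogather recorded might look like the sketch below. The table names tmp1 and oldtable come from the quoted example; checking the result with "describe formatted" and the optional column-stats step are assumptions added for illustration and are not stated in the thread.

```sql
-- Gather basic table stats (row count, sizes) as part of the write itself.
set hive.stats.autogather=true;

-- Rewrite the Regex SerDe table as ORC, so the regex parsing cost is paid once here.
create table tmp1 stored as orc as select * from oldtable;

-- Assumption: inspect the recorded stats under "Table Parameters"
-- (for example numRows, rawDataSize, totalSize).
describe formatted tmp1;

-- Assumption: if column-level stats are also needed, compute them against the
-- ORC copy instead of the Regex SerDe table. Older Hive releases may require
-- an explicit column list after "for columns".
analyze table tmp1 compute statistics for columns;
```

Comparing the wall-clock time of the create-table step with the original analyze run would give the direct comparison the quoted message describes.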