orc-dev mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: String stats requirements?
Date Wed, 07 Jun 2017 01:58:31 GMT
> I agree that we want to be able to trim the values. I've seen cases where
>  the String is huge (~100k) and makes the StringStatistics huge. I'd propose
>  that we do something like:

The only concrete consumer of this data outside of ORC readers is probably
partial scan computation of statistics from the footers.

In some cases, I find it better to avoid computing min-max ranges when the
strings exceed a useful length, as keeping the range updated involves a
comparison for every new row.

Long JSON strings or URLs are usually slower to write simply because of this comparison.

So this is a great idea, provided there is an appropriate indication telling
the partial scan reader not to update stats for those columns.
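
To make the idea concrete, here is a rough sketch of a collector that stops
tracking min/max once it sees a string longer than a cutoff, and exposes a flag
that a footer-based reader could check. This is not the ORC writer code: the
class name BoundedStringStats, the maxLength cutoff and the hasRange() flag are
all hypothetical, and it compares with String.compareTo (UTF-16 order, length
in chars) purely for brevity.

```java
// Hypothetical sketch, not part of the ORC API.
final class BoundedStringStats {
    private final int maxLength;       // assumed cutoff, in UTF-16 chars
    private String min;
    private String max;
    private long count;
    private boolean rangeValid = true; // false once any value exceeds the cutoff

    BoundedStringStats(int maxLength) {
        this.maxLength = maxLength;
    }

    void update(String value) {
        count++;
        if (!rangeValid) {
            return; // no per-row comparison once the range has been abandoned
        }
        if (value.length() > maxLength) {
            // Too long to be useful: drop the range and stop comparing.
            rangeValid = false;
            min = null;
            max = null;
            return;
        }
        if (min == null) {
            min = value;
            max = value;
        } else if (value.compareTo(min) < 0) {
            min = value;
        } else if (value.compareTo(max) > 0) {
            max = value;
        }
    }

    // A footer-based reader would check this before trusting min/max.
    boolean hasRange() {
        return rangeValid && min != null;
    }

    String getMin() { return min; }
    String getMax() { return max; }
    long getCount() { return count; }
}
```

With a cutoff of 8, feeding "pear", "apple", "zebra" leaves a valid range of
apple..zebra; one long URL afterwards clears the range, and later rows only
bump the count.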

