hive-user mailing list archives

From Gopal Vijayaraghavan
Subject Re: Hive LIKE predicate. '_' wildcard decrease perfomance
Date Thu, 04 Aug 2016 19:15:16 GMT
> where res_url like ''
> where res_url like '%mts_ru%'
> Why '_' wildcard decrease perfomance?

Because it misses the fast path by just one "_".

ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
suffix and middle.
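
A rough sketch of what such byte-level checks can look like, assuming the column value is available as a UTF-8 byte range (bytes, start, len). The class and method names here are hypothetical and are not Hive's actual code; the point is only that prefix, suffix and middle checks can compare bytes in place without building a String per row.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Hive's implementation.
final class LikeFastPathSketch {

    // Prefix check, e.g. LIKE 'https://%': compare the first pattern.length bytes.
    static boolean startsWith(byte[] bytes, int start, int len, byte[] pattern) {
        return len >= pattern.length && regionEquals(bytes, start, pattern);
    }

    // Suffix check, e.g. LIKE '%.html': compare the last pattern.length bytes.
    static boolean endsWith(byte[] bytes, int start, int len, byte[] pattern) {
        return len >= pattern.length
            && regionEquals(bytes, start + len - pattern.length, pattern);
    }

    // Middle check: naive scan for the pattern anywhere inside the value.
    static boolean contains(byte[] bytes, int start, int len, byte[] pattern) {
        int last = start + len - pattern.length;
        for (int i = start; i <= last; i++) {
            if (regionEquals(bytes, i, pattern)) {
                return true;
            }
        }
        return false;
    }

    // Compares pattern against bytes beginning at offset; the caller
    // guarantees that offset + pattern.length stays inside the value.
    private static boolean regionEquals(byte[] bytes, int offset, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (bytes[offset + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    // Helper for the examples: the pattern is encoded once, not per row.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

None of the three checks allocates anything per row, which is the property the fast path depends on.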

That means "https://%", "%.html", "" will hit the fast path -
which uses StringExpr::equal() which JITs into the following.

In Hive-2.0, you can mix these up too to get "https:%mts%.html" in a

Anything other than these 3 cases becomes a Regex and takes the slow path.
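
To make the split concrete, here is a minimal sketch of how a LIKE pattern could be sorted into the three simple cases or the regex fallback. The enum and method names are made up for illustration, escape handling is simplified (any backslash is sent to the fallback), and the chained form mentioned for Hive-2.0 is not modelled.

```java
// Hypothetical illustration, not Hive's implementation.
final class LikePatternClassifier {

    enum Kind { EXACT, PREFIX, SUFFIX, MIDDLE, REGEX }

    static Kind classify(String like) {
        // A single '_' (or an escape character) is enough to leave the fast path.
        if (like.indexOf('_') >= 0 || like.indexOf('\\') >= 0) {
            return Kind.REGEX;
        }
        int n = like.length();
        boolean leading = n > 0 && like.charAt(0) == '%';
        boolean trailing = n > 1 && like.charAt(n - 1) == '%';
        int from = leading ? 1 : 0;
        int to = trailing ? n - 1 : n;
        // Patterns made only of '%' are left to the fallback in this sketch.
        if (from >= to) {
            return (leading || trailing) ? Kind.REGEX : Kind.EXACT;
        }
        // A '%' between the two ends is not one of the three simple shapes.
        if (like.substring(from, to).indexOf('%') >= 0) {
            return Kind.REGEX;
        }
        if (leading && trailing) {
            return Kind.MIDDLE;   // '%abc%'
        }
        if (trailing) {
            return Kind.PREFIX;   // 'abc%'
        }
        if (leading) {
            return Kind.SUFFIX;   // '%abc'
        }
        return Kind.EXACT;        // no wildcard at all
    }
}
```

With this sketch, classify("%mts%") comes out as MIDDLE while classify("%mts_ru%") comes out as REGEX, which is the one-character difference the question ran into.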

The pattern you mentioned gets rewritten into ".**" and the inner
loop has a new String() as the input to the matcher + matcher.matches() in

I've put in some patches recently which rewrite it into lazy regexes like
".?**", so the regex DFA will be smaller (HIVE-13196).

That improves the case where the pattern is found, but does nothing to
reduce the garbage that new String() creates for the GC.
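
For illustration, a sketch of the slow path as described above: the LIKE pattern is translated to a regex once, and every row is then decoded into a new String before matcher.matches() runs. The class, method names and the translation itself are assumptions of this sketch rather than Hive's code; escape sequences and line terminators are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration, not Hive's implementation.
final class LikeRegexSlowPathSketch {

    // Translates LIKE wildcards: '%' becomes ".*" (or the lazy ".*?"),
    // '_' becomes ".", and every other character is quoted literally.
    static String likeToRegex(String like, boolean lazy) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < like.length(); i++) {
            char c = like.charAt(i);
            if (c == '%') {
                sb.append(lazy ? ".*?" : ".*");
            } else if (c == '_') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    // Per-row check: the byte range is copied and decoded into a fresh
    // String for each row, which is the allocation cost discussed above.
    static boolean matchesRow(Pattern compiled, byte[] bytes, int start, int len) {
        String value = new String(bytes, start, len, StandardCharsets.UTF_8);
        return compiled.matcher(value).matches();
    }
}
```

The Pattern is compiled once per query, so switching between greedy and lazy quantifiers only changes the matching work; the String allocation in matchesRow happens per row either way.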

