hive-dev mailing list archives

From "Eric Hanson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4548) Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%
Date Wed, 22 May 2013 21:44:23 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664572#comment-13664572
] 

Eric Hanson commented on HIVE-4548:
-----------------------------------

It appears that none of the specific characters you are checking for in parseSimplePattern (%,
_, \) can be the lead or trail code unit of a surrogate pair. So I think the code is safe.
Please think this through and add some unit tests that process multi-byte UTF-8 characters,
including characters of 4 bytes (code points U+10000 and above, which are encoded as surrogate
pairs inside a String).

See http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000_to_U.2B10FFFF for a discussion
of surrogate pairs.

See http://en.wikipedia.org/wiki/List_of_Unicode_characters for a list of Unicode characters.
% is 0x0025, _ is 0x005F, and \ is 0x005C. All surrogate pairs have lead surrogates in
the range 0xD800..0xDBFF and trail surrogates in the range 0xDC00..0xDFFF, so none of these
three characters falls in either range.
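
As a rough illustration of the kind of unit test meant here, the following sketch scans a String one UTF-16 code unit at a time, the way a simple pattern parser might, and checks that a 4-byte UTF-8 character on either side of a % does not produce a false hit. The class and method names (SurrogateCheck, isLikeMetaChar, countMetaChars) are made up for this example and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

class SurrogateCheck {
    // True if c is one of the LIKE metacharacters discussed above.
    static boolean isLikeMetaChar(char c) {
        return c == '%' || c == '_' || c == '\\';
    }

    // Counts metacharacters by looking at each UTF-16 code unit in turn.
    static int countMetaChars(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (isLikeMetaChar(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // U+1F600 is outside the BMP: 4 bytes in UTF-8, a surrogate pair in a String.
        String supplementary = new String(Character.toChars(0x1F600));
        // U+20AC is inside the BMP: 3 bytes in UTF-8, a single char in a String.
        String bmp = "\u20AC";

        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length
                + " UTF-8 bytes, " + supplementary.length() + " chars");
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length
                + " UTF-8 bytes, " + bmp.length() + " chars");

        // Neither half of the surrogate pair is mistaken for a metacharacter.
        String pattern = supplementary + "%" + supplementary;
        System.out.println("metacharacters found: " + countMetaChars(pattern));
    }
}
```

Note that a 3-byte character such as U+20AC stays a single char, so only the 4-byte case actually exercises surrogate pairs.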
                
> Speed up vectorized LIKE filter for special cases abc%, %abc and %abc%
> ----------------------------------------------------------------------
>
>                 Key: HIVE-4548
>                 URL: https://issues.apache.org/jira/browse/HIVE-4548
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: vectorization-branch
>            Reporter: Eric Hanson
>            Assignee: Teddy Choi
>            Priority: Minor
>             Fix For: vectorization-branch
>
>         Attachments: HIVE-4548.1-with-benchmark.patch.txt, HIVE-4548.1-without-benchmark.patch.txt,
HIVE-4548.2-with-benchmark.patch.txt, HIVE-4548.2-without-benchmark.patch.txt
>
>
> Speed up vectorized LIKE filter evaluation for the special-case patterns abc%, %abc, and
%abc% (here, abc is just a placeholder for some fixed string).
>   
> Problem: The current vectorized LIKE implementation always calls the standard LIKE function
code in UDFLike.java. But this is pretty expensive. It calls multiple functions and allocates
at least one new object per call. Probably 80% of uses of LIKE are for the simple patterns
abc%, %abc, and %abc%.  These can be implemented much more efficiently.
> Start by speeding up the case for  
>     Column LIKE "abc%"
>   
> The goal would be to minimize expense in the inner loop. Don't use new() in the inner
loop, and write a static function that checks that the prefix of the string matches the LIKE
pattern as efficiently as possible, operating directly on the byte array holding UTF-8-encoded
string data, and avoiding unnecessary additional function calls and if/else logic. Call that
in the inner loop.
> If feasible, consider using a template-driven approach, with an instance of the template
expanded for each of the three cases. Start by doing the abc% (prefix match) case by hand, then
consider templatizing for the other two cases.
> The code is in the "vectorization" branch of the main hive repo.
>   
> Start by checking in the constructor for FilterStringColLikeStringScalar.java if the
pattern is one of the simple special cases. If so, record that, and have the evaluate() method
call a special-case function for each case, i.e. the general case, and each of the 3 special
cases. All the dynamic decision-making would be done once per vector, not once per element.
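
A minimal sketch of the approach described in the issue, for the abc% case only. It is not the code from the attached patches: the names SimpleLikeSketch, PatternType, classify, startsWith and filterPrefix are hypothetical, and the byte[][] / start[] / length[] arrays are a simplified stand-in for the string column vector. The pattern is classified once, and the inner loop is a byte-wise comparison on UTF-8 data with no allocation.

```java
import java.nio.charset.StandardCharsets;

class SimpleLikeSketch {
    enum PatternType { PREFIX, SUFFIX, CONTAINS, GENERAL }

    // Classify the pattern once, e.g. in the filter's constructor.
    static PatternType classify(String pattern) {
        int len = pattern.length();
        boolean startsPct = len > 0 && pattern.charAt(0) == '%';
        boolean endsPct = len > 0 && pattern.charAt(len - 1) == '%';
        int from = startsPct ? 1 : 0;
        int to = endsPct ? len - 1 : len;
        if (to <= from) {
            return PatternType.GENERAL; // "", "%", "%%": no fixed string to match
        }
        for (int i = from; i < to; i++) {
            char c = pattern.charAt(i);
            if (c == '%' || c == '_' || c == '\\') {
                return PatternType.GENERAL; // inner wildcard or escape character
            }
        }
        if (startsPct && endsPct) return PatternType.CONTAINS;
        if (endsPct) return PatternType.PREFIX;
        if (startsPct) return PatternType.SUFFIX;
        return PatternType.GENERAL; // no wildcard; left to the general path in this sketch
    }

    // Byte-wise prefix test on UTF-8 data: no decoding, no allocation.
    static boolean startsWith(byte[] bytes, int start, int len, byte[] prefix) {
        if (len < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (bytes[start + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // One pass over the vector for the PREFIX case. Writes the indices of
    // matching rows into selected and returns how many rows matched.
    static int filterPrefix(byte[][] vector, int[] start, int[] length, int n,
                            byte[] prefix, int[] selected) {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (startsWith(vector[i], start[i], length[i], prefix)) {
                selected[out++] = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String pattern = "abc%";
        PatternType type = classify(pattern); // decided once, not per element
        byte[] prefix = pattern.substring(0, pattern.length() - 1)
                .getBytes(StandardCharsets.UTF_8);

        String[] rows = { "abcdef", "xabc", "abc" };
        byte[][] vector = new byte[rows.length][];
        int[] start = new int[rows.length];
        int[] length = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            vector[i] = rows[i].getBytes(StandardCharsets.UTF_8);
            length[i] = vector[i].length; // start[i] stays 0 in this example
        }

        int[] selected = new int[rows.length];
        int count = (type == PatternType.PREFIX)
                ? filterPrefix(vector, start, length, rows.length, prefix, selected)
                : 0; // other cases omitted from this sketch
        System.out.println(type + " matched " + count + " rows");
    }
}
```

Comparing raw bytes is safe for a prefix match because the prefix is itself a complete UTF-8 sequence, so a byte-for-byte match at the start of a valid UTF-8 string always ends on a character boundary. The suffix and contains cases would follow the same shape with a different comparison function.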

