hive-dev mailing list archives

From "Navis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-664) optimize UDF split
Date Tue, 21 Jan 2014 04:43:20 GMT

    [ https://issues.apache.org/jira/browse/HIVE-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877185#comment-13877185
] 

Navis commented on HIVE-664:
----------------------------

I ran a simple micro-benchmark on splitting alone and found that the patch is not significantly
faster than the current implementation (15% at most?), and is sometimes even slower. Reusing the
previous pattern string still seems like a good idea, though. Furthermore, if the ObjectInspector
(OI) for the regex argument is a constant type, the comparison itself can be skipped.
Could you do that too?
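A minimal sketch of the two ideas above, written against plain java.util.regex rather than Hive's ObjectInspector API. The class name CachedSplitter, the compileCount counter, and the split limit of -1 are assumptions made for illustration and are not taken from the attached patches.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: caches the compiled Pattern between calls.
public class CachedSplitter {
    private final boolean constant; // true when the regex is known up front (the "constant OI" case)
    private String lastRegex;       // pattern string seen on the previous call
    private Pattern lastPattern;    // compiled form of lastRegex
    private int compileCount;       // for illustration only: how often we compiled

    // Regex may change from row to row; compare against the previous string.
    public CachedSplitter() {
        this.constant = false;
    }

    // Regex is constant; compile once and never compare again.
    public CachedSplitter(String constantRegex) {
        this.constant = true;
        this.lastRegex = constantRegex;
        this.lastPattern = Pattern.compile(constantRegex);
        this.compileCount = 1;
    }

    public String[] split(String input, String regex) {
        // In the constant case the regex argument is ignored and the comparison is skipped.
        if (!constant && (lastPattern == null || !regex.equals(lastRegex))) {
            lastPattern = Pattern.compile(regex);
            lastRegex = regex;
            compileCount++;
        }
        // A limit of -1 keeps trailing empty strings (an assumption of this sketch).
        return lastPattern.split(input, -1);
    }

    public int getCompileCount() {
        return compileCount;
    }
}
```

With a non-constant regex the cost per row is one String.equals call instead of a recompilation. With a constant regex even that comparison goes away.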

> optimize UDF split
> ------------------
>
>                 Key: HIVE-664
>                 URL: https://issues.apache.org/jira/browse/HIVE-664
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Namit Jain
>            Assignee: Teddy Choi
>              Labels: optimization
>         Attachments: HIVE-664.1.patch.txt, HIVE-664.2.patch.txt, HIVE-664.3.patch.txt
>
>
> Min Zhou added a comment - 21/Jul/09 07:34 AM
> It's very useful for us.
> Some comments:
>    1. Can you implement it directly with Text? Avoiding string decoding and encoding
> would be faster. Of course, that trick may lead to another problem, as String.split uses a
> regular expression for splitting.
>    2. getDisplayString() always returns a string in lowercase.
> Namit Jain added a comment - 21/Jul/09 09:22 AM
> Committed. Thanks Emil
> Emil Ibrishimov added a comment - 21/Jul/09 10:48 AM
> There are some easy (compromise) ways to optimize split:
> 1. Check whether the regex argument actually contains any "regex-specific characters" and,
> if it doesn't, do a straightforward split without converting to strings.
> 2. Assume some default value for the second argument (for example, split(str) would be
> equivalent to split(str, ' ')) and optimize for this value.
> 3. Have two separate split functions: one that does regex and one that splits around
> plain text.
> I think that 1 is a good choice and can be done rather quickly.
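Option 1 above could look roughly like the following sketch. It works on String for brevity, whereas the suggestion in the thread is to avoid String conversion altogether by working on Text. The class name PlainSplit, its method names, and the set of characters treated as regex-specific are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: plain-text fast path with a regex fallback.
public class PlainSplit {
    // Characters treated as "regex specific" in this sketch:
    // backslash, brackets, parentheses, braces, and . * + ? ^ $ |
    private static final String REGEX_META = "\\[](){}.*+?^$|";

    // True if the delimiter is non-empty and contains no regex metacharacter.
    static boolean isPlainDelimiter(String delim) {
        if (delim.isEmpty()) {
            return false;
        }
        for (int i = 0; i < delim.length(); i++) {
            if (REGEX_META.indexOf(delim.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Literal split using indexOf; keeps empty fields, including trailing ones.
    static List<String> splitPlain(String input, String delim) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = input.indexOf(delim, start)) >= 0) {
            parts.add(input.substring(start, idx));
            start = idx + delim.length();
        }
        parts.add(input.substring(start));
        return parts;
    }

    // Take the fast path when possible, otherwise fall back to the regex split.
    static List<String> split(String input, String delim) {
        if (isPlainDelimiter(delim)) {
            return splitPlain(input, delim);
        }
        return Arrays.asList(input.split(delim, -1));
    }
}
```

The metacharacter check is deliberately conservative: a delimiter that contains any listed character goes through the regex path, even if it would also have worked as a literal.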



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
