hive-dev mailing list archives

From "Anandha L Ranganathan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6140) trim udf is very slow
Date Sat, 11 Jan 2014 20:10:51 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868867#comment-13868867 ]

Anandha L Ranganathan commented on HIVE-6140:
---------------------------------------------

Here is the system configuration:
4 cores, 8 GB RAM.
File format: Text
Compression: NONE

1) select count(l) from letters where l = 'l ';
   Around 100 seconds.

2) select count(l) from letters where trim(l) = 'l';
   230 seconds.

3) I created a GenericUDF function for trim (gentrim), and the result was:
   select count(l) from letters where gentrim(l) = 'l';
   220 seconds.


The evaluate function takes around 1500 nanoseconds to process each record. This per-record
cost accumulates, and the query takes 230 seconds when the UDF is applied to 500M records.
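
As a rough cross-check of the figures above, the per-record cost can be multiplied by the row count. This is only a sketch: it assumes the per-record cost is constant, and the class and method names are illustrative, not part of Hive. The result is cumulative time summed over all tasks, not wall-clock time; the degree of parallelism is not stated in this thread.

```java
public class UdfCostEstimate {
    /**
     * Cumulative time, in seconds, spent in a per-record function
     * across all records (summed over every task, not wall-clock).
     */
    public static double cumulativeSeconds(long nanosPerRecord, long records) {
        return (nanosPerRecord * records) / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        // Figures quoted in this comment: ~1500 ns per record, 500M records.
        double total = cumulativeSeconds(1_500L, 500_000_000L);
        System.out.println(total + " s cumulative"); // prints "750.0 s cumulative"
    }
}
```

The 750 seconds of cumulative time would be consistent with a 230 second query only if the work is spread across parallel tasks.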

This is the code used in evaluate:

        if (arguments[0].get() == null) {
                return null;
        }
       
        input = (Text) converters[0].convert(arguments[0].get());
        input.set(input.toString().trim());
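
The snippet above decodes the Text value to a String, trims it, and encodes it back for every record. Whether that round trip is the actual bottleneck has not been confirmed in this thread. Purely as a sketch of one possible alternative, the UTF-8 bytes could be scanned directly. The class and method names below are hypothetical, and the code works on a plain byte array so that it runs without Hadoop on the classpath. It follows String.trim() semantics (strip characters <= U+0020), which is safe at the byte level because bytes <= 0x20 never occur inside a multi-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteTrimSketch {
    /** Index of the first byte in [start, end) greater than 0x20, or end if there is none. */
    public static int trimStart(byte[] utf8, int start, int end) {
        // Mask to unsigned: Java bytes are signed, so UTF-8 bytes >= 0x80 are negative.
        while (start < end && (utf8[start] & 0xFF) <= 0x20) {
            start++;
        }
        return start;
    }

    /** Exclusive end index after dropping trailing bytes <= 0x20 from [start, end). */
    public static int trimEnd(byte[] utf8, int start, int end) {
        while (end > start && (utf8[end - 1] & 0xFF) <= 0x20) {
            end--;
        }
        return end;
    }

    /** Returns a trimmed copy of the first `length` bytes of `utf8`. */
    public static byte[] trim(byte[] utf8, int length) {
        int from = trimStart(utf8, 0, length);
        int to = trimEnd(utf8, from, length);
        return Arrays.copyOfRange(utf8, from, to);
    }

    public static void main(String[] args) {
        byte[] raw = "l ".getBytes(StandardCharsets.UTF_8);
        String trimmed = new String(trim(raw, raw.length), StandardCharsets.UTF_8);
        System.out.println("[" + trimmed + "]"); // prints "[l]"
    }
}
```

In a real UDF the two offsets could be applied to the backing array that Hadoop's Text exposes through getBytes() and getLength(), avoiding the intermediate String. That variant has not been benchmarked here.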



[~ehans]
I haven't tried the ORC file format yet. I will try it later.

> trim udf is very slow
> ---------------------
>
>                 Key: HIVE-6140
>                 URL: https://issues.apache.org/jira/browse/HIVE-6140
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Thejas M Nair
>            Assignee: Anandha L Ranganathan
>         Attachments: temp.pl
>
>
> Paraphrasing what was reported by [~cartershanklin] -
> I used the attached Perl script to generate 500 million two-character strings which always
> included a space. I loaded it using:
> create table letters (l string); 
> load data local inpath '/home/sandbox/data.csv' overwrite into table letters;
> Then I ran this SQL script:
> select count(l) from letters where l = 'l ';
> select count(l) from letters where trim(l) = 'l';
> First query = 170 seconds
> Second query  = 514 seconds



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
