hive-dev mailing list archives

From "John Omernik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-5506) Hive SPLIT function does not return array correctly
Date Wed, 09 Oct 2013 21:18:44 GMT
John Omernik created HIVE-5506:
----------------------------------

             Summary: Hive SPLIT function does not return array correctly
                 Key: HIVE-5506
                 URL: https://issues.apache.org/jira/browse/HIVE-5506
             Project: Hive
          Issue Type: Bug
          Components: SQL, UDF
    Affects Versions: 0.11.0, 0.10.0, 0.9.0
         Environment: Hive
            Reporter: John Omernik


Hello all, I think I have found a bug in the Hive split function:

Summary: When calling split on a string of data, it returns all array items only if the
last array item has a value. For example, if I have a tab-delimited string with 7 columns
where the first four are filled but the last three are blank, split returns only a
4-element array. If any number of "middle" columns are empty but the last item still has
a value, it returns the proper number of elements. This was tested in Hive 0.9 and
Hive 0.11.

Data:
(Note: \t represents a tab character, \x09. The line endings should be \n (UNIX style); I'm
not sure what email will do to them.) Basically, my data is 7 lines, each holding the first
7 letters of the alphabet separated by tabs. On some lines I've left out certain letters
but kept the number of tabs exactly the same.

input.txt
a\tb\tc\td\te\tf\tg
a\tb\tc\td\te\t\tg
a\tb\t\td\t\tf\tg
\t\t\td\te\tf\tg
a\tb\tc\td\t\t\t
a\t\t\t\te\tf\tg
a\t\t\td\t\t\tg
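
Since email may mangle the literal tabs, here is a minimal sketch for regenerating the file with real tab characters. The names ROWS, build_lines, and write_input are hypothetical helpers, not part of the original report; the rows and the /tmp/input.txt path are taken from this message.

```python
#!/usr/bin/python
# Sketch: recreate input.txt with real tab characters and UNIX line endings.
# ROWS, build_lines and write_input are illustrative names, not from the report.

ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def build_lines():
    """Return the 7 sample rows, each terminated by a single \\n."""
    return [row + "\n" for row in ROWS]


def write_input(path):
    """Write the sample rows to the given path."""
    with open(path, "w") as handle:
        handle.writelines(build_lines())


if __name__ == "__main__":
    write_input("/tmp/input.txt")
```

Every row carries exactly 6 tabs, so a split on tab that keeps empty fields should yield 7 items per row.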

I then created a table with one column from that data:


DROP TABLE tmp_jo_tab_test;
CREATE table tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;
 
LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;

OK, just to validate, I created a Python counting script:

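#!/usr/bin/python

import sys

# Count the tab-separated fields on each line of stdin.
for line in sys.stdin:
    # Strip only the trailing newline; a bare rstrip() would also eat trailing tabs.
    line = line.rstrip("\n")
    out = line.split("\t")
    print(len(out))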

The output there is:
$ cat input.txt |./cnt_tabs.py
7
7
7
7
7
7
7

Based on that information, split on tab should return 7 elements for each line as well:

hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
 
7
7
7
7
4
7
7

However, it does not. The line where only the first four letters are filled in (and the
last three fields are blank) returns only 4 elements, where there should be 7: 4 for the
letters included and 3 blanks.

a\tb\tc\td\t\t\t 
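
One possible explanation, offered as an assumption rather than a confirmed diagnosis: Java's String.split(regex) with the default limit discards trailing empty strings, and if Hive's split delegates to it, the observed counts would follow. The sketch below emulates that behavior in Python for a literal separator (no regex handling); split_drop_trailing_empty is a hypothetical helper, not Hive code.

```python
# Sketch: emulate a split that drops trailing empty fields, as Java's
# String.split(regex) does with its default limit. That Hive behaves this
# way internally is an assumption. The helper name is illustrative only.

ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def split_drop_trailing_empty(line, sep="\t"):
    """Split on a literal separator, then discard trailing empty fields."""
    if sep not in line:
        # No separator found: the whole string is the single field.
        return [line]
    fields = line.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields


# Counts when every field is kept (what the Python script reports).
kept = [len(row.split("\t")) for row in ROWS]
# Counts when trailing empty fields are dropped.
dropped = [len(split_drop_trailing_empty(row)) for row in ROWS]

if __name__ == "__main__":
    print(kept)
    print(dropped)
```

Under this emulation only the fifth row shrinks to 4, which matches the Hive output above; empty leading or middle fields are unaffected.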




--
This message was sent by Atlassian JIRA
(v6.1#6144)
