hive-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5506) Hive SPLIT function does not return array correctly
Date Wed, 23 Oct 2013 03:35:43 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802561#comment-13802561 ]

Hudson commented on HIVE-5506:
------------------------------

FAILURE: Integrated in Hive-trunk-h0.21 #2416 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2416/])
HIVE-5506 : Hive SPLIT function does not return array correctly (Vikram Dixit via Ashutosh
Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534775)
* /hive/trunk/data/files/input.txt
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSplit.java
* /hive/trunk/ql/src/test/queries/clientpositive/split.q
* /hive/trunk/ql/src/test/results/clientpositive/split.q.out
* /hive/trunk/ql/src/test/results/clientpositive/udf_split.q.out


> Hive SPLIT function does not return array correctly
> ---------------------------------------------------
>
>                 Key: HIVE-5506
>                 URL: https://issues.apache.org/jira/browse/HIVE-5506
>             Project: Hive
>          Issue Type: Bug
>          Components: SQL, UDF
>    Affects Versions: 0.9.0, 0.10.0, 0.11.0
>         Environment: Hive
>            Reporter: John Omernik
>            Assignee: Vikram Dixit K
>             Fix For: 0.13.0
>
>         Attachments: HIVE-5506.1.patch, HIVE-5506.2.patch
>
>
> Hello all, I think I have outlined a bug in the Hive split function:
> Summary: When calling split on a string of data, it returns all array items only
> if the last array item has a value. For example, if I have a string of text delimited
> by tabs with 7 columns, and the first four are filled but the last three are blank, split
> returns only a 4-position array. If any number of "middle" columns are empty but the
> last item still has a value, it returns the proper number of columns. This was tested
> in Hive 0.9 and Hive 0.11.
> Data:
> (Note: \t represents a tab character, \x09. The line endings should be \n (UNIX style);
> I'm not sure what email will do to them.) Basically my data is 7 lines, each holding the
> first 7 letters of the alphabet separated by tabs. On some lines I've left out certain
> letters, but kept the number of tabs exactly the same.
> input.txt
> a\tb\tc\td\te\tf\tg
> a\tb\tc\td\te\t\tg
> a\tb\t\td\t\tf\tg
> \t\t\td\te\tf\tg
> a\tb\tc\td\t\t\t
> a\t\t\t\te\tf\tg
> a\t\t\td\t\t\tg
> I then created a table with one column from that data:
> DROP TABLE tmp_jo_tab_test;
> CREATE table tmp_jo_tab_test (message_line STRING)
> STORED AS TEXTFILE;
>  
> LOAD DATA LOCAL INPATH '/tmp/input.txt'
> OVERWRITE INTO TABLE tmp_jo_tab_test;
> OK, just to validate, I created a Python counting script:
> #!/usr/bin/python
>  
> import sys
>  
> for line in sys.stdin:
>     # Strip only the trailing newline so trailing tabs (empty fields) survive.
>     line = line.rstrip("\n")
>     out = line.split("\t")
>     print(len(out))
> The output there is:
> $ cat input.txt |./cnt_tabs.py
> 7
> 7
> 7
> 7
> 7
> 7
> 7
> Based on that information, split on tab should return me 7 for each line as well:
> hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
>  
> 7
> 7
> 7
> 7
> 4
> 7
> 7
> However, it does not. It would appear that the line where only the first four letters
> are filled in (and blanks are passed in for the last three) returns only 4 splits, where
> there should technically be 7: 4 for the letters included, and 3 blanks.
> a\tb\tc\td\t\t\t 
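
The counts reported above (7, 7, 7, 7, 4, 7, 7) are what one would get from a split that discards trailing empty strings. Java's String.split(regex) behaves that way with its default limit, but whether GenericUDFSplit relied on that call is an assumption here; this message only records that the file was changed. The sketch below is a minimal Python illustration using the seven sample lines from the report. The helper name split_drop_trailing_empties is hypothetical and only mimics the observed symptom; it is not Hive code.

```python
# The seven sample lines from the report, with \t as a real tab character.
LINES = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def split_drop_trailing_empties(line, sep="\t"):
    """Hypothetical helper: split, then drop trailing empty fields.

    This mimics the symptom described in the report; it is an
    illustration, not the actual Hive implementation.
    """
    fields = line.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields


if __name__ == "__main__":
    for line in LINES:
        expected = len(line.split("\t"))                  # keeps trailing empties
        observed = len(split_drop_trailing_empties(line)) # drops trailing empties
        print(expected, observed)
```

Only the fifth line differs between the two counts, because it is the only one that ends in empty fields. Empty fields in the middle of a line are kept by both variants.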



--
This message was sent by Atlassian JIRA
(v6.1#6144)
