Date: Tue, 15 Oct 2013 23:19:42 +0000 (UTC)
From: "Vikram Dixit K (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-5506) Hive SPLIT function does not return array correctly

     [ https://issues.apache.org/jira/browse/HIVE-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vikram Dixit K updated HIVE-5506:
---------------------------------
    Attachment: HIVE-5506.1.patch

This should fix this issue.

> Hive SPLIT function does not return array correctly
> ---------------------------------------------------
>
>                 Key: HIVE-5506
>                 URL: https://issues.apache.org/jira/browse/HIVE-5506
>             Project: Hive
>          Issue Type: Bug
>          Components: SQL, UDF
>    Affects Versions: 0.9.0, 0.10.0, 0.11.0
>         Environment: Hive
>            Reporter: John Omernik
>            Assignee: Vikram Dixit K
>         Attachments: HIVE-5506.1.patch
>
>
> Hello all, I think I have outlined a bug in the hive split function:
> Summary: When calling split on a string of data, it will only return all array items if the last array item has a value. For example, if I have a string of text delimited by tab with 7 columns, and the first four are filled, but the last three are blank, split will only return a 4 position array. If any number of "middle" columns are empty, but the last item still has a value, then it will return the proper number of columns. This was tested in Hive 0.9 and Hive 0.11.
> Data:
> (Note: \t represents a tab char (\x09). The line endings should be \n (UNIX style); not sure what email will do to them.) Basically my data is 7 lines of data with the first 7 letters separated by tab. On some lines I've left out certain letters, but kept the number of tabs exactly the same.
> input.txt
> a\tb\tc\td\te\tf\tg
> a\tb\tc\td\te\t\tg
> a\tb\t\td\t\tf\tg
> \t\t\td\te\tf\tg
> a\tb\tc\td\t\t\t
> a\t\t\t\te\tf\tg
> a\t\t\td\t\t\tg
> I then created a table with one column from that data:
> DROP TABLE tmp_jo_tab_test;
> CREATE table tmp_jo_tab_test (message_line STRING)
> STORED AS TEXTFILE;
>
> LOAD DATA LOCAL INPATH '/tmp/input.txt'
> OVERWRITE INTO TABLE tmp_jo_tab_test;
> Ok, just to validate, I created a python counting script:
> #!/usr/bin/python
>
> import sys
>
>
> for line in sys.stdin:
>     line = line[0:-1]
>     out = line.split("\t")
>     print(len(out))
> The output there is:
> $ cat input.txt |./cnt_tabs.py
> 7
> 7
> 7
> 7
> 7
> 7
> 7
> Based on that information, split on tab should return me 7 for each line as well:
> hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
>
> 7
> 7
> 7
> 7
> 4
> 7
> 7
> However it does not. It would appear that the line where only the first four letters are filled in (and blanks are passed in for the last three) only returns 4 splits, where there should technically be 7: 4 for the letters included, and three blanks.
> a\tb\tc\td\t\t\t

--
This message was sent by Atlassian JIRA
(v6.1#6144)
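
A possible explanation for the 4, not confirmed anywhere in this message: Hive is written in Java, and Java's String.split(regex) is documented to discard trailing empty strings unless a negative limit is passed. Assuming Hive's split delegates to that method, the fifth row would lose its three trailing blanks while every other row keeps all 7 fields. The sketch below emulates that rule in Python on the sample rows from the report. The helper java_style_split is hypothetical and written only for illustration; it is not Hive code and not the content of the attached patch.

```python
def java_style_split(line, delimiter="\t"):
    """Emulate Java's String.split(regex) for a literal delimiter.

    Hypothetical helper for illustration only; this is not Hive code.
    """
    if delimiter not in line:
        # Java returns the whole input as a single element when nothing matches.
        return [line]
    parts = line.split(delimiter)
    # With the default limit, Java discards trailing empty strings;
    # leading and middle empty strings are kept.
    while parts and parts[-1] == "":
        parts.pop()
    return parts


# The seven sample rows from the report, with real tab characters.
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]

# Python's str.split keeps trailing empty fields.
python_counts = [len(row.split("\t")) for row in ROWS]
# The emulated Java rule drops them.
java_style_counts = [len(java_style_split(row)) for row in ROWS]

print(python_counts)
print(java_style_counts)
```

Under that assumption, the second list has a 4 in the fifth position and 7 everywhere else, which is the pattern in the hive -e output quoted above. In Java itself, split(regex, -1) keeps the trailing empty strings.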