hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: Get 100 items in Comma Separated strings from Hive Column.
Date Fri, 10 Jun 2016 06:33:52 GMT
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1
as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.
 Is there a way to retrieve only 100 string items from that Column. Also we need to capture
number of comma separated string items. We are looking for more of   "substring_index" functionality,
since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there
a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF
available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5
Mime
View raw message