hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: Get 100 items in Comma Separated strings from Hive Column.
Date Fri, 10 Jun 2016 14:24:20 GMT
regexp_extract ('(,?[^,]*){0,10}',0)

(...){0,10}

The expression surrounded by brackets repeats 0 to 10 times.


(,?[…]*)

Optional comma followed by sequence (0 or more) of characters


[^,]

Any character which is not comma


regexp_extract (...,0)

0 stands for the whole expression
1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)
2 stands for the 2nd expression which is surrounded by brackets (ordered by the opening brackets)
3 stands for the 3rd expression which is surrounded by brackets (ordered by the opening brackets)
Etc.



regexp_replace (((,?[^,]*){0,10}).*','$1')

Similar to regexp_extract but this time we’re not extracting the first 10 tokens but replacing
the whole expression with the first 10 tokens.
The expression that stands for the first 10 tokens is identical to the one we used in regexp_extract
.* stands for any character that repeats 0 or more times which represent anything following
the first 10 tokens
$1 stands for the 1st expression which is surrounded by brackets (ordered by the opening brackets)


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Friday, June 10, 2016 2:54 PM
To: user@hive.apache.org
Subject: Re: Get 100 items in Comma Separated strings from Hive Column.


Thanks Dudu. I will check. Can you please throw some light on regexp_replace (((,?[^,]*){0,10}).*','$1')
 regexp_extract ('(,?[^,]*){0,10}',0),

On 6/9/2016 11:33 PM, Markovitz, Dudu wrote:
+ Improvement

The “Count” can be done in a cleaner way
(The previous way works also with simple ‘replace’)

hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t;

1              2
2              5
3              24
4              17
5              8
6              11
7              26
8              18
9              9


From: Markovitz, Dudu [mailto:dmarkovitz@paypal.com]
Sent: Thursday, June 09, 2016 11:30 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: RE: Get 100 items in Comma Separated strings from Hive Column.

----------------------------------------------------------------------------------------------------
--  bash
----------------------------------------------------------------------------------------------------
mkdir t

cat>t/data.txt
1|44,85
2|56,37,83,68,43
3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2
4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80
5|48,78,39,62,16,44,43,63
6|35,97,99,19,22,50,29,84,82,25,77
7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73
8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2
9|39,64,29,14,9,42,66,56,33

hdfs dfs -put t /tmp

----------------------------------------------------------------------------------------------------
--  hive
----------------------------------------------------------------------------------------------------

create external table t
(
    RowID           int
   ,stringColumn    string
)
    row format delimited
    fields terminated by '|'
    location '/tmp/t'
;

select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1
as count from t;

1    44,85 2
2    56,37,83,68,43  5
3    33,48,42,18,23,80,31,86,48,42   24
4    77,26,95,53,11,99,74,82,7,55    17
5    48,78,39,62,16,44,43,63    8
6    35,97,99,19,22,50,29,84,82,25   11
7    80,43,82,94,81,58,70,8,70,6     26
8    66,44,66,4,80,72,81,63,51,24    18
9    39,64,29,14,9,42,66,56,33  9


Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace

hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0);

1,2,3,4,5,6,7,8,9,10

hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1');

1,2,3,4,5,6,7,8,9,10


From: Mahender Sarangam [mailto:Mahender.BigData@outlook.com]
Sent: Thursday, June 09, 2016 7:13 PM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: Get 100 items in Comma Separated strings from Hive Column.


Hi,

We have hive table which has a single column with more than 1000 comma separated string items.
 Is there a way to retrieve only 100 string items from that Column. Also we need to capture
number of comma separated string items. We are looking for more of   "substring_index" functionality,
since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there
a way to achieve the same functionality with  "regexp_extract" and I also see there is UDF
available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java

Scenario : Table1 (Source Table)

RowID stringColumn

1 1,2,3,4...10000

2 2,4,5,8,4

3 10,11,98,100

Now i Would like to show table result structure like below

Row ID 100String count

1 1,2,3...100 10000

2 2,4,5,8,4 5

Mime
View raw message