hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Dixon <>
Subject RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings
Date Fri, 06 Feb 2015 09:18:34 GMT
Below 2 solutions.

Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also doesn’t enforce
the pairs you’re asking for, so for instance hip music would become hipmusic.

Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams and enforces
the actual patterns you want to change (so ‘hip music’ would remain ‘hip music’, however
it presumes that for each ngram you don’t want to keep any spaces.

from (
from dual
 select explode(array('hip hop','rock music')) as txt
) sub
,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1
,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2

Using a text editor like sublime or n++ you could construct the monster regex (for solution1)
or the IN set (for solution2) to capture all the patterns you need.

From: Pradeep Gollakota []
Sent: 03 February 2015 21:56
Cc: Viral Parikh
Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

I don't think this is doable using the out of the box regexp_replace() UDF. That way I would
do it, is using a file to create a mapping between a regexp and it's replacement and write
a custom UDF that loads this file and applies all regular expressions on the input.

Hope this helps.

On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <<>>

Hi Everyone,

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my
data and replace them with "hiphop" and "rockmusic" - basically replace them without white
space. I have used the regexp_replace function in hive. Below is my query and it works great
for above 2 examples.

drop table vp_hiphop;

create table vp_hiphop as

select userid, ntext,

       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic')
as ntext1

from  vp_nlp_protext_males


But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just
remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the
replace I want to simply trim the white spaces. Below is what I tried. I also tried using
trim with regexp_replace but it wants the third argument in the regexp_replace function.

drop table vp_hiphop;

create table vp_hiphop as

select  userid, ntext,

        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1

from  vp_nlp_protext_males




The sender does not guarantee that this message, including any attachment, is secure or virus
free. Also, it is confidential and may be privileged or otherwise protected from disclosure.
If you are not the intended recipient, do not disclose or copy it or its contents. Please
telephone or email the sender and delete the message entirely from your system. No binding
obligations or payment commitments are to be derived from the contents of this email unless
and until a clear written agreement containing all the necessary terms and conditions is properly

Jagex Limited is a company registered in England & Wales with company number 03982706
and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
View raw message