hive-user mailing list archives

From Devopam Mittra <devo...@gmail.com>
Subject Re: Hive - regexp_replace function for multiple strings
Date Wed, 04 Feb 2015 01:32:20 GMT
Hi Viral,
Unless you are strictly required to change the text itself to achieve
your objectives, you may wish to explore ngrams() and context_ngrams()
to identify the patterns you care about and move them to a new table
for further processing.

If it has to be done by replacing only, it is better to do it at the
file level on Unix, for faster and cleaner results.
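
If the replacing route is taken inside Hive after all, the core of the
custom-UDF idea from Pradeep's reply quoted below could look roughly like
this. It is only a minimal sketch under assumptions: PhraseJoiner, compile
and join are hypothetical names, the phrase list is assumed to be already
in memory (reading it from a mapping file is left out), and the Hive UDF
wrapper itself is not shown or tested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: joins known multi-word phrases
// by deleting the whitespace inside each match.
class PhraseJoiner {

    // Builds one alternation such as \b(?:\Qhip hop\E|\Qrock music\E)\b.
    // Pattern.quote keeps any regex metacharacters in a phrase literal.
    static Pattern compile(List<String> phrases) {
        List<String> quoted = new ArrayList<>();
        for (String phrase : phrases) {
            quoted.add(Pattern.quote(phrase));
        }
        return Pattern.compile("\\b(?:" + String.join("|", quoted) + ")\\b");
    }

    // Scans the text once and replaces every matched phrase with the
    // same phrase minus its whitespace.
    static String join(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String joined = m.group().replaceAll("\\s+", "");
            m.appendReplacement(out, Matcher.quoteReplacement(joined));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The pattern is compiled once and each row is scanned a single time, so
100 phrases do not need 100 nested regexp_replace() calls. When one phrase
is a prefix of another, the alternative listed first wins, so longer
phrases should probably be listed first.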

regards
Devopam


On Wed, Feb 4, 2015 at 3:25 AM, Pradeep Gollakota <pradeepg26@gmail.com>
wrote:

> I don't think this is doable using the out-of-the-box regexp_replace()
> UDF. The way I would do it is to use a file that maps each regexp to
> its replacement, and write a custom UDF that loads this file and
> applies all the regular expressions to the input.
>
> Hope this helps.
>
> On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <viral.j.parikh@gmail.com>
> wrote:
>
>> Hi Everyone,
>>
>> I am using Hive 0.13. I want to find multiple tokens like "hip hop" and
>> "rock music" in my data and replace them with "hiphop" and "rockmusic",
>> i.e. the same phrases without the white space. I have used the
>> regexp_replace function in Hive. Below is my query, and it works great
>> for the above two examples.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>        regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
>> from vp_nlp_protext_males;
>>
>> But I have 100 such bigrams/ngrams and want to be able to do the
>> replacement efficiently, where I just remove the white space. I can
>> pattern match the phrases "hip hop" and "rock music", but in the
>> replacement I simply want to strip the white space. Below is what I
>> tried. I also tried using trim with regexp_replace, but regexp_replace
>> requires a third argument.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop as
>> select userid, ntext,
>>        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
>> from vp_nlp_protext_males;
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary
