hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Raghunath, Ranjith" <>
Subject RE: Regex and serde with hive
Date Fri, 23 Dec 2011 20:02:29 GMT
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde configuration
and it worked well. I did come across the performance implications as I increased the volume
so I went ahead and cleansed the data prior to inserting it. Thanks again. 

Thank you, 

-----Original Message-----
From: Mark Grover [] 
Sent: Friday, December 23, 2011 9:29 AM
Subject: Re: Regex and serde with hive

Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
"input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
"output.format.string" = "a:%1$s,b:%2$s,c:%3$s"

In general, it's good practice to test your regex before using it in the create table statement.
Write a small test or use a website like this (
to test your regex on the data.

Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of performance) and
you might at a later stage want to re-think of what other SerDe's you can use.

As far as using "FIELDS TERMINATED BY" goes, you could use something like FIELDS TERMINATED
BY ~ (or "~"), but you will still have to take care of the double quote(s).


Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: www:

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Loren Siebert" <>
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive

The input regexp does not look right to me. You are expecting a space between groups, but
your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t
it look more like this:
"input.regex" = “\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""

Rather than trying to tackle it all at once, I find it easier to start with a table of one
column and then build up from there until I have all my columns.

On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:

I have been struggling with this for a while so I would appreciate any advice that you any
of you may have.

I have a file of the format


I created the following table definition to get the data loaded

(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
"input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
"output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select get NULL values. Thanks again.

Thank you,

View raw message