hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Raghunath, Ranjith" <Ranjith.Raghuna...@usaa.com>
Subject RE: Regex and serde with hive
Date Fri, 23 Dec 2011 20:02:29 GMT
Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde configuration
and it worked well. I did come across the performance implications as I increased the volume
so I went ahead and cleansed the data prior to inserting it. Thanks again. 

Thank you, 
Ranjith 

-----Original Message-----
From: Mark Grover [mailto:mgrover@oanda.com] 
Sent: Friday, December 23, 2011 9:29 AM
To: user@hive.apache.org
Subject: Re: Regex and serde with hive

Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
"output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create table statement.
Write a small test or use a website like this (http://www.regexplanet.com/simple/index.html)
to test your regex on the data.

Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of performance) and
you might at a later stage want to re-think of what other SerDe's you can use.

As far as using "FIELDS TERMINATED BY" goes, you could use something like FIELDS TERMINATED
BY ~ (or "~"), but you will still have to take care of the double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgrover@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.


----- Original Message -----
From: "Loren Siebert" <loren@siebert.org>
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive


The input regexp does not look right to me. You are expecting a space between groups, but
your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t
it look more like this:
"input.regex" = “\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""


Rather than trying to tackle it all at once, I find it easier to start with a table of one
column and then build up from there until I have all my columns.


On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:





I have been struggling with this for a while so I would appreciate any advice that you any
of you may have.

I have a file of the format

“Xyz”~”qsd”~”1234”

I created the following table definition to get the data loaded

CREATE TABLE dummy
(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
"output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select get NULL values. Thanks again.

Thank you,
Ranjith

Mime
View raw message