hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Omernik <j...@omernik.com>
Subject Re: Table creation for logfile data
Date Sun, 24 Nov 2013 11:55:39 GMT
Put the logile into a location on HDFS, and create an external table
pointing to that location. The External table should just have one column,
a string,
CREATE EXTERNAL TABLE logfile_etl (message STRING) LOCATION '/etl/logfile'

I think that should work.

Then Create another table

CREATE TABLE logfile (ts STRING, ADD STRING, files STRING) PARTITIONED BY
(DAY STRING) STORED AS ORC;

copy files into /etc/logfile

run this hive file:

SET hive.exec.dynamic.partition=true;

SET hive.exec.dynamic.partition.mode=nonstrict;

SET hive.exec.max.dynamic.partitions=10000;

SET hive.exec.max.dynamic.partitions.pernode=1000;

INSERT INTO logfile
select substring(message_line, 0, 17) as ts
regexp_extract(message_line, '\\[([^\\]+)\\]') as ADD,
regexp_extract(message_line,'\\] \\[([^\\]]+)\\]') as files,
concat('20', substring(messageline, 0, 8)) as day
from logfile_etl

delete the the files /etl/logfile (or move them to an archival)

That will get you a day partitioned (I added the 20 in front of your date
so that string sorts well, although it probably would without it, it'
early, and I have not had coffee yet) ORC file table (with compression and
ORC good ness. The regexs are a little messy, by based on your one line of
data, should work. Also: If you have data from pre 2000 obviously, the
concat('20' thing needs to be  updated.  Note, I didn't use a regex on the
date... why? It appears to be properly padded data, therefore a substring
is fast. This type of stuff has so many ways to skin a cat, so your way may
be totally different from my way, but this is how I'd approach it long
term. (if it's a one time thing, I may not create the managed the table,
but if so, having partitions and ORC files will make things faster).  If
there are syntax errors I apologize, see earlier disclaimer about lack of
proper bean sourced stimulants.





On Sat, Nov 23, 2013 at 7:36 AM, Baahu <bahubali@gmail.com> wrote:

> Hi,
> I have a messy log file which I want to use to create a table, I am only
> interested to retrieve 3 columns (time,ADD,files),which are in bold.
> Sample entry from log file
> *: 13-11-23 06:23:45 [ADD] [file1.zip|file2.zip] *  junkjunk|2013-11-23
> 06:23:44:592 EST|file3.zip xyz|2013-11-23 06:23:44:592 EST|file3.zip
>
> Can you please let me know how I should go about, regex seems to be way
> out,but I am struggling with that as well !!
>
> Thanks,
> Baahu
>
>

Mime
View raw message