Return-Path: X-Original-To: apmail-hadoop-common-user-archive@www.apache.org Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2DC438786 for ; Thu, 1 Sep 2011 14:54:17 +0000 (UTC) Received: (qmail 84017 invoked by uid 500); 1 Sep 2011 14:54:11 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 83256 invoked by uid 500); 1 Sep 2011 14:54:10 -0000 Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-user@hadoop.apache.org Delivered-To: mailing list common-user@hadoop.apache.org Received: (qmail 83202 invoked by uid 99); 1 Sep 2011 14:54:09 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 01 Sep 2011 14:54:09 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of brock@cloudera.com designates 209.85.210.176 as permitted sender) Received: from [209.85.210.176] (HELO mail-iy0-f176.google.com) (209.85.210.176) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 01 Sep 2011 14:54:04 +0000 Received: by iafi7 with SMTP id i7so2916025iaf.35 for ; Thu, 01 Sep 2011 07:53:43 -0700 (PDT) MIME-Version: 1.0 Received: by 10.231.45.206 with SMTP id g14mr412712ibf.97.1314888445425; Thu, 01 Sep 2011 07:47:25 -0700 (PDT) Received: by 10.231.180.34 with HTTP; Thu, 1 Sep 2011 07:47:25 -0700 (PDT) In-Reply-To: <32379849.post@talk.nabble.com> References: <32379849.post@talk.nabble.com> Date: Thu, 1 Sep 2011 09:47:25 -0500 Message-ID: Subject: Re: Creating a hive table for a custom log From: Brock Noland To: common-user@hadoop.apache.org Cc: core-user@hadoop.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Hi, On Thu, Sep 1, 2011 at 9:08 AM, Raimon Bosch wrote= : > > Hi, > > I'm trying to create a table similar to apache_log but I'm trying to avoi= d > to write my own map-reduce task because I don't want to have my HDFS file= s > twice. > > So if you're working with log lines like this: > > 186.92.134.151 [31/Aug/2011:00:10:41 +0000] "GET > /client/action1/?transaction_id=3D8002&user_id=3D871793100001248&ts=3D131= 4749223525&item1=3D271&item2=3D6045&environment=3D2 > HTTP/1.1" > > 112.201.65.238 [31/Aug/2011:00:10:41 +0000] "GET > /client/action1/?transaction_id=3D9002&ts=3D1314749223525&user_id=3D90488= 71793100&item2=3D6045&item1=3D271&environment=3D2 > HTTP/1.1" > > 90.45.198.251=A0[31/Aug/2011:00:10:41 +0000] "GET > /client/action2/?transaction_id=3D9022&ts=3D1314749223525&user_id=3D90488= 71793100&item2=3D6045&item1=3D271&environment=3D2 > HTTP/1.1" > > And having in mind that the parameters could be in different orders. Whic= h > will be the best strategy to create this table? Write my own > org.apache.hadoop.hive.contrib.serde2? Is there any resource already > implemented that I could use to perform this task? I would use the regex serde to parse them: CREATE EXTERNAL TABLE access_log (ip STRING, dt STRING, request STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" =3D "([\\d.]+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\"") LOCATION '/path/to/file'; That will parse the three fields out and could be modified to separate out the action. Then I think you will need to parse the query string in Hive itself. > > In the end the objective is convert all the parameters in fields and use = as > type the "action". With this big table I will be able to perform my queri= es, > my joins or my views. > > Any ideas? > > Thanks in Advance, > Raimon Bosch. > -- > View this message in context: http://old.nabble.com/Creating-a-hive-table= -for-a-custom-log-tp32379849p32379849.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > >