Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 26050100F1 for ; Wed, 18 Dec 2013 07:31:31 +0000 (UTC) Received: (qmail 48324 invoked by uid 500); 18 Dec 2013 07:31:28 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 48282 invoked by uid 500); 18 Dec 2013 07:31:28 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 48274 invoked by uid 99); 18 Dec 2013 07:31:27 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Dec 2013 07:31:27 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy includes SPF record at spf.trusted-forwarder.org) Received: from [209.85.214.171] (HELO mail-ob0-f171.google.com) (209.85.214.171) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 18 Dec 2013 07:31:20 +0000 Received: by mail-ob0-f171.google.com with SMTP id wp18so7500066obc.2 for ; Tue, 17 Dec 2013 23:30:58 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=0Pe67k4r50OIK0egV420Nh6Qwl5fhbR0gN85YYVzdhg=; b=g3utsgzwTp11uIWML3iofCmRe0Tft7V3c+HWW/22OPLuU5N6idgmHFzyxp5a6KmNDo 6FBsyK6TW7Jv22OVnzJnwwzXE9ut+sFZG45xFk9e/520KKEq9turL0Pa2v6hNCzNNrIi sf3SKPPLwD1Y4RIfw3jHKyEOA3bc0RZNhHU2cYkn6L2lLDMQPhjo6yZpMg0MuBIT7BN8 fdvhmxjiY5bdl31+kDkfAjbJctqjfIIjwYOOG1flxJLkwSeoMRJpJkZkwihmRNsfmm8Q czU+2Ry25dljcp/61NkwlTyTojIgS8pck9k/xXcOI0gafQsW3xqYjDhFvGUS8DChi4RZ dH6Q== X-Gm-Message-State: ALoCoQlKIxSEJ1E2XXHMPXYRDO+rgW7D9TCrFX6KJ8C2pZRhYTiqoqdWH7Q2GN3o2wEQczKPQfpX MIME-Version: 1.0 X-Received: by 10.60.136.196 with SMTP id qc4mr19695575oeb.41.1387351858416; Tue, 17 Dec 2013 23:30:58 -0800 (PST) Received: by 10.76.11.138 with HTTP; Tue, 17 Dec 2013 23:30:58 -0800 (PST) In-Reply-To: References: Date: Wed, 18 Dec 2013 16:30:58 +0900 Message-ID: Subject: Re: Pointing multiple external tables to the same location From: =?EUC-KR?B?TmF2aXO3+b3Cv+w=?= To: "user@hive.apache.org" Content-Type: multipart/alternative; boundary=047d7b33962fb31d7804edca061f X-Virus-Checked: Checked by ClamAV on apache.org --047d7b33962fb31d7804edca061f Content-Type: text/plain; charset=ISO-8859-1 Hive uses path to table(or partition) mapping internally (you can see that in MapredWork, etc.), which might caused first table overwritten by other. I didn't tried symlink on hdfs, which could be a solution. 2013/12/12 Petter von Dolwitz (Hem) > Hi, > > I have declared several external tables pointing to the same location. The > things that tells these tables apart (apart from their names) is that they > have unique properties. These properties help me choose the correct rows > from the underlying file. I use a single storage handler (accompanied by a > single InputFormat and a single Serde) . The first columns in all tables > are the same but the last (a struct) is unique and > is constructed from the Serde (with help of the serde properties). A > simplified version of the tables look like so: > > CREATE EXTERNAL TABLE Table1 ( > column1 STRING, > column2 STRING) > STORED BY 'MyStorageHandler' > WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1') > LOCATION 'mylocation' > TBLPROPERTIES('recordreader.filter'='table1_filter'); > > CREATE EXTERNAL TABLE Table2 ( > column1 STRING, > column2 STRING) > STORED BY 'MyStorageHandler' > WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2') > LOCATION 'mylocation' > TBLPROPERTIES('recordreader.filter'='table2_filter'); > > > All works well for simple select queries towards the two tables. The > following query gives very strange results though: > > SELECT * FROM ( > SELECT column1,'Table1' FROM Table1 WHERE column2 = 'myValue' > union all > SELECT column1,'Table2' FROM Table2 WHERE column2 = 'myValue' > ) my_union > ORDER BY my_union.column1 > > > It seems like one job task is created per file stored in the table > location. This task gets the table properties from the second table and in > the SerDe-step later on it seems like the records gets mixed up. > > I would have expected that hive would need to iterated the source files > two times using two different tasks (with the correct table properties > passed) in order to get this to work. > > Anyone here that can shed some light on this scenario? > > Thanks, > Petter > > > > > > > --047d7b33962fb31d7804edca061f Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
Hive uses path to table(or partition) mapping internally (= you can see that in MapredWork, etc.), which might caused=A0first table=A0o= verwritten by other.=A0

I didn't tried symlink on hdfs, wh= ich could be a solution.



2013/12/12 Petter von Dolwitz (Hem) <petter.von.d= olwitz@gmail.com>
Hi,
=
I have declared several external tables pointing to the same loca= tion. The things that tells these tables apart (apart from their names) is = that they have unique properties. These properties help me choose the corre= ct rows from the underlying file. I use a single storage handler (accompani= ed by a single InputFormat and a single Serde) . The first columns in all t= ables are the same but the last (a struct) is unique and
is constructed from the Serde (with help of the serde properties). A simpli= fied version of the tables look like so:

CREATE EXTERNAL TABLE Table= 1 (
=A0 column1 STRING,
=A0 column2 STRING)
=A0 STORED BY 'MyS= torageHandler'
=A0 WITH SERDEPROPERTIES ('ser.class'=3D'MyStructSerializationC= lass1')
=A0 LOCATION 'mylocation'
=A0 TBLPROPERTIES('= recordreader.filter'=3D'table1_filter');

CREATE EXTERNAL= TABLE Table2 (
=A0 column1 STRING,
=A0 column2 STRING)
=A0 STORED BY 'MyStorageH= andler'
=A0 WITH SERDEPROPERTIES ('ser.class'=3D'MyStruc= tSerializationClass2')
=A0 LOCATION 'mylocation'
=A0 TBLP= ROPERTIES('recordreader.filter'=3D'table2_filter');


All works well for simple select queries towards the two tabl= es. The following query gives very strange results though:

SELECT * = FROM (
=A0 SELECT column1,'Table1' FROM Table1 WHERE column2 =3D= 'myValue'
=A0 union all
=A0 SELECT column1,'Table2' FROM Table2 WHERE colu= mn2 =3D 'myValue'
=A0 ) my_union
ORDER BY my_union.column1

It seems like one job task is created per file stored in the= table location. This task gets the table properties from the second table = and in the SerDe-step later on it seems like the records gets mixed up.

I would have expected that hive would need to iterated the source= files two times using two different tasks (with the correct table properti= es passed) in order to get this to work.

Anyone= here that can shed some light on this scenario?

Thanks,
Petter







--047d7b33962fb31d7804edca061f--