Subject: Re: Using Hive metastore as general purpose RDBMS
From: Navis류승우
To: user@hive.apache.org
Date: Mon, 27 Jan 2014 13:25:31 +0900

I've heard of similar use cases from the platform team at NCSoft (a big
game company in Korea), but they might be using their own StorageHandler
(and HiveStoragePredicateHandler) implementation.

We might introduce a new injection point for partition pruning, if you can
implement the logic via an interface similar to HiveStoragePredicateHandler:

public interface HiveStoragePredicateHandler {
  public DecomposedPredicate decomposePredicate(
      JobConf jobConf,
      Deserializer deserializer,
      ExprNodeDesc predicate);
}

2014-01-23 Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com>:

> Hi Alan,
>
> Thank you for your reply. The loose idea I had was to store one row in the
> RDBMS per Hive partition, so I don't think size will be an issue (I'm
> expecting 3000 partitions or so). The end goal was to help decide which
> partitions are relevant for a query, something like adding partition info
> to the WHERE clause behind the scenes. The way the data is structured, we
> currently need to look up which partitions to use elsewhere.
>
> I'll look into ORC for sure. Currently we do not use any of the provided
> file formats, but have implemented our own InputFormat that reads gzipped
> protobufs. I suspect we should later investigate a possible performance
> gain from moving to another file format.
>
> Petter
>
>
> 2014/1/22 Alan Gates <gates@hortonworks.com>
>
>> HCatalog is definitely not designed for this purpose. Could you explain
>> your use case more fully? Is this indexing for better query planning or
>> faster file access? If so, you might look at some of the work going on in
>> ORC, which stores indices of its data in the format itself for these
>> purposes. Also, how much data do you need to store? Even the index for
>> Hadoop-scale data can quickly overwhelm MySQL or Postgres (which is what
>> most people use for their metastores) if you are keeping per-row
>> information. If you truly want to access an RDBMS as if it were an
>> external data store, you could implement a HiveStorageHandler for your
>> RDBMS.
>>
>> Alan.
>>
>> On Jan 22, 2014, at 2:02 AM, Petter von Dolwitz (Hem)
>> <petter.von.dolwitz@gmail.com> wrote:
>>
>> > Hi,
>> >
>> > I have a case where I would like to extend Hive to use information from
>> > a regular RDBMS. To limit the complexity of the installation, I thought
>> > I could piggyback on the already existing metastore.
>> >
>> > As I understand it, HCatalog is not built for this purpose. Is there
>> > someone out there who has a similar use case, or who has input on how
>> > this is done or whether it should be avoided?
>> >
>> > The use case is to look up which partitions contain certain data.
>> >
>> > Thanks,
>> > Petter
>>
>> --
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified
>> that any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender
>> immediately and delete it from your system. Thank You.
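[For readers following the thread: a decomposePredicate implementation splits a
WHERE clause into conjuncts the storage layer can evaluate and a residual Hive
must still apply. Here is a minimal self-contained sketch of that idea. It does
NOT use the real Hive types (JobConf, Deserializer, ExprNodeDesc); the
Condition and DecomposedPredicate records and the set of indexed columns are
hypothetical stand-ins invented for illustration.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PredicateDecompositionSketch {

    // Minimal predicate model: "column OP literal" conjuncts, AND-combined.
    record Condition(String column, String op, String literal) {}

    // Result of decomposition: conditions pushed down to the storage layer,
    // and the remainder that Hive must still evaluate itself.
    record DecomposedPredicate(List<Condition> pushedDown, List<Condition> residual) {}

    // Columns the (hypothetical) RDBMS-backed partition index can filter on.
    static final Set<String> INDEXED_COLUMNS = Set.of("event_date", "region");

    static DecomposedPredicate decompose(List<Condition> conjuncts) {
        List<Condition> pushed = new ArrayList<>();
        List<Condition> residual = new ArrayList<>();
        for (Condition c : conjuncts) {
            if (INDEXED_COLUMNS.contains(c.column()) && c.op().equals("=")) {
                pushed.add(c);    // the index can answer equality on indexed columns
            } else {
                residual.add(c);  // everything else stays with Hive
            }
        }
        return new DecomposedPredicate(pushed, residual);
    }

    public static void main(String[] args) {
        List<Condition> where = List.of(
            new Condition("event_date", "=", "2014-01-22"),
            new Condition("user_id", ">", "1000"));
        DecomposedPredicate d = decompose(where);
        System.out.println("pushed:   " + d.pushedDown());
        System.out.println("residual: " + d.residual());
    }
}
```

[The real interface additionally receives a JobConf and Deserializer so the
handler can inspect the table's serialization before deciding what to push.]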
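[Petter's scheme, one RDBMS row per Hive partition consulted before the query
runs, can be sketched without Hive at all. The partition names and the id-range
column below are invented for illustration; a real deployment would read the
rows over JDBC rather than from an in-memory list.]

```java
import java.util.List;
import java.util.stream.Collectors;

public class PartitionIndexSketch {

    // One row per Hive partition: the partition spec plus the value range it covers.
    record PartitionRow(String partition, long minId, long maxId) {}

    // Keep only partitions whose [minId, maxId] range can contain matches for an
    // "id BETWEEN lo AND hi" query; the rest are skipped before Hive plans the scan.
    static List<String> prune(List<PartitionRow> index, long lo, long hi) {
        return index.stream()
                .filter(p -> p.maxId() >= lo && p.minId() <= hi)
                .map(PartitionRow::partition)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PartitionRow> index = List.of(
            new PartitionRow("dt=2014-01-20", 0, 999),
            new PartitionRow("dt=2014-01-21", 1000, 1999),
            new PartitionRow("dt=2014-01-22", 2000, 2999));
        // Only the two partitions overlapping [1500, 2100] survive pruning.
        System.out.println(prune(index, 1500, 2100));
        // prints [dt=2014-01-21, dt=2014-01-22]
    }
}
```

[At ~3000 partitions, as in the thread, this index stays tiny; Alan's caution
about overwhelming MySQL/Postgres applies to per-row, not per-partition, data.]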