Return-Path: X-Original-To: apmail-hbase-dev-archive@www.apache.org Delivered-To: apmail-hbase-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id B7F2B748B for ; Sun, 18 Sep 2011 17:46:42 +0000 (UTC) Received: (qmail 619 invoked by uid 500); 18 Sep 2011 17:46:42 -0000 Delivered-To: apmail-hbase-dev-archive@hbase.apache.org Received: (qmail 592 invoked by uid 500); 18 Sep 2011 17:46:42 -0000 Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hbase.apache.org Delivered-To: mailing list dev@hbase.apache.org Delivered-To: moderator for dev@hbase.apache.org Received: (qmail 43773 invoked by uid 99); 18 Sep 2011 11:10:20 -0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of tovbinm@gmail.com designates 209.85.210.169 as permitted sender) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:from:date:x-google-sender-auth:message-id :subject:to:content-type; bh=aD1gG2lXYqeKNoB89uWTFf6SRHoYzClhTEwQGrNzRMs=; b=j8O/lQ5ZGVC2FG0cOU3Sc63FpdQlDcOlLugkh9Td+S3iCkLslCwvg6k6Fjg+Bwqmf5 sjMRh/p1Nxz0m+h6u/lUuME1cjOp79b15Urnc90RWWeW/fgLPCpHLWs8dwhNh/SFcHVM TDtq4Mkb6eTMGSP8M4CAaA/LsOXPbgjeniOhM= MIME-Version: 1.0 Sender: tovbinm@gmail.com From: Matthew Tovbin Date: Sun, 18 Sep 2011 14:09:34 +0300 X-Google-Sender-Auth: gckVD81MGeSa7WlCugtcdbXfwtM Message-ID: Subject: Hbase-Hive integration performance issues To: user@hbase.apache.org, dev@hbase.apache.org Content-Type: multipart/alternative; boundary=00151774158617535504ad3544de --00151774158617535504ad3544de Content-Type: text/plain; charset=ISO-8859-1 Hi guys, I've got a table in Hbase let's say "tbl" and I would like to query it using Hive. Therefore I mapped a table to hive as follows: CREATE EXTERNAL TABLE tbl(id string, data map) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:") TBLPROPERTIES("hbase.table.name" = "tbl"); Queries like: "select * from tbl", "select id from tbl", "select id, data from tbl" are really fast. But queries like "select id from tbl where substr(id, 0, 5) = "12345"" or "select id from tbl where data["777"] IS NOT NULL" are incredibly slow. In the contrary when running from Hbase shell: "scan 'tbl', { COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or "scan 'tbl', { COLUMNS=>'data', "FILTER" => FilterList.new([qualifierFilter('777')])}" it is lightning fast! When I looked into the mapred job generated by hive on jobtracker I discovered that "map.input.records" counts ALL the items in Hbase table, meaning the job makes a full table scan before it even starts any mappers!! Moreover, I suspect it copies all the data from Hbase table to hdfs to mapper tmp input folder before executuion. So, my questions are - Why hbase storage handler for hive does not translate hive queries into appropriate hbase functions? Why it scans all the records and then slices them using "where" clause? How can it be improved? Is Pig's integration better in this case? Some additional information about the tables: Table description in Hbase: jruby-1.6.2 :011 > describe 'tbl' DESCRIPTION ENABLED {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER => 'ROWCOL', REPLICATIO true N_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} Table desciption in Hive: hive> describe tbl; OK id string from deserializer data map from deserializer Time taken: 0.08 seconds Best regards, Matthew Tovbin =) --00151774158617535504ad3544de--