Return-Path: Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: (qmail 2602 invoked from network); 16 Mar 2010 12:36:42 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 16 Mar 2010 12:36:42 -0000 Received: (qmail 79163 invoked by uid 500); 16 Mar 2010 12:36:39 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 79124 invoked by uid 500); 16 Mar 2010 12:36:39 -0000 Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-user@hadoop.apache.org Delivered-To: mailing list common-user@hadoop.apache.org Received: (qmail 79109 invoked by uid 99); 16 Mar 2010 12:36:39 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Mar 2010 12:36:39 +0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of prasen.bea@gmail.com designates 209.85.212.48 as permitted sender) Received: from [209.85.212.48] (HELO mail-vw0-f48.google.com) (209.85.212.48) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Mar 2010 12:36:31 +0000 Received: by vws20 with SMTP id 20so1370960vws.35 for ; Tue, 16 Mar 2010 05:36:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=7nnC7StSl9ybnRyV0hi8gEVjYlrbJryOeSE4IEFiYV0=; b=Md2oncXrQ2tEqUCy98wahwFQmR0owPiWeJ+VINU6FtRXT0YzC/Rorf1A7CnjlTC9nv McvB90+Pj9u6Z2MxEFuAvHOgQuaDYAiDnmFzauZeZGbojSInB+F5sb/pZ/3M0Y0E1H/b Im2GEiBQ99akpVoViO4/DT9YnUCkcOVHLzCeY= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=a/QIyQvd86e44xEVWXJWZxhLVvNTJbLuziAE3n5wA3VSB+sIyyZLABKi1lZdgsf37K DG7HEov2xfkG9zhH7IJn7ftRyzzLB2NuB0j7PvXo+Fpmw/ttDfcj9dAwAbCOt3hEs3y+ QCoBsffv1mH52nwkUh+UOYyMrO3dkpjYvnJBc= MIME-Version: 1.0 Received: by 10.220.126.167 with SMTP id c39mr4035789vcs.131.1268742970610; Tue, 16 Mar 2010 05:36:10 -0700 (PDT) In-Reply-To: <500e4a1e1003152317n44682772s22a5b1027f7703b3@mail.gmail.com> References: <500e4a1e1003152317n44682772s22a5b1027f7703b3@mail.gmail.com> Date: Tue, 16 Mar 2010 18:06:10 +0530 Message-ID: <500e4a1e1003160536t7e563517qf63b5e75da363f9b@mail.gmail.com> Subject: Fwd: How to avoid a full table scan for column search. ( HIVE+LUCENE) From: prasenjit mukherjee To: common-user , pig-user Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org forwarding to hdfs and pig mailing-lists for responses from wider audience= . ---------- Forwarded message ---------- From: prasenjit mukherjee Date: Tue, Mar 16, 2010 at 11:47 AM Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE) To: hive-user Is there a way to avoid full table scan for an arbitrary where-clause usage ? partitioning/bucketing makes sense only when you know which columns will be searched upon. I was wondering if there is any project which combines the SQL-like features of HIVE and inverted-index like search-features of LUCENE, and works on cloud. Guess I am asking for too much :( I have been using oracle till now and my usage is mainly restricted to do summation-type queries with some where clause, example being =A0: "Select SUM(column1) =A0where col2=3D'foo' AND col3=3D'bar'". The output is always some aggregation and where clauses can include "<, >, =3D, IN". =A0I would like to use some kind of distributed processing to speed up the table generation, search query-time. Hive ( and to some extent Pig ) seems to be the closest tool available to what I am looking for. I am also exploring hbase, but not sure whether it will be the right choice for my problem. Hive can definitely help in parallelizing up the search-processing. But my main concern is whether hive does ( or plans to do ) any storage optimization like oracle,lucene ( apart from simple partitioning/bucketing ). =A0It seems that all the hadoop-options ( hive,pig,hbase) will have to do =A0an entire table scan. Appreciate any suggestions/feedback.. -thanks, Prasen