From: Eric Czech
Date: Thu, 11 Oct 2012 09:56:25 -0400
Subject: Managing index generation processes
To: user@hadoop.apache.org

Hi everyone,

Are there any tools or libraries for managing HDFS files that are used solely for the purpose of creating indexes in HBase? In other words, is there any way to seamlessly integrate new HDFS files into a periodic MapReduce process that builds indexes, and also to reprocess those files if the index-building logic or the underlying HDFS files change?

I'm looking for something similar to HCatalog, but the limitation I find with it is that there's no way to rebuild parts of an index without deleting the old index entries or having to guarantee that the new index cells will completely overwrite the old ones.

Here's an example to better explain:

- Assume I want to build an index in HBase on HDFS files A, B, and C.
- Let's say I build that index with a MapReduce job and then realize that one of the auxiliary lookup files used in that job was not completely correct.
- I'd like to rerun the indexing job at this point, but it's entirely possible that the new index won't involve all the same cells as the old index.
- Now, I can't delete all the old index entries before running the new job, since that index may still be in use, so there's no obvious way to update the index in isolation.

The prevailing approach to solving this seems to be continually rebuilding the indexes in full and atomically switching the old indexes out for the new ones. A better approach might be to do the same thing at a finer granularity, and what I'm really asking is whether there is any tool that does exactly that.

A naive approach to "versioning" at this finer granularity might simply tie HDFS files to cells in HBase, give that association a version number, and allow clients to read only the cells associated with active versions (as opposed to versions that are currently being inserted into HBase). The "active" version could then be incremented at the end of a successful MapReduce index build for all files used in that job.

If there are no existing tools for something like this, then doing what I mentioned above is probably the route I'll take, and I'm very curious to hear whether others are facing similar problems and whether a tool to solve them would be more widely beneficial.

Thank you for your time, and I apologize if this might be a better question for the HBase users list.

- Eric

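P.S. To make the versioning idea concrete, here's a minimal sketch using an in-memory stand-in for the HBase table. None of the names here (`VersionedIndex`, `write_cell`, `activate`, `read_cell`) are real HBase or HCatalog APIs; the point is just the reader-side version gate and the atomic flip at the end of a build:

```python
# Hypothetical sketch: every index cell records which source file and build
# version produced it, and readers only see cells whose version is currently
# marked "active" for that file. A dict stands in for the HBase table.

class VersionedIndex:
    def __init__(self):
        self.active = {}   # source file -> currently readable build version
        self.cells = {}    # (row, column) -> list of (file, version, value)

    def write_cell(self, row, col, value, src_file, version):
        """Writer path: an index build tags every cell it emits with its
        source file and build version; old cells are left untouched."""
        self.cells.setdefault((row, col), []).append((src_file, version, value))

    def activate(self, src_file, version):
        """The atomic switch at the end of a successful build: from here on,
        readers see only cells produced from src_file at this version."""
        self.active[src_file] = version

    def read_cell(self, row, col):
        """Reader path: ignore cells from builds that are not yet active."""
        for src_file, version, value in reversed(self.cells.get((row, col), [])):
            if self.active.get(src_file) == version:
                return value
        return None


idx = VersionedIndex()
idx.write_cell("row1", "q", "v1-value", "fileA", 1)
idx.activate("fileA", 1)

# Rebuild fileA's index with corrected lookup data; while the rebuild is in
# flight, readers still see only the version-1 cells.
idx.write_cell("row1", "q", "v2-value", "fileA", 2)
assert idx.read_cell("row1", "q") == "v1-value"

idx.activate("fileA", 2)   # flip the active version once the rebuild succeeds
assert idx.read_cell("row1", "q") == "v2-value"
```

In a real deployment the `active` map would itself live somewhere consistent (e.g. a small HBase table or ZooKeeper) so the flip is visible to all readers at once, and a cleanup pass could later delete cells whose version is no longer active for any file.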