From: Eric Czech
Date: Thu, 11 Oct 2012 09:56:25 -0400
Subject: Managing index generation processes
To: user@hadoop.apache.org

Hi everyone,

Are there any tools or libraries for managing HDFS files that are used solely for the purpose of creating indexes in HBase? In other words, is there any way to seamlessly integrate new HDFS files into a periodic MapReduce process that builds indexes, and also to reprocess those files if the index-building logic or the underlying HDFS files change?

I'm looking for something similar to HCatalog, but the limitation I find with it is that there's no way to rebuild parts of an index without deleting the old index entries or having to guarantee that the new index cells will completely overwrite the old ones.

Here's an example to better explain:

- Assume I want to build an index in HBase on HDFS files A, B, and C.
- Let's say I build that index with a MapReduce job and then realize that one of the auxiliary lookup files used in that job was not completely correct.
- I'd like to rerun the indexing job at this point, but it's entirely possible that the new index won't involve all the same cells as the old index.
- Now, I can't delete all the old index entries before running the new job, since that index may still be in use, so there's no obvious way to update the index in isolation.

The prevailing approach to solving this seems to be continually rebuilding the indexes in full and atomically switching the old indexes out for the new ones. A better approach might be to do the same thing at a finer granularity, and what I'm really asking is whether there is any tool that does exactly that.

A naive approach to "versioning" at this finer granularity might simply tie HDFS files to cells in HBase, give that association a version number, and allow clients to read only the cells associated with active versions (as opposed to versions that are currently being inserted into HBase). The "active" version could then be incremented at the end of a successful MapReduce index build for all files used in that job.

If there are no existing tools for something like this, then doing what I mentioned above is probably the route I'll take, and I'm very curious to hear whether others are facing similar problems and whether a tool to solve them would be more widely beneficial.

Thank you for your time, and I apologize if this might be a better question for the HBase users list.

- Eric

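P.S. To make the versioning idea concrete, here's a minimal sketch using an in-memory stand-in for the HBase table. None of the names here (`VersionedIndex`, `write_cell`, `activate`, `read_cell`) are real HBase or HCatalog APIs; the point is just the reader-side version gate and the atomic flip at the end of a build:

```python
# Hypothetical sketch: every index cell records which source file and build
# version produced it, and readers only see cells whose version is currently
# marked "active" for that file. A dict stands in for the HBase table.

class VersionedIndex:
    def __init__(self):
        self.active = {}   # source file -> currently readable build version
        self.cells = {}    # (row, column) -> list of (file, version, value)

    def write_cell(self, row, col, value, src_file, version):
        """Writer path: an index build tags every cell it emits with its
        source file and build version; old cells are left untouched."""
        self.cells.setdefault((row, col), []).append((src_file, version, value))

    def activate(self, src_file, version):
        """The atomic switch at the end of a successful build: from here on,
        readers see only cells produced from src_file at this version."""
        self.active[src_file] = version

    def read_cell(self, row, col):
        """Reader path: ignore cells from builds that are not yet active."""
        for src_file, version, value in reversed(self.cells.get((row, col), [])):
            if self.active.get(src_file) == version:
                return value
        return None


idx = VersionedIndex()
idx.write_cell("row1", "q", "v1-value", "fileA", 1)
idx.activate("fileA", 1)

# Rebuild fileA's index with corrected lookup data; while the rebuild is in
# flight, readers still see only the version-1 cells.
idx.write_cell("row1", "q", "v2-value", "fileA", 2)
assert idx.read_cell("row1", "q") == "v1-value"

idx.activate("fileA", 2)   # flip the active version once the rebuild succeeds
assert idx.read_cell("row1", "q") == "v2-value"
```

In a real deployment the `active` map would itself live somewhere consistent (e.g. a small HBase table or ZooKeeper) so the flip is visible to all readers at once, and a cleanup pass could later delete cells whose version is no longer active for any file.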