Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
Message-Id: <4BA2C29C-5F51-4670-8E45-6F977F2333BE@gmail.com>
From: Dru Jensen <drujensen@gmail.com>
To: hbase-user@hadoop.apache.org
In-Reply-To: <7a8854060901101032j7a4d3a01y652dd502bbf1f796@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v930.3)
Subject: Re: Accessing rows with number indexes
Date: Sat, 10 Jan 2009 16:01:28 -0800
References: <7a8854060901092056n37be4b09mcb5f39b0f06e92c7@mail.gmail.com>
 <92c4d8c10901100955r40d0a3fse3b83ab9b8d3fa02@mail.gmail.com>
 <7a8854060901101032j7a4d3a01y652dd502bbf1f796@mail.gmail.com>

I'm not sure whether this will work or whether it's a good idea, but is it
possible to use the tableindexed feature in 0.19 and create an
IndexKeyGenerator that does an auto-increment?

http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/client/tableindexed/package.html?view=markup
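
To make the idea a bit more concrete, here is a rough sketch in plain Python. It is not HBase or tableindexed code: a sorted list stands in for the main table's row keys, a dict stands in for a separate index table that maps a row number to a row key, and the names build_index, scan_from and rows_at are made up for illustration. The sample sentences other than "Hello World" are invented as well.

```python
import bisect


def build_index(sorted_keys):
    """Map a 0-based row number to its row key.

    Stands in for a separate index table (hypothetical, not the
    tableindexed API).
    """
    return {n: key for n, key in enumerate(sorted_keys)}


def scan_from(sorted_keys, start_key, limit):
    """Mimic a scanner opened at start_key: return up to `limit` keys in order."""
    start = bisect.bisect_left(sorted_keys, start_key)
    return sorted_keys[start:start + limit]


# Row keys in lexicographic order, as a scanner would return them.
# Only "Hello World" comes from the thread; the rest are invented.
row_keys = sorted([
    "Hello World",
    "A first sentence",
    "Zebra crossing",
    "Good morning",
    "Some other sentence",
    "Yet another one",
])
index = build_index(row_keys)


def rows_at(n, count):
    """Return `count` row keys starting at the n'th row (0-based)."""
    return scan_from(row_keys, index[n], count)


print(rows_at(2, 3))
```

The point is that one lookup in the index gives the row key for position n, and a scanner started at that key reads the next few rows without calling next() n times. The sketch assumes the index is numbered in sorted row-key order, and it does not address how to assign those numbers during a parallel load, which is the part I'm unsure about.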


On Jan 10, 2009, at 10:32 AM, Jim Twensky wrote:

> Unfortunately, yes, the sentences need to be sorted. I take advantage of
> the lexicographical ordering of the sentences for some other purpose.
> Even if I didn't, how could I generate the prefixes? Do you mean the
> number prefixes should be in the range [1-n], where n is the number of
> rows in the table? Since I use Hadoop to pull the data in, I can't see a
> trivial way to generate number prefixes, but I may be missing something
> obvious.
>
> Jim
>
> On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell <trsell@gmail.com> wrote:
>
>> Do the sentences need to be sorted?
>> If not, you could use a number prefix on the row key. Keep track of
>> the highest prefix and use that range to select a prefix randomly.
>> Then start a scanner at that prefix.
>>
>> ~Tim.
>>
>> 2009/1/10 Jim Twensky <jim.twensky@gmail.com>:
>>> Hello,
>>>
>>> I have an HBase table that contains sentences as row keys and a few
>>> numeric values as columns. A simple abstract model of the table looks
>>> like the following:
>>>
>>>
>>> --------------------------------------------------------------------------
>>> Sentence    | frequency:value | probability:value-0 | probability:value-2
>>> --------------------------------------------------------------------------
>>> Hello World | 5               | 0.000545321         | 0.002368204
>>> .           | .               | .                   | .
>>> .           | .               | .                   | .
>>> .           | .               | .                   | .
>>> --------------------------------------------------------------------------
>>>
>>>
>>> I create the table and load it using Hadoop, and there are hundreds of
>>> billions of entries in it. I use this table to solve an optimization
>>> problem using a hill climbing/simulated annealing method. Basically, I
>>> need to change the likelihood values randomly. For example, I need to
>>> change, say, the first 5 rows starting at the 112th row, do some
>>> calculations, and so on...
>>>
>>> Now the problem is, I can't see an easy way to access the n'th row
>>> directly. If I were using a traditional RDBMS, I'd add another column
>>> and auto-increment it each time I added a new row, but this is not
>>> possible since I load the table using Hadoop and there are parallel
>>> insertions taking place simultaneously. A quick and dirty way to do
>>> this might be adding a new index column after I load and initialize
>>> the table, but the table is huge and it doesn't seem right to me.
>>> Another bad approach would be to use a scanner starting from the first
>>> row and calling Scanner.next() n times inside a for loop to access the
>>> n'th row, which also seems very slow. Any ideas on how I could do it
>>> more efficiently?
>>>
>>> Thanks in advance,
>>> Jim
>>>
>>