From: Dru Jensen
To: hbase-user@hadoop.apache.org
Subject: Re: Accessing rows with number indexes
Date: Sat, 10 Jan 2009 16:01:28 -0800

I'm not sure this will work or is a good idea, but is it possible to use
the tableindexed feature in 0.19 and create an IndexKeyGenerator that does
an auto increment?

http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/client/tableindexed/package.html?view=markup

On Jan 10, 2009, at 10:32 AM, Jim Twensky wrote:

> Unfortunately, yes, the sentences need to be sorted. I take advantage of
> the lexicographical ordering of the sentences for some other purpose.
> Even if I didn't, how could I generate the prefixes? Do you mean number
> prefixes should be in the range [1-n] where n is the number of rows in
> the table?
> Since I use Hadoop to pull the data in, I can't see a trivial way to
> generate number prefixes, but I may be missing something obvious.
>
> Jim
>
> On Sat, Jan 10, 2009 at 11:55 AM, Tim Sell wrote:
>
>> Do the sentences need to be sorted?
>> If not, you could use a number prefix on the row key. Keep track of
>> the highest prefix and use that range to select a prefix randomly.
>> Then start a scanner at that prefix.
>>
>> ~Tim.
>>
>> 2009/1/10 Jim Twensky :
>>> Hello,
>>>
>>> I have an HBase table that contains sentences as row keys and a few
>>> numeric values as columns. A simple abstract model of the table looks
>>> like the following:
>>>
>>> ---------------------------------------------------------------------------
>>> Sentence    | frequency:value | probability:value-0 | probability:value-2
>>> ---------------------------------------------------------------------------
>>> Hello World | 5               | 0.000545321         | 0.002368204
>>> .           | .               | .                   | .
>>> .           | .               | .                   | .
>>> .           | .               | .                   | .
>>> ---------------------------------------------------------------------------
>>>
>>> I create the table and load it using Hadoop, and there are hundreds of
>>> billions of entries in it. I use this table to solve an optimization
>>> problem using a hill climbing/simulated annealing method. Basically, I
>>> need to change the likelihood values randomly. For example, I need to
>>> change, say, the first 5 rows starting at the 112th row and do some
>>> calculations, and so on...
>>>
>>> Now the problem is, I can't see an easy way to access the n'th row
>>> directly.
>>> If I were using a traditional RDBMS, I'd add another column and
>>> auto-increment it each time I added a new row, but this is not possible
>>> since I load the table using Hadoop and there are parallel insertions
>>> taking place simultaneously. A quick and dirty way to do this might be
>>> adding a new index column after I load and initialize the table, but the
>>> table is huge and it doesn't seem right to me. Another bad approach
>>> would be to use a scanner starting from the first row and calling
>>> Scanner.next() n times inside a for loop to access the n'th row, which
>>> also seems very slow. Any ideas on how I could do it more efficiently?
>>>
>>> Thanks in advance,
>>> Jim
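
For illustration only, here is a minimal sketch of the numeric-index idea raised in this thread. It does not use the HBase client or the tableindexed package: `SortedTable` is a hypothetical in-memory stand-in for a table whose rows are sorted lexicographically by key, and `index_key`, `build_index`, and `KEY_WIDTH` are made-up names. It assumes row numbers can be assigned in one ordered pass over the already-sorted sentence table, which may not be practical at the scale described above. The point it shows is that a zero-padded row number makes lexicographic key order match numeric order, so a scan can start directly at the n'th row.

```python
import bisect

KEY_WIDTH = 12  # hypothetical fixed width; must cover the largest row number


def index_key(n: int) -> str:
    """Zero-pad a row number so string order matches numeric order."""
    return str(n).zfill(KEY_WIDTH)


class SortedTable:
    """In-memory stand-in for a table kept sorted by row key (not HBase)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan(self, start_key, limit):
        """Return up to `limit` (key, value) pairs with key >= start_key."""
        i = bisect.bisect_left(self._keys, start_key)
        return [(k, self._rows[k]) for k in self._keys[i:i + limit]]


def build_index(sorted_sentences):
    """Map row number (1-based, zero-padded) to the sentence row key."""
    index = SortedTable()
    for n, sentence in enumerate(sorted_sentences, start=1):
        index.put(index_key(n), sentence)
    return index


# Toy data; the real table would hold the sentence row keys.
sentences = sorted(
    ["Hello World", "Good morning", "See you later", "Thank you", "How are you"]
)
index = build_index(sentences)

# "The first 3 rows starting at the 2nd row", by analogy with
# "the first 5 rows starting at the 112th row" in the original question.
window = [sentence for _, sentence in index.scan(index_key(2), limit=3)]
print(window)
```

Each sentence returned from the index would then be used as the row key for a direct lookup in the main table, so the main table keeps its lexicographic ordering.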