Mailing-List: contact hbase-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hbase-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <4978D192.60504@duboce.net>
Date: Thu, 22 Jan 2009 12:05:38 -0800
From: stack <stack@duboce.net>
User-Agent: Thunderbird 2.0.0.19 (Macintosh/20081209)
MIME-Version: 1.0
To: hbase-user@hadoop.apache.org
Subject: Re: HBase random read technics
References: <013101c97c9c$85024270$8f06c750$@com>
 <4978B104.5090105@duboce.net>
 <9683564c0901221144t455e002blf4f1d618768e024b@mail.gmail.com>
In-Reply-To: <9683564c0901221144t455e002blf4f1d618768e024b@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Genady Gillin wrote:
> Hi,
>
> We use HBase 0.19 RC2. Our data (~800GB) resides in one table (is that bad?).
> The table schema is pretty simple: two column families, one for keys and
> the other for values; each key can have one or more values (~100).
Keys in one column family and values in another?  Why not both in the 
same column family?

Do you use the keys in the first column family to do lookups into the second?


> To query
> values we use a file of keys (for instance, about 10M keys); the goal
> is to read all values for each of the keys, with an expected run time of
> about 1 hour. By the way, the data output is not too big, ~2Gb.
>   

Can you sort the keys and then start a scanner, perhaps with the first 
and last keys from the file as its start and stop keys?  Does that run faster?
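
To make the idea concrete, here is a minimal sketch in Python.  It does 
not call HBase at all: a sorted list of (row_key, values) tuples stands 
in for the table, and the scan() and lookup_sorted() helpers are 
hypothetical names made up for illustration.  The stop key is treated as 
inclusive here purely for simplicity.

```python
from bisect import bisect_left, bisect_right


def scan(table, start_key, stop_key):
    """Yield (row_key, values) in key order for start_key <= row_key <= stop_key.

    `table` is a list of (row_key, values) tuples sorted by row_key. It is only
    a stand-in for a table that keeps its rows sorted by key.
    """
    row_keys = [key for key, _ in table]
    lo = bisect_left(row_keys, start_key)
    hi = bisect_right(row_keys, stop_key)
    for i in range(lo, hi):
        yield table[i]


def lookup_sorted(table, wanted_keys):
    """Fetch the values for wanted_keys with one ordered pass over the table."""
    wanted = sorted(set(wanted_keys))
    if not wanted:
        return {}

    found = {}
    remaining = iter(wanted)
    target = next(remaining)

    # One sequential scan bounded by the smallest and largest wanted key.
    for row_key, values in scan(table, wanted[0], wanted[-1]):
        # Skip wanted keys that sort before this row: they are not in the table.
        while target is not None and target < row_key:
            target = next(remaining, None)
        if target is None:
            break
        if target == row_key:
            found[row_key] = values
            target = next(remaining, None)
            if target is None:
                break
    return found


if __name__ == "__main__":
    table = [("a", [1]), ("b", [2, 3]), ("d", [4]), ("f", [5]), ("z", [6])]
    print(lookup_sorted(table, ["f", "b", "c"]))
```

Whether one ordered pass beats individual random reads depends on how 
densely the wanted keys cover the scanned range, so it needs measuring 
on the real table.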

But it sounds like you need to run an MR job.  You tried that and it 
failed.  Did you try on the same hardware?  My guess is you were running into 
the issue we're discussing in the other email ('.... slept too long...').

St.Ack


> Thanks,
> Gennady
>
>
>
> On Thu, Jan 22, 2009 at 7:46 PM, stack <stack@duboce.net> wrote:
>
>   
>> Genady wrote:
>>
>>     
>>> Hi,
>>>
>>>
>>> Just wondering if somebody could recommend a random read strategy for
>>> searching a big group of keys (100M) in a hadoop/hbase cluster. Using one
>>> client is very slow; separating the input into smaller groups and running
>>> each one with a different client certainly improves performance, but the
>>> maximum speed I'm getting is ~3300 reads/sec. I've tried running the
>>> search as a map-reduce task, doing the HBase reads from the map or the
>>> reduce, but HBase starts to fail. So are a hardware upgrade and creating
>>> in-memory HBase tables the only direction here?
>>>
>>>
>>>
>>>       
>> Tell us more about your table schema, data sizes, and the types of query.
>>  What performance do you need from hbase?  For example, do your rows have
>> many columns, and are you trying to get all columns when you query?  Are you
>> on 0.19.0, Genady (sorry if you've answered this question in the near past)?
>> St.Ack
>>
>>     
>
>