hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Advice on table design
Date Mon, 22 Dec 2008 00:00:56 GMT
tim robertson wrote:
> Presumably a filter in a scanner runs as a filter in the first Map()
> job or is there something else going on?
>   

When you set up an MR job, you can pass a filter (if using 
TableInputFormat -- see setTableFilter).  The filter will run in all maps, 
not just the first.
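
For illustration, here is a rough sketch of wiring a server-side filter 
into a table-scanning MR job.  It is written against the later 
org.apache.hadoop.hbase.mapreduce API (TableMapReduceUtil and Scan) rather 
than the 0.19-era setTableFilter discussed in this thread, and the table 
name, key layout and regex are made up for the example:

// Sketch only: assumes the later org.apache.hadoop.hbase.mapreduce API,
// not the 0.19 TableInputFormat.setTableFilter() mentioned above.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FilteredTableScanJob {

  // Each map task sees only the rows that survived the server-side filter.
  static class RequestMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx) {
      // process one filtered row here
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "filtered-scan");
    job.setJarByClass(FilteredTableScanJob.class);

    // Server-side row filter: keep only keys ending in a particular client ID.
    Scan scan = new Scan();
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".*/user-1234$")));

    // The same Scan (and its filter) is handed to every map task.
    TableMapReduceUtil.initTableMapperJob("requests", scan,
        RequestMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}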

Unfortunately, it's not possible to specify a start/stop row when running 
an MR job at the moment, not unless you do your own splitter.  This is 
being looked into (HBASE-1075).
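
For reference, later HBase releases do let the Scan handed to the job carry 
a start and stop row, so only regions overlapping the range get map tasks. 
Continuing the (assumed) sketch above, and supposing the row keys begin 
with a fixed-width epoch-millis stamp:

// Fragment extending the sketch above; also needs
// org.apache.hadoop.hbase.util.Bytes.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("1229731200000"));  // inclusive lower bound
scan.setStopRow(Bytes.toBytes("1229817600000"));   // exclusive upper bound
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
    new RegexStringComparator(".*/user-1234$")));
TableMapReduceUtil.initTableMapperJob("requests", scan,
    RequestMapper.class, ImmutableBytesWritable.class, Result.class, job);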

St.Ack

> Thanks
>
> Tim
>
> On Sun, Dec 21, 2008 at 3:16 AM, stack <stack@duboce.net> wrote:
>   
>> Ryan LeCompte wrote:
>>     
>>> JG,
>>>
>>> Thanks for the tips!
>>>
>>> Question: If I decide to use a combined key of timestamp + ID, do you
>>> know if the query API has a way to do a partial search of the row key?
>>>
>>>       
>> There is a filter mechanism in HBase.  Filters run server-side, filtering on
>> row and/or column content.
>>
>> Scanners can be passed a start and end row.
>>
>> One approach would be to start a scanner between the times you are
>> interested in and pass in a filter that only returns rows for a particular
>> client ID (or rows that match a particular regex), for example.
>>
>> St.Ack
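
(For illustration, a minimal client-side version of the approach St.Ack 
describes above, sketched against the later HTable/Scan API; the table 
name, the "<timestamp>/<id>" key layout and the regex are assumptions, not 
anything fixed by this thread.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "requests");
    // Keys sort as "<13-digit epoch millis>/<client id>", so a start/stop row
    // bounds the scan to the time window of interest.
    Scan scan = new Scan(Bytes.toBytes("1229731200000"),   // start (inclusive)
                         Bytes.toBytes("1229817600000"));  // stop (exclusive)
    // Server-side filter: only hand back rows for one client ID.
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".*/user-1234$")));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process each matching request
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}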
>>
>>     
>>> Or would I have to write an M/R job that does a quick parse of the key
>>> and skips any row key that doesn't fit within my time range?
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>> On Sat, Dec 20, 2008 at 8:02 PM,  <jlist@streamy.com> wrote:
>>>
>>>       
>>>> Ryan,
>>>>
>>>> The real question is how you want to query them.
>>>>
>>>> Do you want to look at them in chronological order?  Do you want to be
>>>> able to efficiently look at all requests for a particular user?  All time,
>>>> or a particular time period?  Efficiently access a known request's (user +
>>>> timestamp) serialized object?  Or do you just want to see all of it all the
>>>> time, i.e. your MR jobs will scan across everything?
>>>>
>>>> 1000s of columns should be no problem; I have hundreds of thousands in
>>>> production in a single row/family.  There may be issues with millions, and
>>>> you'll need to take into account the potential size of your objects.  A
>>>> row can only grow to the size of a region (which defaults to 256MB but is
>>>> configurable).
>>>>
>>>> Your suggested design is best suited for looking at all requests for a
>>>> user, less-so if you're interested in looking at things with respect to
>>>> time.  Though if you are only concerned with MR jobs, you typically have
>>>> the entire table as input so this design can be okay for looking only at
>>>> certain time ranges.
>>>>
>>>> Another possibility might be to have row keys that are timestamp+user/ip.
>>>> Your table would be ordered by time, so it would be easier to use scanners
>>>> to efficiently seek to a stamp and look forward.  I've not actually
>>>> attempted an MR job with a startRow, so I'm not sure whether it's easy to
>>>> do.  But in the case that you end up with years' worth of data (thousands
>>>> of regions in a table) and you want to process 1 day, it could end up
>>>> being much more efficient not having to scan everything (thousands of
>>>> unnecessary map tasks).
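
(As a concrete illustration of that timestamp-first layout, a small helper 
for building keys whose byte order matches time order; the zero-padding and 
the "/" separator are assumptions for the example, not anything prescribed 
here.)

import org.apache.hadoop.hbase.util.Bytes;

public final class RequestKeys {
  // Row key layout: "<13-digit epoch millis>/<user or ip>".
  // Left-padding the stamp keeps lexicographic (byte) order equal to time
  // order, so a scanner can seek to a stamp and read forward.
  static byte[] requestKey(long timestampMillis, String userOrIp) {
    return Bytes.toBytes(String.format("%013d/%s", timestampMillis, userOrIp));
  }

  // Boundary key for "everything from this instant on", regardless of user.
  static byte[] timeBoundary(long timestampMillis) {
    return Bytes.toBytes(String.format("%013d", timestampMillis));
  }
}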
>>>>
>>>> I'm thinking out loud a bit, hopefully others chime in :)
>>>>
>>>> JG
>>>>
>>>> On Sat, December 20, 2008 3:34 pm, Ryan LeCompte wrote:
>>>>
>>>>         
>>>>> Hello all,
>>>>>
>>>>>
>>>>> I'd like a little advice on the best way to design a table in HBase.
>>>>> Basically, I want to store Apache access log requests in HBase so that
>>>>> I can query them efficiently. The problem is that each request may
>>>>> have 100s of parameters, and many requests can come in for the same
>>>>> user/IP address.
>>>>>
>>>>> So, I was thinking of the following:
>>>>>
>>>>>
>>>>> 1 table called "requests" and a single column family called "request"
>>>>>
>>>>>
>>>>> Each row would have a key representing the user's IP address/unique
>>>>> identifier; the columns would be timestamps of when each request
>>>>> occurred, and the cell values would be serializable Java objects
>>>>> representing all the URL parameters of the Apache web server log request
>>>>> at that specific time.
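
(A rough sketch of a write under that layout, using the later client Put 
API rather than the 0.19-era one; the family/qualifier names and the 
serialization step are placeholders.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RequestWriter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "requests");
    try {
      // Row key: the user's IP address / unique identifier.
      Put put = new Put(Bytes.toBytes("192.168.0.7"));
      // One column per request: family "request", qualifier = request time,
      // value = the serialized parameter object (serialization left out).
      put.add(Bytes.toBytes("request"),
              Bytes.toBytes(String.format("%013d", System.currentTimeMillis())),
              serializeParams());
      table.put(put);
    } finally {
      table.close();
    }
  }

  // Placeholder: serialize the request's URL parameters however you like.
  static byte[] serializeParams() {
    return Bytes.toBytes("path=/index.html&referrer=example.org");
  }
}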
>>>>>
>>>>> Possible problems:
>>>>>
>>>>>
>>>>> 1) There may be thousands of requests that belong to a single unique
>>>>> identifier (so there would be 1000s of columns)
>>>>>
>>>>> Any suggestions on how to represent this best? Is anyone doing
>>>>> anything similar?
>>>>>
>>>>> FYI: I'm using Hadoop 0.19 and HBase-TRUNK.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>         
>>     

