cassandra-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Re: Columns limit
Date Sun, 08 Aug 2010 04:06:13 GMT
On 8/7/10 7:04 PM, Benjamin Black wrote:
> certainly it matters: your previous version is not bounded on time, so
> will grow without bound.  ergo, it is not a good fit for cassandra.
>
> On Sat, Aug 7, 2010 at 2:51 PM, Mark<static.void.dev@gmail.com>  wrote:
>    
>> On 8/7/10 2:33 PM, Benjamin Black wrote:
>>      
>>> Right, this is an index row per time interval (your previous email was
>>> not).
>>>
>>> On Sat, Aug 7, 2010 at 11:43 AM, Mark<static.void.dev@gmail.com>    wrote:
>>>
>>>        
>>>> On 8/7/10 11:30 AM, Mark wrote:
>>>>
>>>>          
>>>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>>>
>>>>>            
>>>>>>> Ok, I think the part I was missing was the concatenation of the
>>>>>>> key and partition to do the lookups. Is this the preferred way of
>>>>>>> accomplishing needs such as this? Are there alternative ways?
>>>>>>>
>>>>>>>                
>>>>>> Depending on your needs you can concat the row key or use super
>>>>>> columns.
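>>>>>>
>>>>>> For example (a rough sketch using the Ruby cassandra gem; the
>>>>>> "SearchLogByDay" CF name is invented here just for illustration,
>>>>>> client is a Cassandra.new connection, and uuid stands in for a
>>>>>> TimeUUID):
>>>>>>
>>>>>>   # option 1: concatenated row key in a regular CF
>>>>>>   client.insert("SearchLogByDay", "127.0.0.1:20100807", { uuid => "" })
>>>>>>
>>>>>>   # option 2: super CF, one row per ip, one super column per day
>>>>>>   client.insert("IPSearchLog", "127.0.0.1",
>>>>>>                 { "20100807" => { uuid => "" } })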
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> How would one then "query" over multiple days? Same question for
>>>>>>> all days. Should I use range_slice or multiget_slice? And if it's
>>>>>>> range_slice, does that mean I need OrderPreservingPartitioner?
>>>>>>>
>>>>>>>                
>>>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>>>> and use multiget_slice.
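>>>>>>
>>>>>> For example (a sketch with the Ruby cassandra gem; the keyspace
>>>>>> name and server are made up):
>>>>>>
>>>>>>   require 'cassandra'
>>>>>>   require 'date'
>>>>>>
>>>>>>   client = Cassandra.new("Logs", "127.0.0.1:9160")
>>>>>>
>>>>>>   # build the last 7 day keys in the app...
>>>>>>   keys = (0...7).map { |i| (Date.today - i).strftime("%Y-%m-%d") }
>>>>>>
>>>>>>   # ...then fetch them all in one call
>>>>>>   rows = client.multi_get("SearchLog", keys)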
>>>>>>
>>>>>> If you want to get all days where a specific ip address had some
>>>>>> requests you'll just need another CF where the row key is the addr
>>>>>> and column names are the days (values optional again). Pretty much
>>>>>> the same all over again, just add another CF and insert the data
>>>>>> you need.
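>>>>>>
>>>>>> Roughly (sketch again; "DaysByIP" is an invented CF name and
>>>>>> client is set up as above):
>>>>>>
>>>>>>   # on each request, mark that this addr was seen today
>>>>>>   client.insert("DaysByIP", "127.0.0.1", { "2010-08-07" => "" })
>>>>>>
>>>>>>   # later: all days on which this addr showed up
>>>>>>   days = client.get("DaysByIP", "127.0.0.1").keys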
>>>>>>
>>>>>> get_range_slice in my experience is better used for "offline" tasks
>>>>>> where you really want to process every row there is.
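>>>>>>
>>>>>> (With the gem that would be along the lines of
>>>>>> client.get_range("SearchLog"), walking every row -- option names
>>>>>> vary between gem versions, so treat that as a sketch.)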
>>>>>>
>>>>>> /thomas
>>>>>>
>>>>>>              
>>>>> Ok... as an example, for looking up logs by ip over a certain
>>>>> timeframe/range, would this work?
>>>>>
>>>>> <ColumnFamily Name="SearchLog"/>
>>>>>
>>>>> <ColumnFamily Name="IPSearchLog"
>>>>>                            ColumnType="Super"
>>>>>                            CompareWith="UTF8Type"
>>>>>                            CompareSubcolumnsWith="TimeUUIDType"/>
>>>>>
>>>>> Resulting in a structure like:
>>>>>
>>>>> {
>>>>>   "127.0.0.1" : {
>>>>>        "2010080711" : {
>>>>>             uuid1 : ""
>>>>>             uuid2: ""
>>>>>             uuid3: ""
>>>>>        }
>>>>>       "2010080712" : {
>>>>>             uuid1 : ""
>>>>>             uuid2: ""
>>>>>             uuid3: ""
>>>>>        }
>>>>>    }
>>>>>   "some.other.ip" : {
>>>>>        "2010080711" : {
>>>>>             uuid1 : ""
>>>>>        }
>>>>>    }
>>>>> }
>>>>>
>>>>> Where each uuid is the key used for SearchLog. Is there anything
>>>>> wrong with this? I know there is a 2 billion column limit, but in
>>>>> this case that would never be exceeded because each column
>>>>> represents an hour. However, does the above "schema" imply that for
>>>>> any given IP there can only be a maximum of 2GB of data stored?
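>>>>>
>>>>> In code a write under this schema would be something like (Ruby gem
>>>>> again, uuid standing in for a TimeUUID):
>>>>>
>>>>>   client.insert("IPSearchLog", "127.0.0.1",
>>>>>                 { "2010080711" => { uuid => "" } })
>>>>>
>>>>>   # read back one hour's entries for an ip
>>>>>   client.get("IPSearchLog", "127.0.0.1", "2010080711")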
>>>>>
>>>>>            
>>>> Or should I invert the ip with the time slices? The limitation of
>>>> this seems to be that there can only be 2 billion unique ips per
>>>> hour, which is more than enough for our application :)
>>>>
>>>> {
>>>>   "2010080711" : {
>>>>        "127.0.0.1" : {
>>>>             uuid1 : ""
>>>>             uuid2: ""
>>>>             uuid3: ""
>>>>        }
>>>>       "some.other.ip" : {
>>>>             uuid1 : ""
>>>>             uuid2: ""
>>>>             uuid3: ""
>>>>        }
>>>>    }
>>>>   "2010080712" : {
>>>>        "127.0.0.1" : {
>>>>             uuid1 : ""
>>>>        }
>>>>    }
>>>> }
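>>>>
>>>> A sketch of reads under this inverted layout (reusing the
>>>> IPSearchLog name for illustration):
>>>>
>>>>   # every ip seen in hour 2010080711, with their entry uuids
>>>>   client.get("IPSearchLog", "2010080711")
>>>>
>>>>   # just one ip within that hour
>>>>   client.get("IPSearchLog", "2010080711", "127.0.0.1")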
>>>>
>>>>
>>>>
>>>>          
>> In the end, does it really matter which one to go with? I kind of like
>> the previous version so I don't have to build up all the keys for the
>> multi_get and can instead just provide a start & finish for the columns
>> (time frames).
>>
>>      
Is there any performance penalty for a multi_get that includes x keys 
versus a get on 1 key with a start/finish range of x?

Using your gem,

multi_get("SearchLog", ["20090101"..."20100807"], "127.0.0.1")
vs
get("SearchLog", "127.0.0.1", :start => "20090101", :finish => ""127.0.0.1")

Thanks
