incubator-cassandra-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Re: Columns limit
Date Sat, 07 Aug 2010 21:51:39 GMT
On 8/7/10 2:33 PM, Benjamin Black wrote:
> Right, this is an index row per time interval (your previous email was not).
>
> On Sat, Aug 7, 2010 at 11:43 AM, Mark<static.void.dev@gmail.com>  wrote:
>    
>> On 8/7/10 11:30 AM, Mark wrote:
>>      
>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>        
>>>>> Ok, I think the part I was missing was the concatenation of the key and
>>>>> partition to do the lookups. Is this the preferred way of accomplishing
>>>>> needs such as this? Are there alternative ways?
>>>>>            
>>>> Depending on your needs you can concat the row key or use super columns.
>>>>
>>>>          
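A minimal sketch of the row-key concatenation, in Python; the "ip:hour"
key format here is illustrative, not something prescribed in the thread:

    # Bucket rows by concatenating the entity with a time period.
    ip, hour = "127.0.0.1", "2010080711"
    row_key = "%s:%s" % (ip, hour)   # -> "127.0.0.1:2010080711"
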
>>>>> How would one then "query" over multiple days? Same question for all
>>>>> days. Should I use range_slice or multiget_slice? And if it's
>>>>> range_slice, does that mean I need OrderPreservingPartitioner?
>>>>>            
>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>> and use multiget_slice.
>>>>
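A minimal sketch of that pattern using the pycassa client (the keyspace
name 'Keyspace1' and the 'DayIndex' CF are placeholder assumptions, not
names from this thread):

    from datetime import date, timedelta
    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    day_index = pycassa.ColumnFamily(pool, 'DayIndex')

    # Generate the row keys for the last N day buckets in the app...
    days = [(date.today() - timedelta(days=i)).strftime('%Y-%m-%d')
            for i in range(3)]

    # ...and fetch them all in one round trip. No OrderPreservingPartitioner
    # is needed, since the keys are enumerated rather than scanned as a range.
    rows = day_index.multiget(days)
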
>>>> If you want to get all days where a specific ip address had some
>>>> requests you'll just need another CF where the row key is the addr and
>>>> column names are the days (values optional again). Pretty much the
>>>> same all over again, just add another CF and insert the data you need.
>>>>
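Sketched with the same hypothetical pycassa setup (the 'IPDays' CF name
is again an assumption):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    ip_days = pycassa.ColumnFamily(pool, 'IPDays')

    # Write path: one row per address, one (empty-valued) column per day seen.
    ip_days.insert('127.0.0.1', {'2010-08-07': ''})

    # "All days for this IP" is then a single-row column slice.
    days_seen = list(ip_days.get('127.0.0.1').keys())
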
>>>> get_range_slice in my experience is better used for "offline" tasks
>>>> where you really want to process every row there is.
>>>>
>>>> /thomas
>>>>          
>>> Ok... as an example, for looking up logs by IP for a certain
>>> timeframe/range, would this work?
>>>
>>> <ColumnFamily Name="SearchLog"/>
>>>
>>> <ColumnFamily Name="IPSearchLog"
>>>               ColumnType="Super"
>>>               CompareWith="UTF8Type"
>>>               CompareSubcolumnsWith="TimeUUIDType"/>
>>>
>>> Resulting in a structure like:
>>>
>>> {
>>>   "127.0.0.1" : {
>>>     "2010080711" : {
>>>       uuid1 : "",
>>>       uuid2 : "",
>>>       uuid3 : ""
>>>     },
>>>     "2010080712" : {
>>>       uuid1 : "",
>>>       uuid2 : "",
>>>       uuid3 : ""
>>>     }
>>>   },
>>>   "some.other.ip" : {
>>>     "2010080711" : {
>>>       uuid1 : ""
>>>     }
>>>   }
>>> }
>>>
>>> Where each uuid is the key used for SearchLog. Is there anything wrong
>>> with this? I know there is a 2 billion column limit, but in this case that
>>> would never be exceeded because each column represents an hour. However,
>>> does the above "schema" imply that for any given IP there can only be a
>>> maximum of 2GB of data stored?
>>>        
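If that layout works, reads and writes against it might look like the
following (a pycassa sketch; only the CF name comes from the schema
above, the rest is assumed):

    import uuid
    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
    ip_log = pycassa.ColumnFamily(pool, 'IPSearchLog')

    # Write: supercolumn = hour bucket, subcolumn = TimeUUID, value unused.
    ip_log.insert('127.0.0.1', {'2010080711': {uuid.uuid1(): ''}})

    # Read: one row, sliced by a range of hour-bucket supercolumn names.
    hours = ip_log.get('127.0.0.1',
                       column_start='2010080711',
                       column_finish='2010080713')
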
>> Or should I invert the ip with the time slices? The limitation of this
>> seems to be that there can only be 2 billion unique ips per hour, which is
>> more than enough for our application :)
>>
>> {
>>   "2010080711" : {
>>     "127.0.0.1" : {
>>       uuid1 : "",
>>       uuid2 : "",
>>       uuid3 : ""
>>     },
>>     "some.other.ip" : {
>>       uuid1 : "",
>>       uuid2 : "",
>>       uuid3 : ""
>>     }
>>   },
>>   "2010080712" : {
>>     "127.0.0.1" : {
>>       uuid1 : ""
>>     }
>>   }
>> }
>>
>>
>>      
In the end, does it really matter which one to go with? I kind of like 
the previous version: I don't have to build up all the keys for the 
multiget_slice, and can instead just provide a start & finish for the 
columns (time frames).
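Concretely, the difference is one sliced get on a single row versus a
multiget over keys generated in the app (a pycassa sketch; 'IPSearchLog'
is from the schema above, 'HourSearchLog' is a placeholder for the
inverted layout):

    import pycassa

    pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])

    # Layout 1 (row per IP): a single get, sliced by hour-bucket names.
    ip_log = pycassa.ColumnFamily(pool, 'IPSearchLog')
    hours = ip_log.get('127.0.0.1',
                       column_start='2010080711',
                       column_finish='2010080713')

    # Layout 2 (row per hour): enumerate every bucket key, then multiget,
    # pulling just the one IP's supercolumn from each row.
    by_hour = pycassa.ColumnFamily(pool, 'HourSearchLog')
    rows = by_hour.multiget(['2010080711', '2010080712', '2010080713'],
                            columns=['127.0.0.1'])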
