incubator-cassandra-user mailing list archives

From Benjamin Black <...@b3k.us>
Subject Re: Columns limit
Date Sun, 08 Aug 2010 02:04:46 GMT
certainly it matters: your previous version is not bounded on time, so
will grow without bound.  ergo, it is not a good fit for cassandra.
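
A minimal sketch of what "bounded on time" means for the row key, assuming
hourly buckets named as in the quoted example (e.g. "2010080711"). The
function names and the ":" separator are hypothetical and not part of any
Cassandra API; timestamps are assumed to already be in one time zone.

```python
from datetime import datetime


def hour_bucket(ts: datetime) -> str:
    """Format a timestamp as an hourly bucket, e.g. '2010080711'."""
    return ts.strftime("%Y%m%d%H")


def time_row_key(ts: datetime) -> str:
    """Row key for an index row per time interval.

    Each row only receives writes during its own hour, so its size is
    bounded by one hour of traffic.
    """
    return hour_bucket(ts)


def ip_time_row_key(ip: str, ts: datetime) -> str:
    """Alternative: concatenate the ip and the bucket into one row key.

    The ':' separator is an assumption made for this sketch.
    """
    return f"{ip}:{hour_bucket(ts)}"
```

By contrast, a row keyed only by ip keeps accepting new hourly super columns
for as long as that ip keeps sending requests, which is the unbounded growth
described above.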

On Sat, Aug 7, 2010 at 2:51 PM, Mark <static.void.dev@gmail.com> wrote:
> On 8/7/10 2:33 PM, Benjamin Black wrote:
>>
>> Right, this is an index row per time interval (your previous email was
>> not).
>>
>> On Sat, Aug 7, 2010 at 11:43 AM, Mark<static.void.dev@gmail.com>  wrote:
>>
>>>
>>> On 8/7/10 11:30 AM, Mark wrote:
>>>
>>>>
>>>> On 8/7/10 4:22 AM, Thomas Heller wrote:
>>>>
>>>>>>
>>>>>> Ok, I think the part I was missing was the concatenation of the key
>>>>>> and
>>>>>> partition to do the look ups. Is this the preferred way of
>>>>>> accomplishing
>>>>>> needs such as this? Are there alternative ways?
>>>>>>
>>>>>
>>>>> Depending on your needs you can concat the row key or use super columns.
>>>>>
>>>>>
>>>>>>
>>>>>> How would one then "query" over multiple days? Same question for all
>>>>>> days.
>>>>>> Should I use range_slice or multiget_slice? And if it's range_slice,
>>>>>> does that mean I need OrderPreservingPartitioner?
>>>>>>
>>>>>
>>>>> The last 3 days is pretty simple: ['2010-08-07', '2010-08-06',
>>>>> '2010-08-05'], as is 7, 31, etc. Just generate the keys in your app
>>>>> and use multiget_slice.
>>>>>
>>>>> If you want to get all days where a specific ip address had some
>>>>> requests you'll just need another CF where the row key is the addr and
>>>>> column names are the days (values optional again). Pretty much the
>>>>> same all over again, just add another CF and insert the data you need.
>>>>>
>>>>> get_range_slice in my experience is better used for "offline" tasks
>>>>> where you really want to process every row there is.
>>>>>
>>>>> /thomas
>>>>>
>>>>
>>>> Ok... as an example, for looking up logs by ip for a certain
>>>> timeframe/range, would this work?
>>>>
>>>> <ColumnFamily Name="SearchLog"/>
>>>>
>>>> <ColumnFamily Name="IPSearchLog"
>>>>                           ColumnType="Super"
>>>>                           CompareWith="UTF8Type"
>>>>                           CompareSubcolumnsWith="TimeUUIDType"/>
>>>>
>>>> Resulting in a structure like:
>>>>
>>>> {
>>>>  "127.0.0.1" : {
>>>>       "2010080711" : {
>>>>            uuid1 : ""
>>>>            uuid2: ""
>>>>            uuid3: ""
>>>>       }
>>>>      "2010080712" : {
>>>>            uuid1 : ""
>>>>            uuid2: ""
>>>>            uuid3: ""
>>>>       }
>>>>   }
>>>>  "some.other.ip" : {
>>>>       "2010080711" : {
>>>>            uuid1 : ""
>>>>       }
>>>>   }
>>>> }
>>>>
>>>> where each uuid is the key used for SearchLog.  Is there anything wrong
>>>> with this? I know there is a 2 billion column limit but in this case that
>>>> would never be exceeded because each column represents an hour. However,
>>>> does the above "schema" imply that for any certain IP there can only be a
>>>> maximum of 2GB of data stored?
>>>>
>>>
>>> Or should I invert the ip with the time slices? The limitation of this
>>> seems to be that there can only be 2 billion unique ips per hour, which is
>>> more than enough for our application :)
>>>
>>> {
>>>  "2010080711" : {
>>>       "127.0.0.1" : {
>>>            uuid1 : ""
>>>            uuid2: ""
>>>            uuid3: ""
>>>       }
>>>      "some.other.ip" : {
>>>            uuid1 : ""
>>>            uuid2: ""
>>>            uuid3: ""
>>>       }
>>>   }
>>>  "2010080712" : {
>>>       "127.0.0.1" : {
>>>            uuid1 : ""
>>>       }
>>>   }
>>> }
>>>
>>>
>>>
>
> In the end does it really matter which one to go with? I kind of like the
> previous version so I don't have to build up all the keys for the multi_get
> and instead I can just provide a start & finish for the columns (time
> frames).
>
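
Building up the keys for a multiget is a small amount of application code.
The sketch below, using only the Python standard library, generates the day
keys from the "last 3 days" example and the hourly keys for a start/finish
range. The function names are hypothetical; the resulting list is what would
be passed as the keys argument of multiget_slice through whichever client is
in use.

```python
from datetime import date, datetime, timedelta


def last_n_day_keys(today: date, n: int) -> list[str]:
    """Day keys for the last n days, newest first, e.g. '2010-08-07'."""
    return [(today - timedelta(days=i)).isoformat() for i in range(n)]


def hour_keys(start: datetime, finish: datetime) -> list[str]:
    """Hourly bucket keys covering start..finish inclusive, oldest first."""
    # Round the start down to the top of its hour.
    cur = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cur <= finish:
        keys.append(cur.strftime("%Y%m%d%H"))
        cur += timedelta(hours=1)
    return keys
```

For example, last_n_day_keys(date(2010, 8, 7), 3) gives the same three keys
listed earlier in the thread, and hour_keys over 11:15 to 12:05 on that day
gives the two buckets "2010080711" and "2010080712".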
