incubator-cassandra-user mailing list archives

From Tyler Hobbs <ty...@riptano.com>
Subject Re: Recommended sort mechanism and partitioner
Date Fri, 15 Oct 2010 23:56:22 GMT
i) Yes

ii) Well, you don't actually want to use version 1 UUIDs for keys here.
Although they mostly increase in byte order over time, that only holds for
the first 8 bytes.  Instead, you can use something like:

'timestamp-foo'

Where 'foo' might be a randomly generated string or something unique per
client.

You could also use a formatted timestamp like 'YYYYMMDDHHMMSSmmm' instead
of epoch milliseconds if that makes queries easier for you.
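A minimal sketch of such a key, assuming a zero-padded epoch-millisecond
timestamp plus a per-client suffix (the function name and suffix are
illustrative, not from any Cassandra API; the example millis are taken from
the CF1 snippet quoted below):

```python
def make_row_key(millis: int, suffix: str) -> str:
    # Zero-pad the millisecond timestamp to a fixed width so plain
    # string comparison matches time order; 'suffix' is the unique
    # per-client (or random) 'foo' part.
    return "%013d-%s" % (millis, suffix)

k1 = make_row_key(1287165326492, "clientA")
k2 = make_row_key(1287165326523, "clientB")
print(k1 < k2)  # True: the later insert sorts after the earlier one
```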

- Tyler

On Fri, Oct 15, 2010 at 6:22 PM, Wicked J <wickedj2010@gmail.com> wrote:

> Tyler,
> Thanks for answering my question. Can you please clarify on point (c)?
>
> i] Are you saying that if I move to the second row (identified by a rowKey
> in Cassandra) after I hit 10 million column values for the first row, only
> then will the second row be written to a new node in the cluster? Meaning
> all 10 million column values within the first row (rowKey) until then will
> have been written to one and the same node, regardless of the number of
> nodes in the cluster.
>
> ii] Assume I change my data model to the one below (CF1) with an
> "OrderPreservingPartitioner"; would I then be able to read data in the
> order inserted? My understanding is that TimeUUID values cannot be used
> for row keys via the Thrift API in v0.6.4, i.e. from the insert method in
> Cassandra.Client, or am I missing something?
>
> CF1:
>
> Key: '1'
>   name: colname, value: 'First Inserted', timestamp: 1287165326492
> Key: '2'
>   name: colname, value: 'Second Inserted', timestamp: 1287165326523
>
> Thanks!
>
>
> On Fri, Oct 15, 2010 at 12:18 PM, Tyler Hobbs <tyler@riptano.com> wrote:
>
>> a) 10 mil sounds fine.  Just watch out for compaction. Huge rows can kill
>> you there,
>> from my understanding.
>>
>> b) Use RandomPartitioner unless you absolutely have to use something else.
>>
>> c) If you're inserting all along one row and only moving to another row
>> when you
>> hit 10 mil, you're only going to be writing to one node at a time.  In
>> this sense,
>> you might want to consider using the TimeUUID as a row key instead.
>> There's
>> not really a problem with having tons of rows in a column family.
>>
>> If you want to be able to get a slice of time with this scheme, you can
>> either use
>> an order preserving partitioner or have a second column family with an
>> index
>> row (or rows) sorted by TimeUUID. (This sounds like what you're
>> suggesting.)
>>
>> - Tyler
>>
>>
>> I wrote some thoughts about this on my blog. I think it's still mostly
>>> correct:
>>>
>>>  * http://www.ayogo.com/techblog/2010/04/sorting-in-cassandra/
>>>
>>> On Fri, Oct 15, 2010 at 11:14 AM, Wicked J <wickedj2010@gmail.com>
>>> wrote:
>>> > Hi,
>>> > I'm using the TimeUUID/sort-by-column-name mechanism. The column
>>> > values can contain text data (and in the future may contain image
>>> > data as well), leading to the possibility of a row outgrowing RAM
>>> > capacity. Given this background, my questions are:
>>> >
>>> > a] How many columns are recommended per row? Based on my app's needs,
>>> > I imagine 10 million would be a good starting point for the max limit
>>> > (based on text data). Also note that my app will search in ranges of
>>> > 100 or 200 columns when there are large numbers of records (columnar
>>> > data), without a caching solution in front.
>>> > b] What partitioner is recommended, so that the load across cluster
>>> > nodes is not largely uneven?
>>> > c] Would you recommend changing the TimeUUID/columnar sort mechanism
>>> > (with a change in the data model) to sorting by row key instead? If
>>> > so, what partitioner is recommended, again keeping the load from
>>> > being largely uneven?
>>> >
>>> > Thanks
>>> >
>>>
>>
>>
>
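For reference, the index-row scheme mentioned earlier in the thread (a
second column family whose TimeUUID column names point at data row keys)
can be sketched in Python. A plain dict stands in for the column family
here; in Cassandra the TimeUUID comparator would keep the columns sorted,
which the sort by the UUID's embedded timestamp models. All names are
illustrative:

```python
import uuid

# Hypothetical in-memory stand-in for the index row: column names are
# version 1 (time-based) UUIDs, column values point at data row keys.
index_row = {}

def record_insert(data_row_key):
    col_name = uuid.uuid1()          # TimeUUID column name
    index_row[col_name] = data_row_key
    return col_name

record_insert("row-1")
record_insert("row-2")

# A "slice of time" is then a contiguous range of column names,
# ordered by the UUID's embedded 60-bit timestamp (.time in Python).
ordered = sorted(index_row, key=lambda u: u.time)
print([index_row[u] for u in ordered])  # ['row-1', 'row-2']
```

CPython's uuid1() bumps the timestamp when two calls land on the same clock
tick, so the insertion order is preserved within one process.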
