cassandra-user mailing list archives

From Даниел Симеонов <dsimeo...@gmail.com>
Subject Re: question about how columns are deserialized in memory
Date Wed, 28 Apr 2010 15:36:05 GMT
Hi,
  What about the case where the upper bound on columns in a row is only
loosely defined, i.e. it is okay to have a maximum of around 100, for
example, but not exactly (maybe 105 or 110)?
And if I make a slice query that returns, say, 1/5th of the columns in a
row, I believe such a query will again not deserialize all columns in memory?
Best regards, Daniel.
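[Editor's note: the slice semantics discussed in this thread can be pictured with a toy model. This is plain Python, not the actual Cassandra Thrift API; the row-as-sorted-map representation and the function name are illustrative only.]

```python
from bisect import bisect_left

def get_slice(row, start="", count=100):
    """Toy model of a Cassandra slice: return at most `count` columns
    with names >= `start`, in sorted name order. Only the requested
    columns are materialized, not the whole row."""
    names = sorted(row)  # stand-in for the column comparator's sort order
    i = bisect_left(names, start) if start else 0
    return [(name, row[name]) for name in names[i:i + count]]

# a 500-column row; a count=100 slice touches only 1/5th of it
row = {f"ts{t:04d}": f"value-{t}" for t in range(500)}
fifth = get_slice(row, count=100)
print(len(fifth), fifth[0][0])  # 100 ts0000
```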

2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>

> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> > Hi Sylvain,
> >   Thank you very much! I still have some further questions: I couldn't
> > find how the row cache is configured.
>
> Provided you don't use trunk but something stable like 0.6.1 (which
> you should), it is in storage-conf.xml. It is one of the options in the
> column family definitions (it is documented in the file).
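[Editor's note: for the 0.6 line, the relevant attributes are `RowsCached` and `KeysCached` on the `ColumnFamily` element in storage-conf.xml. A sketch, with illustrative names and values; check the comments in your own sample file:]

```xml
<Keyspace Name="Keyspace1">
  <!-- RowsCached/KeysCached take an absolute count or a percentage like "20%" -->
  <ColumnFamily Name="Standard1"
                CompareWith="BytesType"
                RowsCached="10000"
                KeysCached="100000"/>
</Keyspace>
```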
>
> > Regarding the splitting of rows, I understand that it is not strictly
> > necessary, but I am still curious whether it could be implemented in
> > client code.
>
> Well, I'm not sure there is any simple way to do it (at least not
> efficiently). Counting the number of columns in a row is expensive,
> plus there is no easy way to implement counters in Cassandra (even
> though https://issues.apache.org/jira/browse/CASSANDRA-580
> will make that better someday).
>
> > Best regards, Daniel.
> >
> > 2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>
> >>
> >> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> >> > Hi,
> >> > I have a question: if a row in a Column Family has only columns,
> >> > are all of the columns deserialized in memory when you need any of
> >> > them? As I understood it, that is the case,
> >>
> >> No, it's not. Only the columns you request are deserialized in memory.
> >> The only thing is that, as of now, during compaction the entire row
> >> will be deserialized at once, so it still has to fit in memory. But
> >> depending on the typical size of your columns, you can easily have
> >> millions of columns in a row without it being a problem at all.
> >>
> >> > and if the Column Family is a super Column Family, is only the
> >> > (entire) Super Column brought into memory?
> >>
> >> Yes, that part is true. That is the problem with the current
> >> implementation of super columns. While you can have lots of columns
> >> in one row, you probably don't want to have lots of columns in one
> >> super column (but it's no problem to have lots of super columns in
> >> one row).
> >>
> >> > What about the row cache, is it different from the memtable?
> >>
> >> Be careful with the row cache. If it is enabled, then yes, any read
> >> from a row will read the entire row. So you typically don't want to
> >> use the row cache on a column family whose rows have lots of columns
> >> (unless you always read all the columns in the row each time, of
> >> course).
> >>
> >> > I have another question: let's say there is only data to be
> >> > inserted, and a solution is to have columns added to rows in a
> >> > Column Family. Is it possible in Cassandra to split a row if a
> >> > certain threshold is reached, say 100 columns per row? What about
> >> > concurrent inserts?
> >>
> >> No, Cassandra can't do that for you. But you should be okay with what
> >> you describe below. That is, if a given row corresponds to an hour of
> >> data, that will limit its size. And again, the number of columns in a
> >> row is not really limited as long as the overall size of the row fits
> >> easily in memory.
> >>
> >> > The original data model and use case is to insert timestamped data
> >> > and to make range queries. The original keys of CF rows were in the
> >> > form <id>.<timestamp> with a single column of data, and OPP was
> >> > used. This is not an optimal solution, since some nodes are hotter
> >> > than others. I am thinking of changing the model so that keys look
> >> > like <id>.<year/month/day>, followed by a list of columns with
> >> > timestamps within that range, using either RandomPartitioner or OPP
> >> > with part of the key preprocessed with MD5, i.e. the key is
> >> > MD5(<id>.<year/month/day>) + "hour of the day". The only problem is
> >> > how to deal with a large number of columns being inserted into a
> >> > particular row.
> >> > Thank you very much!
> >> > Best regards, Daniel.
> >
> >
>
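[Editor's note: the bucketed key scheme Daniel describes above can be sketched in plain Python with the stdlib's hashlib. The function and variable names here are illustrative, not from the thread.]

```python
import hashlib
from datetime import datetime

def bucket_key(entity_id, ts):
    """Build a row key of the form MD5(<id>.<year/month/day>) + hour,
    as described in the thread: hashing the day bucket spreads rows
    evenly across nodes even under OPP, while the hour suffix bounds
    how many columns accumulate in any one row."""
    day = ts.strftime("%Y/%m/%d")
    digest = hashlib.md5(f"{entity_id}.{day}".encode()).hexdigest()
    return f"{digest}.{ts.hour:02d}"

k = bucket_key("sensor42", datetime(2010, 4, 28, 15, 36))
```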
