cassandra-user mailing list archives

From <SEAN_R_DUR...@homedepot.com>
Subject RE: Two problems with Cassandra
Date Tue, 17 Feb 2015 20:03:21 GMT
Full table scans are not the best use case for Cassandra. Without some kind of pagination,
the node taking the request (the coordinator node) will try to assemble the data from all
nodes to return to the client. With a dataset of any decent size, that will overwhelm the
coordinator.
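To make the pagination point concrete, here is a minimal sketch, assuming the DataStax Python driver (cassandra-driver) against Cassandra 2.0 or later. The names scan_in_pages, paged_statement and handle_row are made up for illustration; SimpleStatement and its fetch_size argument come from the driver. With a fetch size set, the coordinator returns one page at a time instead of assembling the whole result.

```python
def paged_statement(query, fetch_size=500):
    """Build a statement that asks the coordinator for `fetch_size` rows per page.

    The import is local so the rest of this sketch can be read (and exercised
    with a stand-in session) without the driver installed.
    """
    from cassandra.query import SimpleStatement  # DataStax Python driver
    return SimpleStatement(query, fetch_size=fetch_size)


def scan_in_pages(session, statement, handle_row):
    """Iterate over a result set and return the number of rows handled.

    With the Python driver, iterating the result of session.execute() fetches
    the next page transparently when the current one is exhausted, so only
    about one page of rows is held in client memory at a time.
    """
    count = 0
    for row in session.execute(statement):
        handle_row(row)
        count += 1
    return count
```

Against a real cluster this would be called with something like scan_in_pages(session, paged_statement("SELECT id, col FROM ks.tbl"), process), where the keyspace, table and process function are placeholders.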

Pagination is supported in newer versions of Cassandra (2.0.x+, I think) and some drivers.
You can see there is other discussion on the list about the best ways to split your workload
and do some parallel processing. Something I haven’t seen mentioned recently (but probably
discussed before I joined the list) is setting up a separate analytics DC. There you could
integrate with Hadoop or Spark, or just size your nodes differently to handle an analytics-type
workload.
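One common way to split the workload is by token range, so that each worker scans only a slice of the ring. The sketch below assumes the Murmur3Partitioner (tokens from -2^63 to 2^63 - 1); split_token_ring and range_query are hypothetical helper names, and the table and column names passed in are placeholders. Whether any key can hash exactly to the minimum token is partitioner-specific, so check the lower bound of the first range for your setup.

```python
MIN_TOKEN = -(2 ** 63)      # assumed Murmur3Partitioner lower bound
MAX_TOKEN = 2 ** 63 - 1     # assumed Murmur3Partitioner upper bound


def split_token_ring(num_splits):
    """Return `num_splits` contiguous (start, end] token ranges covering the ring."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, partition_key, columns):
    """Build a CQL string (with ? markers, for a prepared statement) for one slice."""
    return "SELECT {cols} FROM {table} WHERE token({pk}) > ? AND token({pk}) <= ?".format(
        cols=", ".join(columns), table=table, pk=partition_key
    )
```

Each (start, end) pair can then be bound to the prepared query and handed to a separate worker, which keeps any single request small.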

We have found that it is better to use a list of known keys and pull back rows (aka partitions)
individually for any table scan type operations. However, we are usually able to generate
the list of keys outside of Cassandra…
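The key-list approach could look roughly like this. It is a sketch, not our actual code: fetch_by_keys, known_keys and handle_row are illustrative names, and `prepared` is assumed to be a statement prepared with the Python driver, along the lines of session.prepare("SELECT ... FROM ks.tbl WHERE id = ?").

```python
def fetch_by_keys(session, prepared, known_keys, handle_row):
    """Read one partition per request, using keys generated outside Cassandra.

    Each execute() touches a single partition, so no request asks the
    coordinator to assemble data from the whole cluster.
    Returns the number of rows handled.
    """
    count = 0
    for key in known_keys:
        # Bind the partition key as a one-element tuple of parameters.
        for row in session.execute(prepared, (key,)):
            handle_row(row)
            count += 1
    return count
```

Because known_keys can be any iterable, the key list can be streamed from a file rather than loaded into memory all at once.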


Sean Durity – Cassandra Admin, Home Depot

From: Pavel Velikhov [mailto:pavel.velikhov@gmail.com]
Sent: Thursday, February 12, 2015 4:23 AM
To: user@cassandra.apache.org
Subject: Re: Two problems with Cassandra


On Feb 12, 2015, at 12:37 AM, Robert Coli <rcoli@eventbrite.com> wrote:

On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov <pavel.velikhov@gmail.com> wrote:
  2. While trying to update the full dataset with a simple transformation (again via the Python
driver), single-node and clustered Cassandra run out of memory no matter what settings I try,
even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one
column, especially when there is a lot of processing overhead) work just fine.

What does a "simple transformation" mean here? Assuming a reasonable sized heap, OOM sounds
like you're trying to update a large number of large partitions in a single operation.

In general, in Cassandra, you're best off interacting with a single or small number of partitions
in any given interaction.

=Rob


Hi Robert!

  A simple transformation is changing just a single column value (though I usually do it for the
whole dataset).
  But when I was running out of memory, I was reading in 5 columns and updating 3. Some of
them could be big, but I need to check and rerun this case.
  (I worked around this by dumping to files and then scanning the files and updating the database,
but this stinks!)

  I don’t quite understand the fundamentals of Cassandra - if I’m just doing one scan
with a reasonable number of columns that I fetch, and I’m updating at the same time, what’s
happening there? Why eat up so much memory and die?
