Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of rolo@pythian.com designates
 209.85.213.175 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <288863E0-3555-4F9A-8008-F40AE70CEBE9@gmail.com>
References: <288863E0-3555-4F9A-8008-F40AE70CEBE9@gmail.com>
From: Carlos Rolo <rolo@pythian.com>
Date: Wed, 11 Feb 2015 11:48:53 +0100
Message-ID: 
 <CALcD3Ptw+fkRy=BVRd6ong=sPFSxLTRs-D5sPm2op3G0_X0W6Q@mail.gmail.com>
Subject: Re: Two problems with Cassandra
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=bcaec51b1f2d0d3d83050ecdc1d2

--bcaec51b1f2d0d3d83050ecdc1d2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hello Pavel,

What is the size of the cluster (number of nodes)? And do you need to
iterate over the full 1TB every time you do the update, or just parts of it?

IMO there is not enough information here to make any kind of assessment of
the problem you are having.

I suggest trying a 2.0.x (or 2.1.1) release to see if you get the same
problem.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
<http://linkedin.com/in/carlosjuzarterolo>
Tel: 1649
www.pythian.com

On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov <pavel.velikhov@gmail.com>
wrote:

> Hi,
>
>   I'm using Cassandra to store NLP data, the dataset is not that huge
> (about 1TB), but I need to iterate over it quite frequently, updating the
> full dataset (each record, but not necessarily each column).
>
>   I've run into two problems (I'm using the latest Cassandra):
>
>   1. I was trying to copy from one Cassandra cluster to another via a
> python driver, however the driver confused the two instances
>   2. While trying to update the full dataset with a simple transformation
> (again via python driver), single node and clustered Cassandra run out of
> memory no matter what settings I try, even if I put a lot of sleeps into
> the mix. However simpler transformations (updating just one column,
> especially when there is a lot of processing overhead) work just fine.
>
> I'm really concerned about #2, since we're moving all heavy processing to
> a Spark cluster and will expand it, and I would expect much heavier
> traffic to/from Cassandra. Any hints, war stories, etc. very appreciated!
>
> Thank you,
> Pavel Velikhov

--bcaec51b1f2d0d3d83050ecdc1d2--