From: Patrik Modesto <patrik.modesto@gmail.com>
Date: Tue, 17 Apr 2012 13:25:33 +0200
Subject: Re: Poor write performance with secondary index
To: user@cassandra.apache.org

Hi Aaron, thanks for the reply. I suspected it might be the
read-and-write that causes the slower updates.
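
For anyone reading the archive, here is a toy sketch of that
read-and-write pattern. It is a plain in-memory Python model, not
Cassandra code: the class ToyColumnFamily, its counters, and the way
operations are counted are all assumptions made for illustration.

```python
class ToyColumnFamily:
    """Toy in-memory model of a column family (hypothetical, not Cassandra)."""

    def __init__(self, indexed_column=None):
        self.rows = {}    # row key -> {column name: value}
        self.index = {}   # indexed value -> set of row keys
        self.indexed_column = indexed_column
        self.reads = 0    # lookups of existing data
        self.writes = 0   # data writes plus index writes

    def mutate(self, key, column, value):
        if column == self.indexed_column:
            # Read-before-write: the old value is needed to find and
            # remove the stale index entry.
            self.reads += 1
            old = self.rows.get(key, {}).get(column)
            if old is not None and old != value:
                self.index[old].discard(key)
                self.writes += 1  # remove the stale index entry
            self.index.setdefault(value, set()).add(key)
            self.writes += 1      # add the new index entry
        # The data write itself needs no lookup.
        self.rows.setdefault(key, {})[column] = value
        self.writes += 1


def count_ops(indexed):
    """Apply the same two updates and return (reads, writes)."""
    cf = ToyColumnFamily("groupId" if indexed else None)
    cf.mutate("url1", "groupId", "a")
    cf.mutate("url1", "groupId", "b")
    return cf.reads, cf.writes
```

Comparing count_ops(False) with count_ops(True) shows the extra read
and extra index writes per update; the real cost on disk depends on
caching and compaction, which this model ignores.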

Regards,
P.

On Tue, Apr 17, 2012 at 11:52, aaron morton <aaron@thelastpickle.com> wrote:
> Secondary indexes require a read and a write (potentially two) for every
> update. Regular mutations are no-look writes (no read before the write)
> and are much faster.
>
> Just like in an RDBMS, it's more efficient to insert data and then create the
> index than to insert data with the index present.
>
> An alternative is to create SSTables in the Hadoop jobs and bulk load them
> into the cluster.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/04/2012, at 2:51 AM, Patrik Modesto wrote:
>
> Hi,
>
> I've a 4-node test cluster running Cassandra 1.0.9, 32GB memory, 4x
> 1TB disks. I've two keyspaces, rfTest2 (RF=2) and rfTest3 (RF=3).
> There are two CFs, one with source data and one with a secondary index:
>
> create column family UrlGroup
>    with column_type=Standard
>    and comparator=UTF8Type
>    and default_validation_class=UTF8Type
>    and key_validation_class=UTF8Type
>    and column_metadata=
>    [{
>        column_name: groupId,
>        validation_class: UTF8Type,
>        index_type: KEYS
>    }];
>
> I'm running a Hadoop MapReduce job, reading the source CF and creating 3
> mutations for each row-key in the UrlGroup CF.
>
> The MapReduce job runs for 30 minutes. When I remove the secondary index,
> it runs for just 10 minutes. There are 26,273,544 mutations
> total.
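
For scale, the figures above can be turned into rough cluster-wide
mutation rates. This is a back-of-the-envelope sketch that treats the
whole job wall time as write time (an assumption; the job also spends
time reading the source CF), and the names below are only illustrative.

```python
MUTATIONS = 26_273_544  # total mutations reported for the job


def mutations_per_second(minutes):
    """Average cluster-wide mutation rate for a job of the given length."""
    return MUTATIONS / (minutes * 60)


with_index = mutations_per_second(30)     # about 14,600 per second
without_index = mutations_per_second(10)  # about 43,800 per second
slowdown = without_index / with_index     # 3x slower with the index
```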
>
> Also, with the secondary index, the nodes show very high load (50+) and
> iowait (70%+). Without the secondary index the load is ~5 and iowait ~10%.
>
> What may be the problem?
>
> Regards,
> Patrik
>
>