Subject: Re: BinaryMemtable and collisions
From: Jake Luciani <jakers@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 8 May 2010 01:09:12 -0400

Any reason why you aren't using Lucandra directly?

On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:

> Greetings,
>
> I started getting my feet wet with Cassandra in earnest this week. I'm
> building a custom inverted index of sorts on top of Cassandra, in part
> inspired by Jake Luciani's work on Lucandra. I've successfully loaded
> nearly a million documents over a 3-node cluster, and initial query tests
> look promising.
>
> The problem is that our target use case has hundreds of millions of
> documents (each document is very small, however). Loading time will be an
> important factor. I've investigated using the BinaryMemtable interface
> (as found in contrib/bmt_example) to speed up bulk insertion. I have a
> prototype up that successfully inserts data using BMT, but there is a
> problem.
>
> If I perform multiple writes for the same row key & column family, the
> row ends up containing only one of the writes. I'm guessing this is
> because with BMT I need to group all writes for a given row key & column
> family into one operation, rather than doing them incrementally as is
> possible with the Thrift interface. Hadoop is the obvious tool for doing
> such a grouping. Unfortunately, we can't run such a job over our entire
> dataset at once; we will need to do it in increments.
>
> So my question is: if I properly flush every node after performing a
> larger bulk insert, can Cassandra merge multiple writes to a single row
> & column family when using the BMT interface? Or is BMT only feasible
> for loading data into rows that don't exist yet?
>
> Thanks in advance,
> Toby Jungen
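To make the grouping requirement concrete, here is a minimal Java sketch of the kind of batching Toby describes: collect every column destined for a given row key into one bucket, then issue a single binary mutation per row. The sendBinaryMutation helper is hypothetical, standing in for the RowMutation/message-building code in contrib/bmt_example; everything else uses only standard java.util collections.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BmtGroupingSketch {

    // One column write headed for the BinaryMemtable path.
    static final class ColumnWrite {
        final String rowKey;      // e.g. a term in the inverted index
        final String columnName;  // e.g. a document id
        final byte[] value;

        ColumnWrite(String rowKey, String columnName, byte[] value) {
            this.rowKey = rowKey;
            this.columnName = columnName;
            this.value = value;
        }
    }

    // BMT replaces rather than merges: a row split across two binary
    // mutations keeps only one of them. So bucket all writes by row key
    // and emit exactly one mutation per row.
    static Map<String, List<ColumnWrite>> groupByRow(List<ColumnWrite> writes) {
        Map<String, List<ColumnWrite>> byRow = new HashMap<>();
        for (ColumnWrite w : writes) {
            byRow.computeIfAbsent(w.rowKey, k -> new ArrayList<>()).add(w);
        }
        return byRow;
    }

    // Hypothetical stand-in for the contrib/bmt_example code that
    // serializes a ColumnFamily and ships it to each replica.
    static void sendBinaryMutation(String rowKey, List<ColumnWrite> columns) {
        System.out.println("row " + rowKey + ": " + columns.size()
                + " columns in one mutation");
    }

    public static void main(String[] args) {
        List<ColumnWrite> writes = new ArrayList<>();
        writes.add(new ColumnWrite("cassandra", "doc-17", new byte[0]));
        writes.add(new ColumnWrite("cassandra", "doc-42", new byte[0]));
        writes.add(new ColumnWrite("lucandra", "doc-17", new byte[0]));

        for (Map.Entry<String, List<ColumnWrite>> e : groupByRow(writes).entrySet()) {
            sendBinaryMutation(e.getKey(), e.getValue());
        }
    }
}

Run within a Hadoop job, the reduce phase performs exactly this bucketing: the row key serves as the reduce key, so every column for a row arrives at one reducer and can be sent as a single mutation.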