Date: Sat, 8 May 2010 00:17:20 -0500
Subject: Re: BinaryMemtable and collisions
From: Tobias Jungen <tobias.jungen@gmail.com>
To: user@cassandra.apache.org

Without going into too much depth: our retrieval model is more structured than standard Lucene retrieval, and I'm trying to leverage that structure. Some of the terms we're going to retrieve against occur very frequently, and because of that I'm worried about getting killed by processing large term vectors. Instead I'm trying to index on term relationships, if that makes sense.

On Sat, May 8, 2010 at 12:09 AM, Jake Luciani <jakers@gmail.com> wrote:
> Any reason why you aren't using Lucandra directly?
>
> On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen <tobias.jungen@gmail.com> wrote:
>> Greetings,
>>
>> I started getting my feet wet with Cassandra in earnest this week. I'm
>> building a custom inverted index of sorts on top of Cassandra, in part
>> inspired by Jake Luciani's work on Lucandra. I've successfully loaded
>> nearly a million documents onto a 3-node cluster, and initial query tests
>> look promising.
>>
>> The problem is that our target use case has hundreds of millions of
>> documents (though each document is very small), so loading time will be
>> an important factor. I've investigated the BinaryMemtable interface (as
>> found in contrib/bmt_example) to speed up bulk insertion, and I have a
>> prototype that successfully inserts data using BMT, but there is a
>> problem.
>>
>> If I perform multiple writes for the same row key & column family, the
>> row ends up containing only one of the writes. I'm guessing this is
>> because with BMT I need to group all writes for a given row key & column
>> family into one operation, rather than applying them incrementally as is
>> possible with the Thrift interface. Hadoop is the obvious tool for doing
>> such a grouping. Unfortunately, we can't run that process over our
>> entire dataset at once; we will need to load in increments.
>>
>> So my question is: if I properly flush every node after performing a
>> larger bulk insert, can Cassandra merge multiple writes to a single row &
>> column family when using the BMT interface? Or is BMT only feasible for
>> loading rows that don't exist yet?
>>
>> Thanks in advance,
>> Toby Jungen
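[Editor's note: the grouping constraint described above — collapsing incremental writes into a single operation per row key before submission — can be sketched as below. This is a minimal illustration with hypothetical names, not Cassandra's actual BMT API; only the group-by-row-key logic is shown.]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: BMT expects all columns for a (row key, column family) to arrive
// in one operation, so a stream of incremental writes must first be
// collapsed into one column map per row. Names here are illustrative.
public class BmtRowGrouper {

    // One pending write: a column name/value destined for some row.
    record Write(String rowKey, String columnName, byte[] value) {}

    // Collapse incremental writes into a single column map per row key.
    // For duplicate column names, the last write wins.
    static Map<String, Map<String, byte[]>> groupByRow(List<Write> writes) {
        Map<String, Map<String, byte[]>> rows = new LinkedHashMap<>();
        for (Write w : writes) {
            rows.computeIfAbsent(w.rowKey(), k -> new LinkedHashMap<>())
                .put(w.columnName(), w.value());
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Write> writes = List.of(
            new Write("doc1", "term:cassandra", "1".getBytes()),
            new Write("doc1", "term:lucene", "1".getBytes()),
            new Write("doc2", "term:index", "1".getBytes()));
        Map<String, Map<String, byte[]>> rows = groupByRow(writes);
        // Each map entry now represents one submission for that row,
        // rather than three separate incremental writes.
        System.out.println(rows.keySet());           // prints [doc1, doc2]
        System.out.println(rows.get("doc1").size()); // prints 2
    }
}
```

In a real pipeline this grouping is what a Hadoop reduce phase provides for free (shuffle by row key); the open question in the thread is whether flushing between incremental batches lets Cassandra itself merge rows written in separate BMT operations.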
