Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTilQyLdlQvJSr5X6dtzr7-YVy_lmEkxeiZYjARXs@mail.gmail.com>
References: <AANLkTin5rbPSa91787aFu5pnlN6UG_Lj8wO-oCTH3Mme@mail.gmail.com>
	<AANLkTilQyLdlQvJSr5X6dtzr7-YVy_lmEkxeiZYjARXs@mail.gmail.com>
From: =?UTF-8?Q?Utku_Can_Top=C3=A7u?= <utku@topcu.gen.tr>
Date: Wed, 12 May 2010 17:54:18 +0200
Message-ID: <AANLkTin_AxI5qkfs45s-h3Kc5i-NcpcwSwkTn5SuwWd1@mail.gmail.com>
Subject: Re: Real-time Web Analysis tool using Cassandra. Doubts...
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016367facf3c7ed4c048667a8a4

--0016367facf3c7ed4c048667a8a4
Content-Type: text/plain; charset=UTF-8

What makes Cassandra a poor choice is that you can't use a key range as
input for the map phase of a Hadoop job.


On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati
> <paulogpoiati@gmail.com> wrote:
> > - First of all, my first thought is to have two CFs: one for raw client
> > requests (~10 million+ per day) and another for metrics aggregated over
> > defined time intervals such as 1min, 5min, 15min... Is this a good approach?
>
> Sure.
>
> > - Is it a good idea to use an OrderPreservingPartitioner to maintain the
> > order of my requests in the raw data CF? Or is the overhead too big?
>
> The problem with OPP isn't overhead (it is lower-overhead than RP) but
> the tendency to have hotspots in sequentially-written data.
>
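
To illustrate the hotspot point above, here is a rough sketch in plain
Python, not Cassandra code. The three-node layout, the range boundaries,
and the "md5 mod node count" placement are assumptions made up for the
example; the real RandomPartitioner places MD5 tokens on a ring rather
than taking a modulus.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

NODES = 3  # hypothetical three-node cluster
# Hypothetical range boundaries for an order-preserving layout:
# node 0 owns keys below "2010-05-01", node 1 keys below "2010-06-01",
# node 2 everything after that.
BOUNDARIES = ["2010-05-01", "2010-06-01"]

def order_preserving_node(key: str) -> int:
    """Place a key by its sort position among the range boundaries."""
    return bisect_left(BOUNDARIES, key)

def hashed_node(key: str) -> int:
    """Place a key by its MD5 hash (simplified stand-in for RP)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NODES

# One hour of sequentially written, timestamp-like row keys.
keys = [f"2010-05-12T17:{m:02d}:{s:02d}" for m in range(60) for s in range(60)]

opp_load = Counter(order_preserving_node(k) for k in keys)
hash_load = Counter(hashed_node(k) for k in keys)

print("order-preserving:", dict(opp_load))
print("hashed:", dict(hash_load))
```

With sequential keys, every write in the order-preserving layout lands on
the node that owns the current time range, while the hashed layout spreads
the same keys over all nodes.
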
> > - Initially the cluster will contain only three nodes. Is that a problem
> > (too few, maybe)?
>
> You'll have to do some load testing to see.
>
> > - I think the best way to do the aggregation job is through a Hadoop
> > MapReduce job. Right? Is there any other way to consider?
>
> Map/Reduce is usually better than rolling your own because it
> parallelizes for you.
>
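
For the aggregation job discussed above, a minimal map/reduce-style sketch
in plain Python might look like the following. It is not Hadoop code and
does not read from Cassandra. The function names, the bucket sizes (taken
from the 1min/5min/15min intervals mentioned earlier), and the sample
timestamps are all assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

# Interval sizes in seconds: 1min, 5min, 15min (from the question above).
BUCKET_SECONDS = (60, 300, 900)

def map_request(timestamp: int):
    """Map phase: emit ((bucket_size, bucket_start), 1) for each interval."""
    for size in BUCKET_SECONDS:
        yield (size, timestamp - timestamp % size), 1

def reduce_counts(pairs):
    """Reduce phase: sum the counts emitted for each bucket key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical raw request timestamps, in seconds since some epoch.
raw_requests = [0, 30, 59, 60, 299, 300, 900]

aggregated = reduce_counts(
    chain.from_iterable(map_request(t) for t in raw_requests)
)

for (size, start), count in sorted(aggregated.items()):
    print(f"{size:>4}s bucket starting at {start:>4}: {count} request(s)")
```

Each mapper call is independent of the others, which is what lets a
framework like Hadoop run the map phase in parallel across the raw data.
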
> > - Is Cassandra really suitable for this? Maybe HBase is better in this
> > case?
>
> Nothing here makes me think "Cassandra is a poor choice."
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

--0016367facf3c7ed4c048667a8a4--