From: Utku Can Topçu
Date: Wed, 12 May 2010 17:54:18 +0200
Subject: Re: Real-time Web Analysis tool using Cassandra. Doubts...
To: user@cassandra.apache.org

What makes Cassandra a poor choice is the fact that you can't use a key
range as input for the map phase in Hadoop.

On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis wrote:

> On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati wrote:
> > - First of all, my first thought is to have two CFs: one for raw client
> > requests (~10 million+ per day) and another for aggregated metrics over
> > defined time intervals such as 1 min, 5 min, 15 min... Is this a good
> > approach?
>
> Sure.
>
> > - Is it a good idea to use an OrderPreservingPartitioner to maintain the
> > order of my requests in the raw data CF? Or is the overhead too big?
>
> The problem with OPP isn't overhead (it is lower-overhead than RP) but
> the tendency to have hotspots in sequentially-written data.
>
> > - Initially the cluster will contain only three nodes. Is that a problem
> > (too few, maybe)?
>
> You'll have to do some load testing to see.
>
> > - I think the best way to do the aggregation job is through a Hadoop
> > MapReduce job. Right? Is there any other way to consider?
>
> Map/Reduce is usually better than rolling your own because it
> parallelizes for you.
>
> > - Is Cassandra really suitable for this? Maybe HBase is better in this
> > case?
>
> Nothing here makes me think "Cassandra is a poor choice."
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com


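The hotspot remark about OPP in the quoted reply can be illustrated with a toy model. The sketch below is not Cassandra code: the split points, the key format, and the helper names `opp_node` and `rp_node` are all assumptions made for illustration. It only contrasts placing timestamp-prefixed keys by sort order with placing them by an MD5 hash over a three-node ring.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NODES = 3  # cluster size mentioned in the thread

# Hypothetical split points for an order-preserving layout: each node owns a
# contiguous slice of the sorted key space (here, thirds of one day).
OPP_BOUNDARIES = ["2010-05-12T08", "2010-05-12T16"]


def opp_node(key):
    """Order-preserving placement: the key's sort position picks the node."""
    return bisect_right(OPP_BOUNDARIES, key)


def rp_node(key):
    """Hash-based placement: MD5 the key, then split the 128-bit token
    space into NODES equal ranges."""
    token = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return (token * NODES) >> 128


# One minute of sequentially written request keys (assumed timestamp-prefixed).
keys = [
    "2010-05-12T15:54:%02d.%03d" % (second, seq)
    for second in range(60)
    for seq in range(10)
]

opp_counts = Counter(opp_node(k) for k in keys)
rp_counts = Counter(rp_node(k) for k in keys)

print("order-preserving:", dict(opp_counts))
print("hashed:          ", dict(rp_counts))
```

In this model every key written during that minute sorts into the same slice, so the order-preserving placement sends all 600 writes to a single node, while the hashed placement spreads them across the ring. The trade-off is that hashed keys no longer sit next to each other in request order.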