From: Utku Can Topçu
Date: Fri, 30 Apr 2010 16:08:32 +0200
Subject: Re: ColumnFamilyInputFormat KeyRange scans on a CF
To: user@cassandra.apache.org

Do you mean, running the get_range_slices from a single?

Yes, that would be reasonable for a relatively small key range. But when it
comes to analyzing a really big range in a really big data collection (like
the one we are currently populating), distributing the reads across the
cluster seems to be the only reasonable solution.

In the current situation, the best option might be to spread the range across
ColumnFamilies (say, one CF per day) and to empty each CF once its data has
been analyzed, so that it can be reused for another day's range.

Can you suggest a workaround for this?

On Fri, Apr 30, 2010 at 3:22 PM, Jonathan Ellis wrote:
> Sounds like doing this w/o m/r with get_range_slices is a reasonable way to
> go.
>
> On Thu, Apr 29, 2010 at 6:04 PM, Utku Can Topçu wrote:
> > I'm currently writing collected data continuously to Cassandra, with keys
> > that start with a timestamp followed by a unique identifier (like
> > 2009.01.01.00.00.00.RANDOM), so that I can query by time range.
> >
> > I'm thinking of running periodic MapReduce jobs that go through a
> > designated time period. I might want to analyze only the data between
> > 2009.01 and 2009.02.
> > I had done this previously with HBase; however, I thought Cassandra would
> > be a better choice for continuously storing data in a safe manner.
> >
> > I guess this briefly explains my intended use case.
> >
> > Best Regards,
> > Utku
> >
> > On Thu, Apr 29, 2010 at 11:32 PM, Jonathan Ellis wrote:
> >>
> >> It's technically possible but 0.6 does not support this, no.
> >>
> >> What is the use case you are thinking of?
> >>
> >> On Thu, Apr 29, 2010 at 11:14 AM, Utku Can Topçu wrote:
> >> > Hi,
> >> >
> >> > I've been trying to use Cassandra as a supplementary input
> >> > source for Hadoop MapReduce jobs.
> >> >
> >> > By default, ColumnFamilyInputFormat does a full ColumnFamily
> >> > scan and feeds it to the MapReduce framework as map input.
> >> >
> >> > However, I believe it should be possible to give it a key range so
> >> > that only the specified range is scanned.
> >> >
> >> > Is this possible by any means?
> >> >
> >> > Best Regards,
> >> >
> >> > Utku
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
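
The per-day split of a time range discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions: keys follow the `YYYY.MM.DD.HH.MM.SS.RANDOM` pattern from the thread, the cluster orders keys so that a lexicographic key range is meaningful, and `daily_key_ranges` is a hypothetical helper that is not part of Cassandra or Hadoop. It makes no Cassandra calls; it only computes the (start_key, end_key) pairs that separate workers might each pass to their own get_range_slices scan.

```python
from datetime import date, timedelta


def daily_key_ranges(start, end):
    """Split the half-open date interval [start, end) into per-day key bounds.

    Hypothetical helper (an assumption for illustration, not a Cassandra API).
    Each pair is (start_key, end_key), where both are date prefixes such as
    "2009.01.01". A full key like "2009.01.01.00.00.00.RANDOM" sorts at or
    after its own day's prefix and strictly before the next day's prefix.
    """
    ranges = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        ranges.append((day.strftime("%Y.%m.%d"), next_day.strftime("%Y.%m.%d")))
        day = next_day
    return ranges


if __name__ == "__main__":
    # One (start_key, end_key) pair per day of January 2009; each pair could
    # be handed to a different worker.
    for start_key, end_key in daily_key_ranges(date(2009, 1, 1), date(2009, 2, 1)):
        print(start_key, end_key)
```

Because every stored key carries the full timestamp and identifier suffix, no real key should equal a bare day prefix, so adjacent ranges would not be expected to overlap even if the end key is treated as inclusive.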