incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Data modelling for range retrieval. Was: Re: Hadoop/Cassandra for data transformation (rather than analysis)?
Date Thu, 15 Aug 2013 02:58:49 GMT
> Is it good practice then to find an attribute in my data that would allow me to form wide-row row keys with approx. 1000 values each?
You can do that using get_range_slices() via Thrift.
Via CQL 3 you use the token() function and LIMIT with a SELECT statement. Check the DataStax docs for more info (sorry, I don't have net access right now).
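For example, something like this (just a sketch, the table and column names are invented):

    -- first page
    SELECT brand, product_id FROM products LIMIT 1000;
    -- next page: restart from the token of the last key seen
    SELECT brand, product_id FROM products
     WHERE token(brand) > token('<last brand seen>') LIMIT 1000;

Note this walks partitions in token order, not lexical order, which is fine for a full scan.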

> My data items are products. Would it then make sense to, for example, try to use the
brand as a row key, storing all products of one brand in a single wide row? Then I could retrieve
the products by brand in ranges (not of equal size, but that is probably ok).
That makes sense. 
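In CQL 3 terms that would look something like this (just a sketch, the column names are assumptions):

    CREATE TABLE products (
        brand      text,
        product_id text,
        data       text,
        PRIMARY KEY (brand, product_id)
    );

Here brand is the partition (storage row) key and product_id is a clustering column, so all products of a brand sit in one wide row.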
If you are comfortable with Java, look at the org.apache.cassandra.hadoop.ColumnFamilyRecordReader$WideRowIterator class in the code base. This is how the Hadoop client deals with wide rows.

> If there are too many products in one brand, I should partition the brands using a certain number of brand name prefix characters (e.g. AAAA-GGGG, HAAA to …).
You should be OK with rows into the tens of MB.
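If a single row does get too big, another option is to add a bucket to the partition key so one brand is split across a few physical rows. Just a sketch, the bucketing scheme is made up:

    CREATE TABLE products_bucketed (
        brand      text,
        bucket     int,   -- e.g. hash(product_id) % 4 to split a huge brand
        product_id text,
        data       text,
        PRIMARY KEY ((brand, bucket), product_id)
    );

The client then reads all buckets for a brand and merges the results.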

Hope that helps. 

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/08/2013, at 5:32 PM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:

> Aaron,
> 
> On 12.08.2013, at 23:17, Aaron Morton <aaron@thelastpickle.com> wrote:
> 
>>> As I do not have billions of input records (but a max of 10 million) the added benefit of scaling out the per-line processing is probably not worth the additional setup and operations effort of Hadoop.
>> I would start with a regular app and then go to Hadoop if needed, assuming you are only dealing with a few MB of data.
>> 
>> There can be a significant human startup cost to bringing Hadoop into a C* setup. I recommend using http://www.datastax.com/what-we-offer/datastax-enterprise in a development environment to see if it's something you want to do (requires a licence to run in prod).
> 
> Thanks, that was my gut feeling. Thanks for the hint.
> 
> This raises the next question:
> 
> Since my use case is to iterate over all my keys, I would like to retrieve the rows in ranges, say 1000 per query, to parallelize processing in the client.
> 
> Is it good practice then to find an attribute in my data that would allow me to form wide-row row keys with approx. 1000 values each?
> 
> My data items are products. Would it then make sense to, for example, try to use the
brand as a row key, storing all products of one brand in a single wide row? Then I could retrieve
the products by brand in ranges (not of equal size, but that is probably ok).
> 
> If there are too many products in one brand, I should partition the brands using a certain number of brand name prefix characters (e.g. AAAA-GGGG, HAAA to ...).
> 
> Makes sense?
> 
> Jan
> 
>> 
>> Cheers
>> 
>> 
>> 
>> -----------------
>> Aaron Morton
>> Cassandra Consultant
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 10/08/2013, at 8:15 PM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:
>> 
>>> Hi,
>>> 
>>> I have a specific use case to address with Cassandra and I can't get my head
around whether using Hadoop on top creates any significant benefit or not.
>>> 
>>> Situation:
>>> 
>>> I have product data and each product 'contains' a number of articles (<100 per product), representing individual colors/sizes etc.
>>> 
>>> My plan is to store each product in Cassandra as a wide row, containing all the articles per product. I chose this design because sometimes I need to work with all the articles in a product and sometimes I just need to pick one of them per product.
>>> 
>>> My understanding is that picking a certain 'row' from all the 'rows' in a wide row is efficient (because it works within a single row) and that any other approach would require a scan over essentially all the rows (not good).
>>> 
>>> So, after selecting one or some or all of the 'rows' (articles) from every single wide row (product), the input to my data processing is essentially a bunch of articles.
>>> 
>>> The final output of the overall processing will be an export file (XML or CSV) containing one line (or element) per article. There is no 'cross-article' analysis going on; it is really sort of one-in/one-out.
>>> 
>>> I am looking at Hadoop because I see MapReduce as a nice fit, given the independence of the per-article transformation into an output 'line'.
>>> 
>>> What I am worried about is whether Hadoop will actually give me a real benefit: while there will be processing (mostly string operations) going on to create lines from articles, the output still needs to be pulled over the wire to some place to create the single output file.
>>> 
>>> I wonder whether it would not work equally well to pull the necessary data from Cassandra per article and create the output file in a single process (in my case a Java web app). As I do not have billions of input records (but a max of 10 million) the added benefit of scaling out the per-line processing is probably not worth the additional setup and operations effort of Hadoop.
>>> 
>>> Any idea how I could make a judgement call here?
>>> 
>>> Another question: I read in a C* 1.1 related slide deck that Hadoop output to CFS is only possible with DSE and not with DSC - that with DSC the Hadoop output would be HDFS. Is that correct? For homogeneity, I would certainly want to store the output files in CFS, too.
>>> 
>>> Sorry, that this was a bit of a longer question/explanation.
>>> 
>>> Jan
>>> 
>> 
> 

