incubator-cassandra-user mailing list archives

From Aaron Morton <aa...@thelastpickle.com>
Subject Re: Hadoop/Cassandra for data transformation (rather than analysis)?
Date Mon, 12 Aug 2013 21:17:27 GMT
> As I do not have billions of input records (but a max of 10 million), the added benefit
of scaling out the per-line processing is probably not worth the additional setup and operations
effort of Hadoop.
I would start with a regular app and then go to Hadoop if needed, assuming you are only dealing
with a few MBs of data.
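
As a rough illustration of the "regular app" route, here is a minimal sketch of a single-process, one-in/one-out export. It assumes the articles are already being read from Cassandra and handed over as an iterator; the class name ArticleExport, the Article fields, and the CSV layout are all hypothetical and not taken from this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;

// Hypothetical sketch: all names and fields here are illustrative assumptions.
class ArticleExport {

    // Assumed article shape: one entry of a product's wide row.
    static final class Article {
        final String productId;
        final String articleId;
        final String description;

        Article(String productId, String articleId, String description) {
            this.productId = productId;
            this.articleId = articleId;
            this.description = description;
        }
    }

    // Quote a CSV field only when it needs it; embedded quotes are doubled.
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuoting = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuoting) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // One article in, one line out; no state is shared between articles.
    static String toCsvLine(Article a) {
        return csvField(a.productId) + ","
                + csvField(a.articleId) + ","
                + csvField(a.description);
    }

    // Streams the articles to the writer so only one is held at a time.
    // Returns the number of lines written.
    static long export(Iterator<Article> articles, Writer out) {
        long count = 0;
        try {
            while (articles.hasNext()) {
                out.write(toCsvLine(articles.next()));
                out.write('\n');
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Because each line depends only on its own article, the same toCsvLine step could later be moved into a map task unchanged if the single process turns out to be too slow.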

There can be a significant human startup cost to bringing Hadoop into a C* setup. I recommend
using http://www.datastax.com/what-we-offer/datastax-enterprise in a development environment
to see if it's something you want to do (requires a licence to run in prod).

Cheers


 
-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/08/2013, at 8:15 PM, Jan Algermissen <jan.algermissen@nordsc.com> wrote:

> Hi,
> 
> I have a specific use case to address with Cassandra and I can't get my head around whether
using Hadoop on top creates any significant benefit or not.
> 
> Situation:
> 
> I have product data and each product 'contains' a number of articles (<100 / product),
representing individual colors/sizes etc.
> 
> My plan is to store each product in Cassandra as a wide row, containing all the articles
per product. I chose this design because sometimes I need to work with all the articles in
a product and sometimes I just need to pick one of them per product.
> 
> My understanding is that picking a certain 'row' from all the 'rows' in a wide row is
nice (because it works on a per-row basis) and that any other approach would require a scan
over essentially all the rows (not good).
> 
> So, after selecting one, some, or all of the 'rows' (articles) from every single wide
row (product), the input to my data processing is essentially a bunch of articles.
> 
> The final output of the overall processing will be an export file (XML or CSV) containing
one line (or element) per article. There is no 'cross article' analysis going on; it is really
sort of one-in/one-out.
> 
> I am looking at Hadoop because I see MapReduce as a nice fit given the independence of
the per-article transformation into an output 'line'.
> 
> What I am worried about is whether Hadoop will actually give me a real benefit: while
there will be processing (mostly string operations) going on to create lines from articles,
the output still needs to be pulled over the wire to some place to create the single output
file.
> 
> I wonder whether it would not work equally well to pull the necessary data per article
from Cassandra and create the output file in a single process (in my case a Java web app). As
I do not have billions of input records (but a max of 10 million), the added benefit of scaling
out the per-line processing is probably not worth the additional setup and operations effort
of Hadoop.
> 
> Any idea how I could make a judgement call here?
> 
> Another question: I read in a C* 1.1 related slidedeck that Hadoop output to CFS is only
possible with DSE and not with DSC - that with DSC the Hadoop output would be HDFS. Is that
correct?  For homogeneity, I would certainly want to store the output files in CFS, too.
> 
> Sorry that this was a bit of a longer question/explanation.
> 
> Jan

