cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "platon.tema" <platon.t...@yandex.ru>
Subject Re: Direct IO with Spark and Hadoop over Cassandra
Date Tue, 16 Sep 2014 13:19:08 GMT
Yes, updates and deletes is trouble. At the moment for updates 
collection we refresh result data by query to C* (java driver) before 
reporting to user. For deletes we can skip it during scanning by TTL for 
example (not tested yet).

On 09/16/2014 04:53 PM, moshe.kranc@barclays.com wrote:
>
> You will also have to read/resolve multiple row instances (if you 
> update records) and tombstones (if you delete records) yourself.
>
> *From:*platon.tema [mailto:platon.tema@yandex.ru]
> *Sent:* Tuesday, September 16, 2014 1:51 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Direct IO with Spark and Hadoop over Cassandra
>
> Thanks.
>
> But 1) overcomes with C* API for commitlog and memtables or with mixed 
> access (direct IO + traditional connectors or pure CQL if data model 
> allows, we experimented with it).
>
> 2) is more complex for universal solution. In our case C* uses without 
> replication (RF=1) because of huge data size (replication too expensive).
>
> On 09/16/2014 03:40 PM, DuyHai Doan wrote:
>
>     If you access directly the C* sstables from those frameworks, you
>     will:
>
>     1) miss live data which are in memory and not dumped yet to disk
>
>     2) skip the Dynamo layer of C* responsible for data consistency
>
>     Le 16 sept. 2014 10:58, "platon.tema" <platon.tema@yandex.ru
>     <mailto:platon.tema@yandex.ru>> a écrit :
>
>     Hi.
>
>     As I see massive data processing tools (map\reduce) with C* data
>     include
>
>     connectors
>     - Calliope http://tuplejump.github.io/calliope/
>     - Datastax spark cassandra connector
>     https://github.com/datastax/spark-cassandra-connector
>     - Startio Deep https://github.com/Stratio/stratio-deep
>     - other free\commercial
>
>     runtime (job management and infrastructure)
>     - Spark
>     - Hadoop
>
>     But if I'm not mistaken all these solutions use network for data
>     loading. In best case logic instance (some "job") run on the same
>     node (wherethe corresponding range was found).
>
>     Why this logic can`t use direct C* IO (sstable reading from disk)?
>     Any cons ?
>
>     Some time ago i read article (still can't find it) about
>     academical research within Hadoop was modified to support this
>     direct IO mode. According to that benchmarks direct IOgave a
>     significant performance increase.
>
> _______________________________________________
>
> This message is for information purposes only, it is not a 
> recommendation, advice, offer or solicitation to buy or sell a product 
> or service nor an official confirmation of any transaction. It is 
> directed at persons who are professionals and is not intended for 
> retail customer use. Intended for recipient only. This message is 
> subject to the terms at: www.barclays.com/emaildisclaimer 
> <http://www.barclays.com/emaildisclaimer>.
>
> For important disclosures, please see: 
> www.barclays.com/salesandtradingdisclaimer 
> <http://www.barclays.com/salesandtradingdisclaimer> regarding market 
> commentary from Barclays Sales and/or Trading, who are active market 
> participants; and in respect of Barclays Research, including 
> disclosures relating to specific issuers, please see 
> http://publicresearch.barclays.com.
>
> _______________________________________________
>


Mime
View raw message