cassandra-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Cassandra Wiki] Update of "HadoopSupport" by jeremyhanna
Date Mon, 25 Jul 2011 19:42:59 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=35&rev2=36

Comment:
Removing the old troubleshooting tip about the pre-0.6.2 connection leak and adding remarks about range scans and consistency level (CL).

   * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data (see the sketch after this list).  This is specified either in your Hadoop configuration or via `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize`.
   * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it is `RpcTimeoutInMillis` in `storage-conf.xml`).  This timeout applies to communication between nodes, not between the client and the cluster, and can be increased to reduce the chance of timing out.
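As a rough sketch of the Hadoop-side part of this, the batch size can be lowered from job setup code; the class name, job name, and the value of 1000 rows below are illustrative, and `rpc_timeout_in_ms` itself is a server-side setting that cannot be changed from the job:

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.cassandra.hadoop.ConfigHelper;

public class RangeBatchSizeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical job setup; the job name is illustrative.
        Job job = new Job(new Configuration(), "cassandra-hadoop-example");

        // Fetch at most 1000 rows per range request (default is 4096);
        // lower this further if large rows cause range slices to time out.
        ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);

        // Note: rpc_timeout_in_ms lives in cassandra.yaml
        // (RpcTimeoutInMillis in storage-conf.xml on 0.6) and is not
        // adjustable from the Hadoop job configuration.
    }
}
}}}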
  
+ If you are seeing inconsistent data coming back, consider the consistency level at which you are reading and writing.  The two relevant properties, set in your Hadoop configuration (see the sketch below), are:
+  * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
+  * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ Also note that the Hadoop integration uses range scans underneath, which do not trigger read repair.  However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read.  See the ReadRepair section as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.
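A minimal sketch of raising both levels to QUORUM from job setup code, assuming the two properties accept the !ConsistencyLevel name as a plain string (the class and job name are illustrative):

{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConsistencyLevelSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cassandra-hadoop-example");
        Configuration conf = job.getConfiguration();

        // Assumption: the properties take the ConsistencyLevel name as a string.
        // Reading at QUORUM reconciles differences among the replicas read.
        conf.set("cassandra.consistencylevel.read", "QUORUM");
        conf.set("cassandra.consistencylevel.write", "QUORUM");
    }
}
}}}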
- Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail: connections are not released properly.  Depending on your local setup you may hit this issue, and you can work around it by raising the limit of open file descriptors for the process (e.g. in Linux/bash using `ulimit -n 32000`).  The error will be reported on the Hadoop job side as a Thrift !TimedOutException.
- 
- If you are testing the integration against a single node and you see some failures, this may be normal: you are probably overloading the single machine, which can again result in timeout errors.  You can work around it by reducing the number of concurrent tasks:
- 
- {{{
-              Configuration conf = job.getConfiguration();
-              conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size, in rows, of the batch you are reading from Cassandra:
- 
- {{{
-              ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}
  
  [[#Top|Top]]
  
