cassandra-user mailing list archives

From Shawna Qian <>
Subject SSTableWriter to hdfs
Date Thu, 10 May 2012 13:13:01 GMT

Can I use SSTableSimpleUnsortedWriter to write the data directly to HDFS, or do I have to use hdfs copyFromLocal
to copy the sstable files from local disk to HDFS after they get generated?


Sent from my iPhone

On May 7, 2012, at 3:48 AM, "aaron morton" <> wrote:

Can you copy the sstables as a task after the load operation? You should know where the files are.

There are multiple files that may be created by the writer during the loading process. So running
code that performs a long-running action will impact the time taken to pump data through
the SSTableSimpleUnsortedWriter.
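One way around that concern is to keep the flush path fast by handing the copy off to a background executor. A minimal sketch, with hypothetical names (AsyncSSTableCopy, copyInBackground) standing in for the real HDFS copy call:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncSSTableCopy {
    // Hypothetical stand-in: a real implementation would perform the HDFS copy
    // (e.g. via Hadoop's FileSystem API) inside the submitted task.
    static Future<String> copyInBackground(ExecutorService copier, String sstablePath) {
        return copier.submit(() -> "copied:" + sstablePath);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService copier = Executors.newSingleThreadExecutor();
        // The flushing thread returns immediately; the copy runs in the background
        // and does not slow down the writer pumping data.
        Future<String> result = copyInBackground(copier, "/tmp/ks-cf-1-Data.db");
        System.out.println(result.get()); // blocks only here, for the demo
        copier.shutdown();
    }
}
```

The writer thread never waits on the copy; only a caller that explicitly asks for the `Future` result blocks.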

wrt the patch, the best place to start the conversation for this is on <>

Thanks for taking the time to look into this.


Aaron Morton
Freelance Developer

On 3/05/2012, at 11:40 PM, Benoit Perroud wrote:

Hi All,

I'm bulk loading (a lot of) data from Hadoop into Cassandra 1.0.x. The
provided CFOutputFormat is not the best fit here, so I wanted to use the
bulk loading feature. I know 1.1 comes with a BulkOutputFormat, but I
wanted to propose a simple enhancement to SSTableSimpleUnsortedWriter
that could ease life:

When the table is flushed to disk, it could be interesting to
have listeners that are triggered to perform an arbitrary action (copying
the new sstable into HDFS, for instance).
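The idea can be sketched in isolation with stand-in types (FlushListener and ListenableWriter are hypothetical names; the actual patch hooks SSTableSimpleUnsortedWriter itself):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed SSTableWriterListener.
interface FlushListener {
    void onFlushed(String filename);
}

// Hypothetical stand-in for a writer that notifies listeners on flush.
class ListenableWriter {
    private final List<FlushListener> listeners = new ArrayList<>();

    void addListener(FlushListener l) {
        listeners.add(l);
    }

    // Called whenever a buffer is written out as a new sstable file.
    void flush(String filename) {
        for (FlushListener l : listeners)
            l.onFlushed(filename); // e.g. kick off a copy to HDFS here
    }
}

public class ListenerDemo {
    public static void main(String[] args) {
        ListenableWriter w = new ListenableWriter();
        List<String> copied = new ArrayList<>();
        w.addListener(copied::add);
        w.flush("ks-cf-1-Data.db");
        System.out.println(copied); // the listener saw the flushed file
    }
}
```

The writer stays unaware of what listeners do, so the same hook serves an HDFS copy, a streaming trigger, or plain logging.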

Please have a look at the patch below to get a better idea. Do you
think it would be worthwhile to open a JIRA for this?

Regarding the 1.1 BulkOutputFormat and bulk loading in general, the work done to
have a light client stream into the cluster is really great. The
issue now is that data is streamed only at the end of the task. This
causes all the tasks to store their data locally and stream everything
at the end. Lots of temporary space may be needed, and a lot of
bandwidth to the nodes is used at the "same" time. With the listener,
we would be able to start streaming as soon as the first table is
created. That way the streaming bandwidth could be better balanced.
A JIRA for this also?



--- a/src/java/org/apache/cassandra/io/sstable/
+++ b/src/java/org/apache/cassandra/io/sstable/
@@ -21,6 +21,8 @@ package org.apache.cassandra.io.sstable;
 import java.nio.ByteBuffer;
+import java.util.LinkedList;
+import java.util.List;
 import java.util.Map;
 import java.util.TreeMap;

@@ -47,6 +49,8 @@ public class SSTableSimpleUnsortedWriter extends
     private final long bufferSize;
     private long currentSize;

+    private final List<SSTableWriterListener> sSTableWrittenListeners = new LinkedList<SSTableWriterListener>();
+
     /**
      * Create a new buffering writer.
      * @param directory the directory where to write the sstables
@@ -123,5 +127,16 @@ public class SSTableSimpleUnsortedWriter extends
         currentSize = 0;
+        // Notify the registered listeners
+        for (SSTableWriterListener listener : sSTableWrittenListeners)
+        {
+            listener.onSSTableWrittenAndClosed(..., writer.getColumnFamilyName(), writer.getFilename());
+        }
+    }
+
+    public void addSSTableWriterListener(SSTableWriterListener listener)
+    {
+        sSTableWrittenListeners.add(listener);
+    }
diff --git a/src/java/org/apache/cassandra/io/sstable/
new file mode 100644
index 0000000..6628d20
--- /dev/null
+++ b/src/java/org/apache/cassandra/io/sstable/
@@ -0,0 +1,9 @@
+import java.io.IOException;
+
+public interface SSTableWriterListener {
+    void onSSTableWrittenAndClosed(final String tableName, final String columnFamilyName, final String filename) throws IOException;
+}
