cassandra-user mailing list archives

From murat migdisoglu <>
Subject cassandra-hadoop mapper
Date Thu, 31 May 2012 08:59:28 GMT

I'm working on some use cases to understand how cassandra-hadoop
integration works.

I have a very basic scenario: a column family that keeps the session
id and some BSON data containing the username in two separate columns. I
want to go through all rows and dump a row to a file when its username
matches a certain criterion. I don't need a Reducer or Combiner
for now.
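
To make the scenario concrete, here is a minimal plain-Java sketch of the filter I have in mind, with no Hadoop or Cassandra types involved. The class name, the method name, and the sample session ids are made up for illustration only; in the real job the username is decoded from the BSON column rather than read from a map value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the map-only job: rows are modeled as
// sessionId -> username (null when the username could not be decoded).
public class UsernameFilterSketch {

    // Returns the session ids whose username equals the wanted one.
    public static List<String> matchingSessions(Map<String, String> usernameBySession,
                                                String wanted) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> row : usernameBySession.entrySet()) {
            String username = row.getValue();
            // Same null-safe comparison as in the mapper.
            if (username != null && username.equals(wanted)) {
                matches.add(row.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("session-1", "alice");
        rows.put("session-2", "bob");
        rows.put("session-3", null);
        rows.put("session-4", "alice");
        System.out.println(matchingSessions(rows, "alice"));
    }
}
```

Every row is inspected exactly once and only the matching ones are emitted, which is the behaviour I expect from the mapper as well.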

I've written the following very simple Hadoop job, and I see from the
logs that my map function is called once for each row. Is that normal? If
so, a search like this over a big dataset would take
hours if not days. Besides that, I see many small output files being
created on HDFS.

I guess I need a better understanding of how exactly the job is split
into tasks.

    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
    throws IOException, InterruptedException
        String rowkey = ByteBufferUtil.string(key);
        String ip = context.getConfiguration().
        IColumn column = columns.get(sourceColumn);
        if (column == null)
        ByteBuffer byteBuffer = column.value();
        ByteBuffer bb2 = byteBuffer.duplicate();

        DataConvertor convertor= fromBson(byteBuffer,
        String username= convertor.getUsername();
        BytesWritable value = new BytesWritable();
        if (username != null && username.equals(cip)) {
            byte[] arr = convertToByteArray(bb2);
            value.set(new BytesWritable(arr));
            Text tkey = new Text(rowkey);
            context.write( tkey, value);
        } else {
  "ip not match [" + ip + "]");

Thanks in advance
Kind Regards

"Find a job you enjoy, and you'll never work a day in your life."
