nutch-dev mailing list archives

From Doğacan Güney <doga...@gmail.com>
Subject Re: State of nutchbase
Date Tue, 08 Dec 2009 01:11:50 GMT
Hey everyone,

So I restarted the nutchbase effort by adding an abstraction over the hbase
api. The idea is to use an intermediate nutch api (which then talks to
hbase) instead of communicating with hbase directly. This lets us a) avoid
being completely tied to hbase, which makes a future move to another db
easier, and b) perhaps support multiple databases right away, with easy
data migration between them.

What I have is very very (VERY) early and extremely alpha, but I am quite
happy with the overall idea, so I am sharing it for suggestions and reviews.
Again, instead of using hbase directly, nutch will use a nice java bean with
getters and setters. Nutch will then figure out what to read from and write
to hbase.

I decided to use avro because it has a very clean design. Here is a very
basic WebTableRow definition:
{"namespace": "org.apache.nutch.storage",
 "protocol": "Web",

 "types": [
     {"name": "WebTableRow", "type": "record",
      "fields": [
          {"name": "rowKey", "type": "string"},
          {"name": "fetchTime", "type": "long"},
          {"name": "title", "type": "string"},
          {"name": "text", "type": "string"},
          {"name": "status", "type": "int"}
      ]
     }
 ]
}

(Ignore "protocol". I haven't yet figured out how to compile schemas without
protocols.)

I have copied and modified avro's SpecificCompiler to generate a java class.
It is mostly the same as avro's SpecificCompiler; however, the generated
variables are all private and are accessed through getters and setters. Here
is a portion of the generated file:

public class WebTableRow extends NutchTableRow<Utf8> implements SpecificRecord {
  @RowKey // these are used for reflection
  private Utf8 rowKey;
  @RowField
  private long fetchTime;
  @RowField
  private Utf8 title;
  @RowField
  private Utf8 text;
  @RowField
  private int status;
  public Utf8 getRowKey() { .... }
  public void setRowKey(Utf8 value) {....}
  public long getFetchTime() { .... }
  public void setFetchTime(long value) { .... }
  .....

Note that NutchTableRow extends SpecificRecordBase, so this is a proper avro
record. In the future, once hadoop MR supports avro as a serialization
format, NutchTableRow-s can easily be output from maps and reduces, which is
a nice bonus.

We need to force the use of setters instead of direct access to variables,
because one of the nice things about hbase is that you only need to update
the columns that you changed. However, to know which fields were updated
(and thus map them to hbase columns), we must keep track of what changed.
Currently, NutchTableRow keeps a BitSet covering all fields, and every
setter updates this BitSet, so we know exactly what changed.
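
As a rough illustration of the BitSet idea, here is a minimal sketch in plain
java. It is not the code from the repository: TrackedRow, PageRow, markDirty
and the field index constants are all hypothetical names, and String is used
in place of Utf8 to keep the example free of avro dependencies.

```java
import java.util.BitSet;

// Hypothetical stand-in for NutchTableRow: remembers which fields were set.
abstract class TrackedRow {
    private final BitSet dirty = new BitSet();

    // Called by every generated setter with that field's index.
    protected void markDirty(int fieldIndex) { dirty.set(fieldIndex); }

    public boolean isDirty(int fieldIndex) { return dirty.get(fieldIndex); }

    // Returns a copy so callers cannot modify the tracking state.
    public BitSet dirtyFields() { return (BitSet) dirty.clone(); }

    // A serializer would call this after a successful write.
    public void clearDirty() { dirty.clear(); }
}

// Hypothetical generated bean; fields are private so setters cannot be bypassed.
class PageRow extends TrackedRow {
    static final int FETCH_TIME = 0;
    static final int TITLE = 1;
    static final int STATUS = 2;

    private long fetchTime;
    private String title;
    private int status;

    public long getFetchTime() { return fetchTime; }
    public void setFetchTime(long value) { fetchTime = value; markDirty(FETCH_TIME); }

    public String getTitle() { return title; }
    public void setTitle(String value) { title = value; markDirty(TITLE); }

    public int getStatus() { return status; }
    public void setStatus(int value) { status = value; markDirty(STATUS); }
}

class DirtyTrackingDemo {
    public static void main(String[] args) {
        PageRow row = new PageRow();
        row.setTitle("Example");
        // Only the title bit is set, so only that column would need a write.
        System.out.println(row.dirtyFields()); // prints {1}
    }
}
```

A serializer can then walk the set bits and write only the matching columns,
which is the reason direct field access has to be ruled out.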

There is also a new interface called NutchSerializer that defines readRow
and writeRow methods (it also needs scans, row deletes, etc., but that's for
later). Currently HbaseSerializer implements NutchSerializer and reads and
writes WebTableRow-s. HbaseSerializer currently works via reflection. It
should be easy to add code generation to our SpecificCompiler so that it
also outputs a WebTableRowHbaseSerializer along with WebTableRow, instead of
using reflection.
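
To make the reflection approach more concrete, here is a small sketch that
uses annotations to find the key and the fields of a bean. The real readRow
and writeRow signatures are not given in this mail, so the RowSerializer
interface below is an assumption, as are SimpleRow and InMemorySerializer;
a HashMap stands in for hbase so the example runs on its own.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RowKey {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RowField {}

// Assumed shape of the serializer interface (hypothetical signatures).
interface RowSerializer<R> {
    void writeRow(R row);
    R readRow(String key);
}

// Hypothetical bean, annotated the same way as the generated class.
class SimpleRow {
    @RowKey private String rowKey;
    @RowField private long fetchTime;
    @RowField private String title;

    SimpleRow() {}
    SimpleRow(String rowKey, long fetchTime, String title) {
        this.rowKey = rowKey; this.fetchTime = fetchTime; this.title = title;
    }
    public String getRowKey() { return rowKey; }
    public long getFetchTime() { return fetchTime; }
    public String getTitle() { return title; }
}

// In-memory stand-in for a database: key -> (field name -> value).
class InMemorySerializer implements RowSerializer<SimpleRow> {
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    @Override
    public void writeRow(SimpleRow row) {
        try {
            String key = null;
            Map<String, Object> columns = new HashMap<>();
            for (Field f : SimpleRow.class.getDeclaredFields()) {
                f.setAccessible(true); // fields are private
                if (f.isAnnotationPresent(RowKey.class)) {
                    key = (String) f.get(row);
                } else if (f.isAnnotationPresent(RowField.class)) {
                    columns.put(f.getName(), f.get(row));
                }
            }
            if (key == null) throw new IllegalArgumentException("row has no key");
            store.put(key, columns);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public SimpleRow readRow(String key) {
        Map<String, Object> columns = store.get(key);
        if (columns == null) return null; // no such row
        try {
            SimpleRow row = new SimpleRow();
            for (Field f : SimpleRow.class.getDeclaredFields()) {
                f.setAccessible(true);
                if (f.isAnnotationPresent(RowKey.class)) {
                    f.set(row, key);
                } else if (f.isAnnotationPresent(RowField.class)) {
                    f.set(row, columns.get(f.getName())); // unboxes primitives
                }
            }
            return row;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A generated per-class serializer would replace the two reflection loops with
direct getter and setter calls, which is the motivation for adding code
generation later.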

What I have currently can read and write primitive types + strings to and
from hbase. You can check it out from github.com/dogacan/nutchbase (branch
master, package o.a.n.storage). Again, I would like to note that the code is
very very alpha and is not in good shape, but it should be a good starting
point if you are interested.

Once hbase support is solid, I intend to add support for other databases
(bdb, cassandra and sql come to mind). If I got everything right, then
moving data from one database to another should be a trivial task. So you
can start with, say, bdb, then switch over to hbase once your data gets
large.

Oh, I forgot... HbaseSerializer reads an hbase-mapping.xml file that
describes the mapping between fields and hbase columns:

<table name="webtable" class="org.apache.nutch.storage.WebTableRow">
  <description>
    <!-- Families can also have params like compression, bloom filters -->
    <family name="p"/>
    <family name="f"/>
  </description>
  <fields>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="status" family="f" qualifier="st"/>
  </fields>
</table>
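
For anyone curious how such a mapping file could be consumed, here is a
minimal sketch that reads the <field> elements with the JDK's DOM parser and
builds a field name to "family:qualifier" map. MappingReader and
readFieldColumns are hypothetical names, and this is not how HbaseSerializer
itself is written; it only shows the lookup the mapping makes possible.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: turns the mapping xml into field -> "family:qualifier".
class MappingReader {
    static Map<String, String> readFieldColumns(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            // LinkedHashMap keeps the order in which fields appear in the file.
            Map<String, String> columns = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                columns.put(field.getAttribute("name"),
                        field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
            }
            return columns;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse mapping", e);
        }
    }
}

class MappingReaderDemo {
    public static void main(String[] args) {
        String xml =
            "<table name=\"webtable\" class=\"org.apache.nutch.storage.WebTableRow\">"
          + "<fields>"
          + "<field name=\"fetchTime\" family=\"f\" qualifier=\"ts\"/>"
          + "<field name=\"title\" family=\"p\" qualifier=\"t\"/>"
          + "</fields>"
          + "</table>";
        System.out.println(MappingReader.readFieldColumns(xml));
    }
}
```

Combined with change tracking on the row, a serializer would only need to
look up the columns for the fields that were actually set.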

Sorry for the long and rambling email. Feel free to ask if anything is
unclear (and I assume it must be, given my incoherent description) :)
-- 
Doğacan Güney
