hbase-user mailing list archives

From "鞠適存" <chihchun....@gmail.com>
Subject Re: data duplicate?
Date Mon, 01 Dec 2008 07:33:59 GMT
Hi, Andy,

Thanks for your kind response. I know HBase allows multiple versions of
data.
I'm sorry that I didn't describe my question clearly in the previous post. I
wrote my code as follows:
      ...
      // Use one timestamp for the whole row update.
      long timestamp = System.nanoTime();
      BatchUpdate bu = new BatchUpdate(splits[0], timestamp);
      // splits[0] is the row key; the rest are <colKey> <colContent> pairs,
      // so column keys sit at odd indices and their contents at even ones.
      for (int i = 2; i < splits.length; i += 2) {
          bu.put("UserData:" + splits[i - 1], splits[i].getBytes());
      }
      table.commit(bu);
      ....

My code reads a flat file with the format: <row-key> <col1_key>
<col1_content> <col2_key> <col2_content>...
Since the timestamp is set once per row (one BatchUpdate per row), why is
the data stored in duplicate?
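To make the loop above concrete, here is a minimal, self-contained sketch of the key/value pairing it performs, outside of HBase (the class and method names here are hypothetical, for illustration only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitPairs {
    // splits[0] is the row key; the remaining tokens alternate between
    // column keys (odd indices) and column contents (even indices).
    static Map<String, String> pairColumns(String[] splits) {
        Map<String, String> cols = new LinkedHashMap<String, String>();
        for (int i = 2; i < splits.length; i += 2) {
            cols.put("UserData:" + splits[i - 1], splits[i]);
        }
        return cols;
    }

    public static void main(String[] args) {
        String[] splits = "row1 photo_id 2264.jpg trail_id t42".split(" ");
        System.out.println(pairColumns(splits));
        // {UserData:photo_id=2264.jpg, UserData:trail_id=t42}
    }
}
```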


2008/11/28 Andrew Purtell <apurtell@apache.org>

> Chu,
>
> There is no uniqueness test performed when data is stored
> into a cell. If your schema allows multiple versions and
> you store the same data into the cell more than once at
> different timestamps, queries will return the "duplicates"
> you presented.
>
> If you are trying to avoid duplicates, use a row key that
> uniquely identifies an object (such as a SHA-1 hash) and
> set MAX_VERSIONS to 1 on the column that should contain
> only one canonical entry. Then if you store the same data
> item more than once, a replacement will happen instead of
> an addition.
>
> Hope this helps,
>
>   - Andy
>
> > From: 鞠適存 <chihchun.chu@gmail.com>
> > Subject: data duplicate?
> > To: hbase-user@hadoop.apache.org
> > Date: Thursday, November 27, 2008, 7:31 PM
> > Hi,
> >
> > I revised the "Bulk Import" sample code written
> > by Allen Day to upload a
> > flat data file to an HBase table.
> > My table schema is designed as: <row key>
> > <ColFamily1:colKey> <ColFamily2:colKey>.
> > The table description found by hbase shell is as follows:
> > {NAME => 'ATCGeo', IS_ROOT => 'false', IS_META => 'false', FAMILIES => [
> >   {NAME => 'photo_id', BLOOMFILTER => 'false', VERSIONS => '30000',
> >    COMPRESSION => 'NONE', LENGTH => '2147483647', TTL => '-1',
> >    IN_MEMORY => 'true', BLOCKCACHE => 'true'},
> >   {NAME => 'trail_id', BLOOMFILTER => 'false', VERSIONS => '30000',
> >    COMPRESSION => 'NONE', LENGTH => '2147483647', TTL => '-1',
> >    IN_MEMORY => 'true', BLOCKCACHE => 'true'}]}
> >
> > Some of the data was found duplicated, with the same
> > content but different timestamps. For example,
> > I used: get '<table>',
> > '<rowkey>', {COLUMN => 'col1', VERSIONS => 30000}
> > the results are:
> > timestamp=3090896685592411, value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2265.jpg
> > timestamp=3090896682597411, value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2264.jpg
> > timestamp=3090731558521386, value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2265.jpg
> > timestamp=3090731556503386, value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2264.jpg
> >
> > I am sure that the data in the original file is unique. Could
> > anyone tell me what the possible reasons are?
> > Any help would be appreciated!
> >
> > Chu
>
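For reference, Andy's suggestion of a content-derived row key could be sketched like this in plain Java. The SHA-1 hashing uses the standard MessageDigest API; using the hex digest as an HBase row key (together with MAX_VERSIONS = 1 on the column family) is an assumption about your schema, and the class and method names are hypothetical:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {
    // Derive a row key from the record's content, so that storing the
    // same record twice overwrites a single cell instead of piling up
    // extra versions under different timestamps.
    static String sha1RowKey(String content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(content.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // The same content always yields the same 40-character row key.
        System.out.println(sha1RowKey("2264.jpg"));
    }
}
```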