From Chris Perluss <tradersan...@gmail.com>
Subject Re: Data Deduplication in HBase
Date Wed, 28 Aug 2013 08:26:36 GMT
It might help to pick a granularity level. For example, suppose you pick a
granularity level of 0.1.

Any piece of the song you receive should be broken down into segments no
longer than 0.1, with boundaries aligned on multiples of 0.1.

Example: you receive a piece of the song from 0.65 to 0.85.
You would break this into three segments:
0.65 to 0.70
0.70 to 0.80
0.80 to 0.85

These three segments would get written to three different rows.  The row
key would be the song identifier followed by the segment number.  The first
row would be "songId-0.6", the second "songId-0.7", and the third
"songId-0.8".

The first row is "songId-0.6" and not "songId-0.65" because you want all
pieces of the song between 0.6 and 0.7 to end up in the same row.  You do
this by rounding down to the start range for the segment.
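
To make the splitting and rounding concrete, here is a minimal sketch in
Python. It is not HBase code. The function names are made up for
illustration, and I'm assuming positions are stored as integer hundredths
(so 0.65 becomes 65 and the granularity 0.1 becomes 10) to avoid floating
point rounding problems.

```python
def split_piece(start, end, granularity):
    """Split [start, end) so that no segment crosses a granularity boundary."""
    if granularity <= 0 or not 0 <= start < end:
        raise ValueError("need granularity > 0 and 0 <= start < end")
    segments = []
    pos = start
    while pos < end:
        # Next aligned boundary strictly after pos.
        boundary = (pos // granularity + 1) * granularity
        seg_end = min(boundary, end)
        segments.append((pos, seg_end))
        pos = seg_end
    return segments


def row_key(song_id, seg_start, granularity=10):
    """Round seg_start down to its segment start and build the row key.

    Assumes positions are integer hundredths, so 60 is rendered as "0.6".
    """
    aligned = (seg_start // granularity) * granularity
    return f"{song_id}-{aligned / 100:.1f}"


segments = split_piece(65, 85, 10)
keys = [row_key("songId", seg_start) for seg_start, _ in segments]
```

For the example piece this gives the segments (65, 70), (70, 80), (80, 85)
and the row keys "songId-0.6", "songId-0.7", "songId-0.8".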

When writing the three example segments to HBase there will be two
scenarios.

The first scenario is that you have an entire segment to be saved. In the
above example this is the case for your piece that spans the 0.7 to 0.8
segment.  Since you have the entire segment you don't have to combine it
with any existing data.  So you can simply do a put and overwrite any
partial data that might happen to exist in that row.  If you configure your
column family to only store one version for each cell then this will
perform "deduping" for that segment because it will only keep your new,
complete version of that segment.

The other scenario is that you receive a part of a segment.  In this case
you will need to read in the row corresponding to your segment, combine
your new partial segment with any existing partial segment, then put the
combined segment back into HBase.
In the above example this applies to the 0.65 to 0.7 segment (and the 0.8
to 0.85 segment).
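
Deciding which of the two scenarios applies is a simple check. A rough
sketch in Python, again assuming positions are integer hundredths and
using made-up function names:

```python
def is_complete(seg_start, seg_end, granularity):
    """True if the piece covers one whole aligned segment."""
    return seg_start % granularity == 0 and seg_end - seg_start == granularity


def plan_write(seg_start, seg_end, granularity):
    """Return which write path a segment needs."""
    if is_complete(seg_start, seg_end, granularity):
        return "put"            # overwrite whatever is in the row
    return "get-merge-put"      # read the row, combine, write back


plans = [plan_write(s, e, 10) for s, e in [(65, 70), (70, 80), (80, 85)]]
```

Only the middle segment of the example takes the plain "put" path. The
first and last ones need the read and combine step.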

When you read the row at "songId-0.6",  if there is already data there you
will need to combine it with your new data.  E.g. if you found 0.63 to 0.67
you would combine it with 0.65 to 0.70 and end up with 0.63 to 0.70.  Then
write this segment back to HBase. If you have versions set to 1 then this
bigger segment will replace the smaller segment you had before, thus
"deduping" that particular segment.
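
The combine step could look something like the sketch below. It is only
an illustration: I'm assuming each piece is a (start, end, data) tuple
with one byte of data per position unit, that the two pieces overlap or
touch, and that the newer bytes win where they overlap. How you actually
lay out the audio bytes in the cell is up to you.

```python
def merge_partial(existing, new):
    """Merge two (start, end, data) pieces that overlap or touch."""
    e_start, e_end, e_data = existing
    n_start, n_end, n_data = new
    if len(e_data) != e_end - e_start or len(n_data) != n_end - n_start:
        raise ValueError("data length must equal end - start")
    if n_start > e_end or e_start > n_end:
        raise ValueError("pieces do not overlap or touch")
    start = min(e_start, n_start)
    end = max(e_end, n_end)
    buf = bytearray(end - start)
    # Copy the old bytes first, then the new ones, so new data wins on overlap.
    buf[e_start - start:e_end - start] = e_data
    buf[n_start - start:n_end - start] = n_data
    return start, end, bytes(buf)


# Existing 0.63 to 0.67 combined with new 0.65 to 0.70 (in hundredths).
merged = merge_partial((63, 67, b"abcd"), (65, 70, b"CDEFG"))
```

Here merged covers 63 to 70, matching the 0.63 to 0.70 result described
above.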

If you think overlapping segments will potentially be uploaded at the same
time then you will need to implement an optimistic locking model using
checkAndPut.  I would do this by defining one column to contain the song
data and another column to contain a row version.  I can go into more
detail if requested.
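
As a rough idea of what I mean, here is the retry loop in Python against
a small in-memory stand-in. FakeTable and its methods are hypothetical
and are not the HBase client API. They only mimic the compare-then-write
behaviour of checkAndPut on a "version" column, where an expected value
of None means "the column must not exist yet".

```python
class FakeTable:
    """In-memory stand-in for a table; NOT the HBase client API."""

    def __init__(self):
        self.rows = {}

    def get(self, row):
        return dict(self.rows.get(row, {}))

    def check_and_put(self, row, column, expected, new_cells):
        # Apply new_cells only if the column still holds the expected value.
        current = self.rows.get(row, {}).get(column)
        if current != expected:
            return False
        self.rows.setdefault(row, {}).update(new_cells)
        return True


def update_with_retry(table, row, merge_fn, max_attempts=5):
    """Read, merge, and write back; retry if another writer got in first."""
    for _ in range(max_attempts):
        cells = table.get(row)
        version = cells.get("version")          # None if the row is new
        new_data = merge_fn(cells.get("data"))
        new_version = 1 if version is None else version + 1
        new_cells = {"data": new_data, "version": new_version}
        if table.check_and_put(row, "version", version, new_cells):
            return new_version
    raise RuntimeError("too many concurrent updates to " + row)


table = FakeTable()
append = lambda old: (old or b"") + b"ab"
first = update_with_retry(table, "songId-0.6", append)
second = update_with_retry(table, "songId-0.6", append)
```

A writer holding a stale version fails the check, re-reads the row, and
merges again, so no update is silently lost.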

Here are the benefits of this design:
1.  Each row will have approximately the same size (KB of song data).  E.g.
you don't have to worry about someone uploading a 2 hour long epic folk
metal song (I'm looking at you, Moonsorrow!) and thus creating a cell too
big for HBase to efficiently handle.  This 2 hour long song will be broken
up over lots of rows.
2.  You can tune the row size by changing the granularity (before you go
into production!)
3.  For each upload request you will only need to Get a max of two segments
from HBase in order to append to a partial segment.  The only segments you
need to Get would be the partial segment at the beginning of the upload and
the partial segment at the end of the upload.  All segments between these
two are complete segments, so their entire contents can simply be Put into
the right row.
4.  Since you only Get two segments you will only read in a few hundred KB
of data in order to perform the update (amount read in depends on your
granularity). This is true no matter how much of the file has already been
uploaded.  In a non-segmented storage scenario where you stored the entire
file in one cell, if 30 MB had already been uploaded then a request to
upload an additional 100 KB would require reading in all 30 MB and writing
all 30.1 MB back to HBase.
5. You can easily and efficiently retrieve a completed song by performing a
Scan using the songId, i.e. Scan(rowStart="songId-", rowEnd="songId.").
"." is the next ASCII character after "-".
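
To show why that start/end pair selects exactly one song, here is a small
Python sketch that mimics a range scan over sorted row keys. The scan
function is a hypothetical stand-in, not the HBase Scan API, and it
assumes the start row is inclusive and the end row is exclusive.

```python
from bisect import bisect_left


def scan(sorted_keys, row_start, row_end):
    """Return keys k with row_start <= k < row_end, in sorted order."""
    lo = bisect_left(sorted_keys, row_start)
    hi = bisect_left(sorted_keys, row_end)
    return sorted_keys[lo:hi]


all_keys = sorted([
    "songIc-0.9",    # a different song that sorts before "songId-"
    "songId-0.6",
    "songId-0.7",
    "songId-0.8",
    "songId2-0.1",   # a different song that sorts after "songId."
])
song_rows = scan(all_keys, "songId-", "songId.")
```

song_rows holds only the three "songId-" rows. Keep in mind that row
keys sort as strings, so "songId-10.0" would sort before "songId-2.0".
If segment numbers can reach two digits you would probably want to
zero-pad them to a fixed width.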

Hope this helps!
 On Aug 27, 2013 10:13 PM, "Anand Nalya" <anand.nalya@gmail.com> wrote:

> The segments are completely random. The segments can have from no overlap
> to exact duplicates.
> Anand
> On 27 August 2013 19:49, Ted Yu <yuzhihong@gmail.com> wrote:
> > bq.  Will hbase do some sort of deduplication?
> >
> > I don't think so.
> >
> > What is the granularity of segment overlap ? In the above example, it
> seems
> > to be 0.5
> >
> > Cheers
> >
> >
> > On Tue, Aug 27, 2013 at 7:12 AM, Anand Nalya <anand.nalya@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I have a use case in which I need to store segments of mp3 files in
> > hbase.
> > > A song may come to the application in different overlapping segments.
> For
> > > example, a 5 min song can have the following segments
> 0-1,0.5-2,2-4,3-5.
> > As
> > > seen, some of the data is duplicate (3-4 is present in the last 2
> > > segments).
> > >
> > > What would be the ideal way of removing this duplicate storage? Will
> > snappy
> > > compression help here or do I need to write some logic over HBase?
> Also,
> > > what if I store a single segment multiple times. Will hbase do some
> sort
> > of
> > > deduplication?
> > >
> > > Regards,
> > > Anand
> > >
> >

