From: "Jonathan Gray" <jlist@streamy.com>
To: <hbase-user@hadoop.apache.org>
Subject: RE: timestamps / versions revisited
Date: Wed, 3 Mar 2010 16:48:21 -0800

Yes, you could have issues if data has the same timestamp (only one of
them being returned).
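
To make the collision concrete, here is a toy model in Python. It is not
HBase code. It assumes a cell is identified by (row, column, timestamp),
so a second write to the same coordinates hides the first from readers.
The ToyStore class and its methods are hypothetical names used only for
illustration, and the model simply lets the later write win; it makes no
claim about which of the two values a real cluster would return.

```python
class ToyStore:
    """Toy model: one visible value per (row, column, timestamp)."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        # A write to existing coordinates shadows the earlier value.
        self._cells[(row, column, timestamp)] = value

    def versions(self, row, column):
        # Return (timestamp, value) pairs for one column, newest first.
        found = [(ts, v) for (r, c, ts), v in self._cells.items()
                 if r == row and c == column]
        return sorted(found, reverse=True)


store = ToyStore()
store.put("host1", "log:line", 1267663701, "first entry")
store.put("host1", "log:line", 1267663701, "second entry")  # same second
store.put("host1", "log:line", 1267663702, "third entry")

visible = store.versions("host1", "log:line")
print(visible)
```

Three writes went in, but only two versions come back, which matches the
"missing data" symptom described in the original question.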

As far as inserting things not in chronological order, there are no
issues if you are doing scans and not deleting anything. If you're
asking for the latest version of something with a Get, there are some
edge cases where an earlier-timestamped cell could be returned because
it was inserted after a later timestamp. There are also some issues
where a delete could stick around and override an insert that followed
the delete.

Because of these issues, one thing you can consider to give yourself
more control and not be subject to the current limitations/behavior of
the built-in timestamps is to push your custom timestamps into your row
keys or columns.

I would need to know more about your schema to help further with that,
but I often recommend this approach if you are manually controlling the
timestamps.
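
A minimal sketch of the row-key idea in Python, under assumptions the
thread does not confirm: the key layout (a source id, a zero byte, an
8-byte big-endian microsecond timestamp, then a 4-byte sequence number)
and the function names make_row_key and parse_row_key are hypothetical.
Big-endian unsigned integers are used because their byte order matches
their numeric order, so keys for one source sort chronologically.

```python
import struct

# Hypothetical suffix layout: 8-byte timestamp (micros) + 4-byte sequence,
# both big-endian unsigned so byte-wise sorting matches numeric sorting.
_SUFFIX = struct.Struct(">QI")


def make_row_key(source: str, ts_micros: int, seq: int) -> bytes:
    """Build a key that sorts by source, then time, then sequence."""
    return source.encode("utf-8") + b"\x00" + _SUFFIX.pack(ts_micros, seq)


def parse_row_key(key: bytes):
    """Split a key produced by make_row_key back into its parts."""
    source = key[: -_SUFFIX.size - 1].decode("utf-8")
    ts_micros, seq = _SUFFIX.unpack(key[-_SUFFIX.size:])
    return source, ts_micros, seq


keys = [
    make_row_key("web01", 1267663701000000, 1),
    make_row_key("web01", 1267663700000000, 0),
    make_row_key("web01", 1267663701000000, 0),  # same timestamp, new seq
]
ordered = sorted(keys)  # byte-wise order, like a sorted key space
print([parse_row_key(k) for k in ordered])
```

With the time in the key, two entries that share a timestamp occupy
different rows instead of competing as versions of one cell, and the
order in which they were inserted no longer matters.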

JG

-----Original Message-----
From: Bluemetrix Development [mailto:bmdevelopment@gmail.com]
Sent: Wednesday, March 03, 2010 3:25 PM
To: hbase-user@hadoop.apache.org
Subject: timestamps / versions revisited

Hi,
I've been having a few issues with certain versions of data "missing"
and/or less data than expected in my tables. I've read over quite a
few old threads and I think I understand what is actually happening,
but just wanted to possibly confirm my thinking. I hope I am not
rehashing too many old topics here.

First, I am using the timestamp as an added dimension to my data.
Basically, it's log data and the timestamp of the log entry is the same
as the timestamp for each cell. I want to keep all the data, so I have
max versions for the column set to Long.MAX_VALUE.

For my main table, there should be at most one cell per second, so the
versioning here is working as expected. However, for my other tables I
could have many cells per second, possibly thousands.

My understanding now is that if I have two cells with the exact same
timestamp, one will be removed on a major compaction (which would
explain why data had seemed to be "missing" to me). The values are
different, so I end up losing data when the duplicate-timestamped cells
are removed. To solve this, I plan to use microseconds, which should
avoid duplicates so that no cells are removed on major compaction.
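
Finer resolution lowers the chance of a collision but does not rule it
out by itself. One way to guarantee distinct values inside a single
writer is to use the microsecond digits as a per-second counter. The
Python sketch below assumes that approach; PerSecondSequencer is a
hypothetical name, the low digits are a counter rather than real
sub-second time, and nothing here coordinates between parallel writers
such as separate MapReduce tasks.

```python
from collections import defaultdict


class PerSecondSequencer:
    """Turn second-resolution timestamps into unique microsecond values."""

    def __init__(self):
        # Next unused offset for each second seen so far.
        self._next_offset = defaultdict(int)

    def unique_micros(self, ts_seconds: int) -> int:
        offset = self._next_offset[ts_seconds]
        if offset >= 1_000_000:
            # The counter must stay inside the microsecond digits.
            raise OverflowError("more than 1,000,000 entries in one second")
        self._next_offset[ts_seconds] = offset + 1
        return ts_seconds * 1_000_000 + offset


seq = PerSecondSequencer()
# Entries arrive out of chronological order, three of them in one second.
stamps = [seq.unique_micros(s)
          for s in (1267663701, 1267663700, 1267663701, 1267663701)]
print(stamps)
```

Because the counter is tracked per second, arrival order does not
matter, though the table of counters grows with the number of distinct
seconds seen.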

However, I'm wondering if I have another problem. To insert data, I run
a MapReduce process, and the data is not inserted in any chronological
order. If timestamps are inserted in random order, could this result in
cells being removed on major compaction? I think I recall something to
the effect that if a cell with an earlier timestamp is inserted after a
cell with a later timestamp, the earlier-timestamped cell is ignored (or
removed on major compaction). Is that the case, or have I totally made
that part up? If so, it would seem that I should not use the
timestamp/version at all, but rather create another column for this time
data and just use the current time in microseconds for the timestamp at
insertion.

Finally, with regard to the number of versions: I could possibly have
tens or hundreds of millions of versions per row. If I pull back the row
and try to loop over the KVs, is this going to read the whole thing into
memory at once and possibly cause an OOME?

Apologies for the long mail.
Thanks
J