From: "Jonathan Gray"
Reply-To: hbase-user@hadoop.apache.org
Subject: RE: timestamps / versions revisited
Date: Wed, 3 Mar 2010 16:48:21 -0800
Message-ID: <06fc01cabb34$639827d0$2ac87770$@com>
In-Reply-To: <40d25e011003031524o2cdaf4a8tfe5b7f8d567b2896@mail.gmail.com>

Yes, you could have issues if data has the same timestamp (only one of them being returned).

As far as inserting things not in chronological order, there are no issues if you are doing scans and not deleting anything. If you're asking for the latest version of something with a Get, there are some edge cases where an earlier timestamped cell could be returned because it was inserted after a later timestamp. There are also some issues where a delete could stick around and override an insert that followed the delete.

Because of these issues, one thing you can consider to give yourself more control and not be subject to the current limitations/behavior of the built-in timestamps is to push your custom timestamps into your row keys or columns.

Would need to know more about your schema to help further with that but I often recommend that if you are manually controlling the timestamps.

JG

-----Original Message-----
From: Bluemetrix Development [mailto:bmdevelopment@gmail.com]
Sent: Wednesday, March 03, 2010 3:25 PM
To: hbase-user@hadoop.apache.org
Subject: timestamps / versions revisited

Hi,

I've been having a few issues with certain versions of data "missing" and/or less data than expected in my tables.
I've read over quite a few old threads and I think I understand what is actually happening, but just wanted to confirm my thinking. I hope I am not rehashing too many old topics here.

First, I am using the timestamp as an added dimension to my data. Basically, it's log data and the timestamp of the log entry is the same as the timestamp for each cell. I want to keep all the data, so I have max versions for the column set to Long.MAX_VALUE.

For my main table, there should be at most one cell per second, so the versioning here is working as expected. However, for my other tables I could have many cells per second, in the 1000s. My understanding now is that if I have two cells with the exact same timestamp, one will be removed on a major compact (which would explain why data had seemed to be "missing" to me). The values are different, so I end up losing data when the duplicate timestamped cells are removed.

To solve this, I plan to use microseconds, which should solve the problem of duplicates so that no cells are removed on major compact.

However, I'm wondering if I have another problem. To insert data, I run a MapReduce process and the data is not inserted in any chronological order. If timestamps are inserted in random order, will this possibly result in cells being removed at major compact? I think I recall something to the effect that if an earlier timestamp is inserted after a later timestamp, the earlier timestamped cell is ignored (or removed on major compact)? Is that the case, or have I totally made that part up? If so, it would seem that I should not use the timestamp/version at all, but rather create another column for this time data and just use the current time in microseconds for the timestamp at insertion.

Finally, with regard to the number of versions: I could possibly have 10s or 100s of millions of versions per row.
If I pull back the row and try to loop over the KVs, is this going to read the whole thing into memory at once and possibly cause an OOME?

Apologies for the long mail.

Thanks
J
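
A minimal sketch of the "push your custom timestamps into your row keys" suggestion from the reply, in plain Java with no HBase client calls. Everything here is an assumption made for illustration: the key layout (source name, a zero byte, the event time in microseconds, a per-writer sequence number) and the names LogRowKey, rowKey, source and seq are hypothetical and do not come from the thread. The one thing it relies on is that HBase orders row keys as unsigned bytes, lexicographically, so a big-endian non-negative long keeps rows in time order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LogRowKey {

    // Hypothetical layout: <source bytes> 0x00 <8-byte event time, micros> <4-byte sequence>.
    // The sequence number (assumed non-negative) keeps two entries that share
    // the same microsecond from mapping to the same cell.
    static byte[] rowKey(String source, long eventMicros, int seq) {
        byte[] prefix = source.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + 1 + Long.BYTES + Integer.BYTES)
                .put(prefix)
                .put((byte) 0)          // separator between source and time
                .putLong(eventMicros)   // big-endian, so byte order matches time order
                .putInt(seq)
                .array();
    }

    // Unsigned lexicographic comparison, the order in which row keys are sorted.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        long t = 1267663701000000L; // example event time in microseconds
        byte[] a = rowKey("web01", t, 0);
        byte[] b = rowKey("web01", t, 1);
        System.out.println(Arrays.equals(a, b));       // false: same time, distinct keys
        System.out.println(compareUnsigned(a, b) < 0); // true: a sorts before b
    }
}
```

With a layout along these lines, every log entry gets its own row, so insertion order and the number of versions kept per cell stop mattering, and a scan over a key range returns the entries for one source in time order.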