From: Michael Segel
Subject: Re: data partitioning and data model
Date: Mon, 23 Feb 2015 18:09:31 -0600
To: user@hbase.apache.org

Yes and no.

It's a bit more complicated: it is also data dependent, and it depends on how you're using the data.

I wouldn't go too thin and I wouldn't go too fat.

> On Feb 20, 2015, at 2:19 PM, Alok Singh wrote:
>
> You don't want a lot of columns in a write-heavy table. HBase stores
> the "row key" along with each cell/column (though old, I find this
> still useful: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html).
> Having a lot of columns will amplify the amount of data being stored.
>
> That said, if there are only going to be a handful of alert_ids for a
> given "user_id + timestamp" row key, then you should be OK.
>
> The query "Select * from table where user_id = X and timestamp > T and
> (alert_id = id1 or alert_id = id2)" can be accomplished with either
> design. See the QualifierFilter and FuzzyRowFilter docs to get some ideas.
>
> Alok
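For what it's worth, below is a rough sketch of the filter-based scan Alok is describing, run against the "user_id + timestamp" row key with one column per alert_id. It is only an illustration of the idea, not a drop-in answer: the table name ("alerts"), column family ("d"), example ids and the fixed-width key layout are all assumptions, and it uses the 1.x-era HBase Java client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AlertScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("alerts"))) {

      byte[] user = Bytes.toBytes("user42");                        // hypothetical user_id
      long t = System.currentTimeMillis() - 10L * 24 * 3600 * 1000; // "timestamp > T"

      // Row key = user_id + timestamp: bound the scan to one user, newer than T.
      Scan scan = new Scan();
      scan.setStartRow(Bytes.add(user, Bytes.toBytes(t)));
      scan.setStopRow(Bytes.add(user, Bytes.toBytes(Long.MAX_VALUE)));

      // alert_id is the column qualifier prefix: keep only id1 or id2.
      FilterList alertFilter = new FilterList(FilterList.Operator.MUST_PASS_ONE,
          new QualifierFilter(CompareFilter.CompareOp.EQUAL,
              new BinaryPrefixComparator(Bytes.toBytes("id1"))),
          new QualifierFilter(CompareFilter.CompareOp.EQUAL,
              new BinaryPrefixComparator(Bytes.toBytes("id2"))));
      scan.setFilter(alertFilter);

      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          System.out.println(r);   // each Result holds the matching alert columns for one row
        }
      }
    }
  }
}

A MultipleColumnPrefixFilter would express the same OR in a single filter, and FuzzyRowFilter is the row-key-side equivalent if alert_id is baked into the key instead of the qualifier.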
> On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> wrote:
>> Hi Alok,
>>
>> Thanks for the answer. Yes, I have read this section, but it was a little too abstract for me; I needed to check my understanding. Your answer helped me confirm I am on the right path, thanks for that.
>>
>> One question: if instead of using user_id + timestamp + alert_id I use user_id + timestamp as the row key, I would still be able to store alert_id + alert_data in columns, right?
>>
>> I took the idea from the last section of this link: http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>>
>> But I wonder which option would be better for my case. It seems column scans are not as fast as row scans, but what would be the advantages of one design over the other?
>>
>> If I use something like:
>> Row key: user_id + timestamp
>> Column prefix: alert_id
>> Column value: JSON with alert data
>>
>> Would I be able to do a query like the one below?
>> Select * from table where user_id = X and timestamp > T and (alert_id = id1 or alert_id = id2)
>>
>> Would I be able to do the same query using user_id + timestamp + alert_id as the row key?
>>
>> Also, I know Cassandra supports up to 2 billion columns per row (2 billion rows per partition in CQL); do you know what the limit is for HBase?
>>
>> Best regards,
>> Marcelo Valle.
>>
>> From: aloksingh@gmail.com
>> Subject: Re: data partitioning and data model
>>
>> You can use a key like (user_id + timestamp + alert_id) to get
>> clustering of rows related to a user. To get better write throughput
>> and distribution over the cluster, you could pre-split the table and
>> use a consistent hash of the user_id as a row key prefix.
>>
>> Have you looked at the rowkey design section in the HBase book?
>> http://hbase.apache.org/book.html#rowkey.design
>>
>> Alok
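As a rough sketch of the pre-split plus hashed-prefix layout Alok suggests: a small salt derived from user_id is prepended to the key, and the table is created with one region per salt bucket. The bucket count, table and family names here are made up for illustration; because the salt is a function of user_id, all rows for a user land in the same bucket, so reads for one user don't need to fan out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedAlertsSketch {
  static final int BUCKETS = 16;   // illustrative; size to your cluster

  // Row key = salt (1 byte) + user_id + timestamp + alert_id
  static byte[] rowKey(String userId, long ts, String alertId) {
    byte salt = (byte) ((userId.hashCode() & 0x7fffffff) % BUCKETS);
    return Bytes.add(new byte[] { salt },
        Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(ts), Bytes.toBytes(alertId)));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Pre-split: one region per salt bucket, so writes spread from the start.
      byte[][] splits = new byte[BUCKETS - 1][];
      for (int i = 1; i < BUCKETS; i++) {
        splits[i - 1] = new byte[] { (byte) i };
      }
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("alerts"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc, splits);
    }
  }
}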
>> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
>> wrote:
>>> Hello,
>>>
>>> This is my first message on this mailing list; I just subscribed.
>>>
>>> I have been using Cassandra for the last few years and now I am trying to create a POC using HBase. I am reading the HBase docs, but it's been really hard to find out how HBase behaves in some situations compared to Cassandra. I thought it might be a good idea to ask here, as people on this list probably know the differences better than anyone else.
>>>
>>> What I want to do is create a simple application optimized for writes (I am not interested in HBase / Cassandra product comparisons here; I am assuming I will use HBase and that's it, I just want to understand the best way of doing this in the HBase world). I want to be able to write alerts to the cluster, where each alert would have columns like:
>>> - alert id
>>> - user id
>>> - date/time
>>> - alert data
>>>
>>> Later, I want to search for alerts per user, so my main query could be considered to be something like:
>>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>>
>>> I want to decide the data model for my application.
>>>
>>> Here are my questions:
>>>
>>> - In Cassandra, I would partition by user + day, as some users can have many alerts and some just one or a few. In HBase, assuming all alerts for a user would always fit in a single partition / region, can I just use user_id as my row key and assume data will be distributed across the cluster?
>>>
>>> - Suppose I want to write 100,000 rows from a client machine, and these are from 30,000 users. What's the best way to write them if I want to optimize for writes? Should I batch all 100k requests into one request to a single server? As I am trying to optimize for writes, I would like to split these requests across several nodes instead of sending them all to one. I found this article: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but I'm not sure if it's what I need.
>>>
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Marcelo.
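On the batching question in Marcelo's last message: the client does not need to funnel all 100k puts through a single server. The HBase client groups buffered mutations by region and sends each batch to the region server that owns that key range, so with a reasonable key design the writes spread themselves. Below is a minimal sketch using BufferedMutator (available since HBase 1.0); the table and family names, the 30k-user loop and the JSON payload are placeholders, and with the salted layout from the earlier sketch you would prepend the one-byte salt to each row key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] family = Bytes.toBytes("d");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("alerts"))) {
      for (int i = 0; i < 100_000; i++) {
        String userId = "user" + (i % 30_000);                  // ~30k distinct users
        byte[] row = Bytes.add(Bytes.toBytes(userId),           // row key = user_id + timestamp
                               Bytes.toBytes(System.currentTimeMillis()));
        Put put = new Put(row);
        put.addColumn(family, Bytes.toBytes("alert" + i),       // alert_id as column qualifier
                      Bytes.toBytes("{\"msg\":\"...\"}"));      // alert data as JSON value
        mutator.mutate(put);   // buffered client-side, flushed in batches per region server
      }
      mutator.flush();         // push anything still sitting in the buffer
    }
  }
}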