From: Michael Segel
Subject: Re: data partitioning and data model
Date: Mon, 23 Feb 2015 18:09:31 -0600
To: user@hbase.apache.org

Yes and no.

It's a bit more complicated: it is also data dependent, and it depends on how you're using the data.

I wouldn't go too thin and I wouldn't go too fat.

> On Feb 20, 2015, at 2:19 PM, Alok Singh wrote:
>
> You don't want a lot of columns in a write-heavy table. HBase stores
> the "row key" along with each cell/column (though old, I find this
> still useful: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html).
> Having a lot of columns will amplify the amount of data being stored.
>
> That said, if there are only going to be a handful of alert_ids for a
> given "user_id + timestamp" row key, then you should be OK.
>
> The query "Select * from table where user_id = X and timestamp > T and
> (alert_id = id1 or alert_id = id2)" can be accomplished with either
> design. See the QualifierFilter and FuzzyRowFilter docs to get some ideas.
>
> Alok
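For what it's worth, below is a rough sketch of the filter-based scan Alok is describing, run against the "user_id + timestamp" row key with one column per alert_id. It is only an illustration of the idea, not a drop-in answer: the table name ("alerts"), column family ("d"), example ids and the fixed-width key layout are all assumptions, and it uses the 1.x-era HBase Java client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AlertScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("alerts"))) {

      byte[] user = Bytes.toBytes("user42");                        // hypothetical user_id
      long t = System.currentTimeMillis() - 10L * 24 * 3600 * 1000; // "timestamp > T"

      // Row key = user_id + timestamp: bound the scan to one user, newer than T.
      Scan scan = new Scan();
      scan.setStartRow(Bytes.add(user, Bytes.toBytes(t)));
      scan.setStopRow(Bytes.add(user, Bytes.toBytes(Long.MAX_VALUE)));

      // alert_id is the column qualifier prefix: keep only id1 or id2.
      FilterList alertFilter = new FilterList(FilterList.Operator.MUST_PASS_ONE,
          new QualifierFilter(CompareFilter.CompareOp.EQUAL,
              new BinaryPrefixComparator(Bytes.toBytes("id1"))),
          new QualifierFilter(CompareFilter.CompareOp.EQUAL,
              new BinaryPrefixComparator(Bytes.toBytes("id2"))));
      scan.setFilter(alertFilter);

      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          System.out.println(r);   // each Result holds the matching alert columns for one row
        }
      }
    }
  }
}

A MultipleColumnPrefixFilter would express the same OR in a single filter, and FuzzyRowFilter is the row-key-side equivalent if alert_id is baked into the key instead of the qualifier.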
> On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> wrote:
>> Hi Alok,
>>
>> Thanks for the answer. Yes, I have read this section, but it was a little too abstract for me; I needed to check my understanding. Your answer helped me confirm I am on the right path, thanks for that.
>>
>> One question: if instead of using user_id + timestamp + alert_id I use user_id + timestamp as the row key, I would still be able to store alert_id + alert_data in columns, right?
>>
>> I took the idea from the last section of this link: http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>>
>> But I wonder which option would be better for my case. It seems column scans are not as fast as row scans, but what would be the advantages of one design over the other?
>>
>> If I use something like:
>> Row key: user_id + timestamp
>> Column prefix: alert_id
>> Column value: JSON with alert data
>>
>> Would I be able to do a query like the one below?
>> Select * from table where user_id = X and timestamp > T and (alert_id = id1 or alert_id = id2)
>>
>> Would I be able to do the same query using user_id + timestamp + alert_id as the row key?
>>
>> Also, I know Cassandra supports up to 2 billion columns per row (2 billion rows per partition in CQL); do you know what the limit is for HBase?
>>
>> Best regards,
>> Marcelo Valle.
>>
>> From: aloksingh@gmail.com
>> Subject: Re: data partitioning and data model
>>
>> You can use a key like (user_id + timestamp + alert_id) to get
>> clustering of rows related to a user. To get better write throughput
>> and distribution over the cluster, you could pre-split the table and
>> use a consistent hash of the user_id as a row key prefix.
>>
>> Have you looked at the rowkey design section in the HBase book?
>> http://hbase.apache.org/book.html#rowkey.design
>>
>> Alok
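As a rough sketch of the pre-split plus hashed-prefix layout Alok suggests: a small salt derived from user_id is prepended to the key, and the table is created with one region per salt bucket. The bucket count, table and family names here are made up for illustration; because the salt is a function of user_id, all rows for a user land in the same bucket, so reads for one user don't need to fan out.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedAlertsSketch {
  static final int BUCKETS = 16;   // illustrative; size to your cluster

  // Row key = salt (1 byte) + user_id + timestamp + alert_id
  static byte[] rowKey(String userId, long ts, String alertId) {
    byte salt = (byte) ((userId.hashCode() & 0x7fffffff) % BUCKETS);
    return Bytes.add(new byte[] { salt },
        Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(ts), Bytes.toBytes(alertId)));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Pre-split: one region per salt bucket, so writes spread from the start.
      byte[][] splits = new byte[BUCKETS - 1][];
      for (int i = 1; i < BUCKETS; i++) {
        splits[i - 1] = new byte[] { (byte) i };
      }
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("alerts"));
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc, splits);
    }
  }
}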
>> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
>> wrote:
>>> Hello,
>>>
>>> This is my first message on this mailing list; I just subscribed.
>>>
>>> I have been using Cassandra for the last few years and now I am trying to create a POC using HBase. I am reading the HBase docs, but it's been really hard to find out how HBase behaves in some situations compared to Cassandra. I thought it might be a good idea to ask here, as people on this list probably know the differences better than anyone else.
>>>
>>> What I want to do is create a simple application optimized for writes (I am not interested in HBase / Cassandra product comparisons here; I am assuming I will use HBase and that's it, I just want to understand the best way of doing this in the HBase world). I want to be able to write alerts to the cluster, where each alert would have columns like:
>>> - alert id
>>> - user id
>>> - date/time
>>> - alert data
>>>
>>> Later, I want to search for alerts per user, so my main query could be considered to be something like:
>>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>>
>>> I want to decide the data model for my application.
>>>
>>> Here are my questions:
>>>
>>> - In Cassandra, I would partition by user + day, as some users can have many alerts and some just one or a few. In HBase, assuming all alerts for a user would always fit in a single partition / region, can I just use user_id as my row key and assume data will be distributed across the cluster?
>>>
>>> - Suppose I want to write 100,000 rows from a client machine, and these are from 30,000 users. What's the best way to write them if I want to optimize for writes? Should I batch all 100k requests into one request to a single server? As I am trying to optimize for writes, I would like to split these requests across several nodes instead of sending them all to one. I found this article: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but I'm not sure if it's what I need.
>>>
>>> Thanks in advance!
>>>
>>> Best regards,
>>> Marcelo.
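On the batching question in Marcelo's last message: the client does not need to funnel all 100k puts through a single server. The HBase client groups buffered mutations by region and sends each batch to the region server that owns that key range, so with a reasonable key design the writes spread themselves. Below is a minimal sketch using BufferedMutator (available since HBase 1.0); the table and family names, the 30k-user loop and the JSON payload are placeholders, and with the salted layout from the earlier sketch you would prepend the one-byte salt to each row key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] family = Bytes.toBytes("d");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("alerts"))) {
      for (int i = 0; i < 100_000; i++) {
        String userId = "user" + (i % 30_000);                  // ~30k distinct users
        byte[] row = Bytes.add(Bytes.toBytes(userId),           // row key = user_id + timestamp
                               Bytes.toBytes(System.currentTimeMillis()));
        Put put = new Put(row);
        put.addColumn(family, Bytes.toBytes("alert" + i),       // alert_id as column qualifier
                      Bytes.toBytes("{\"msg\":\"...\"}"));      // alert data as JSON value
        mutator.mutate(put);   // buffered client-side, flushed in batches per region server
      }
      mutator.flush();         // push anything still sitting in the buffer
    }
  }
}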