hbase-user mailing list archives

From N Kapshoo <nkaps...@gmail.com>
Subject Re: Long vs String for qualifier
Date Mon, 21 Jun 2010 17:44:55 GMT
Now here is my conundrum:
I would be running both queries very often. The UI shows count(status=Y)
on the very first page and then, if the user opens a listing, shows all
the other info and the status for each doc.

Is it a bad idea to store the status both in a new ColumnFamily and in a
qualifier? The same data would live in 2 places, but it would help the
read performance of both queries, right?

When you say append a byte, I assume something like this, am I right?

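// Java arrays are fixed-size, so writing to arr[arr.length] would throw
// ArrayIndexOutOfBoundsException; build a new, longer array instead.
byte[] arr = Bytes.add(Bytes.toBytes(docId), new byte[] { '0' });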
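
For illustration, here is a minimal, HBase-free sketch of the composite qualifier idea from the quoted replies: an 8-byte big-endian docId followed by one type byte. The class name, the helper names and the two type constants are hypothetical, and the unsigned byte-wise comparator only stands in for the lexicographic byte ordering HBase applies to qualifiers. It assumes non-negative docIds; negative longs would sort after positive ones under this encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

class CompositeQualifier {
    // Hypothetical type markers; any two distinct byte values would do.
    static final byte TYPE_INFO = 0;   // per-user doc info column
    static final byte TYPE_STATUS = 1; // status column

    // Encode as <8-byte big-endian docId><1 type byte>.
    static byte[] encode(long docId, byte type) {
        return ByteBuffer.allocate(Long.BYTES + 1).putLong(docId).put(type).array();
    }

    // Read the docId back from the first 8 bytes.
    static long docIdOf(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getLong();
    }

    // The type byte is the last byte of the qualifier.
    static byte typeOf(byte[] qualifier) {
        return qualifier[Long.BYTES];
    }

    public static void main(String[] args) {
        // Unsigned lexicographic byte order (Java 9+), used here as a stand-in
        // for how qualifiers are ordered within a row.
        Comparator<byte[]> byteOrder = Arrays::compareUnsigned;
        TreeSet<byte[]> sorted = new TreeSet<>(byteOrder);

        // Insert in arbitrary order; the set sorts by docId, then by type.
        for (long id : new long[] {300L, 2L, 10L}) {
            sorted.add(encode(id, TYPE_STATUS));
            sorted.add(encode(id, TYPE_INFO));
        }
        for (byte[] q : sorted) {
            System.out.println(docIdOf(q) + " type=" + typeOf(q));
        }
    }
}
```

Running main lists docIds 2, 10, 300 in numeric order, each with its info entry (type=0) immediately followed by its status entry (type=1). By contrast, the decimal strings "10" and "2" would sort with "10" first, which is why the fixed-width binary long keeps the ordering by docId intact.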

Thanks so much for your help.

On Mon, Jun 21, 2010 at 12:33 PM, Jonathan Gray <jgray@facebook.com> wrote:
> Got it.
>
> Well, you could do what you're describing below, appending something at the end of the
> docId to notate that it's the status column.  You wouldn't need to use a "_status" string,
> could be as simple as appending an additional byte of type information.
>
> Another option is to break status into a separate column family.
>
> What are the most common queries and which query is most critical performance-wise?
>
> Are you most interested in "give me all docs and their statuses for user X" or more like
> "give me the info for doc Y" or "give me status for doc Z"?
>
> If the first one, then seems like adding a type byte after the docId would make the most
> sense and be most optimal.
>
> JG
>
>> -----Original Message-----
>> From: N Kapshoo [mailto:nkapshoo@gmail.com]
>> Sent: Monday, June 21, 2010 10:26 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: Long vs String for qualifier
>>
>> Thanks for the quick reply.
>>
>> I have a schema design based on ids because I actually have the ids as
>> rowids in another table. This is to avoid data redundancy since we
>> might have a big doc referenced by millions of users, but we don't want
>> to store a copy for every user. So,
>>
>> Table: Docs
>> Row: docId (long generated by incrementColumnValue)
>> ColFamily: Data
>>
>> Table: Users
>> Row: UserId
>> ColFamily: DocInfo
>> Qualifier: docId
>> Value: More information per user (JSON)
>>
>> Now in addition:
>> ColFamily: DocInfo
>> Qualifier: docId_status
>> Value: Status
>>
>> Now I want a status on each doc for each user. This status might
>> change several times.
>> The first column, docInfo, is static; its value doesn't change once
>> inserted. However, the status can be toggled back and forth (between Y
>> and N).
>>
>> The docs per user should always be sorted by docId.
>>
>> How would you design it? I am not sure how I can get the values into
>> the qualifiers when it should be sorted by docId always. Thank you.
>>
>> On Mon, Jun 21, 2010 at 12:12 PM, Jonathan Gray <jgray@facebook.com>
>> wrote:
>> > Can you describe your schema a bit more?  Could you use versioning
>> > instead of incrementing IDs on the qualifiers?
>> >
>> > Also, you could consider having a composite value, so id1_asLong
>> > would have a value that contained both val1 and val5 in your example.
>> > You could use any number of serialization strategies (comma-separated,
>> > JSON, Thrift/protobuf, Writable, etc).
>> >
>> > If you want them as two columns, I would recommend that things you
>> > want to retrieve together be neighboring.  For example, you might make
>> > the qualifiers a composite type of <id_as_long><qf_type>, so
>> > <id1_asLong><0byte> for the existing stuff and <id1_asLong><1byte> for
>> > status?  That way they are stored sequentially so optimally efficient
>> > at read time.
>> >
>> > JG
>> >
>> >> -----Original Message-----
>> >> From: N Kapshoo [mailto:nkapshoo@gmail.com]
>> >> Sent: Monday, June 21, 2010 9:59 AM
>> >> To: hbase-user@hadoop.apache.org
>> >> Subject: Long vs String for qualifier
>> >>
>> >> I have a 'long' number that I get by using
>> >> HTable.'incrementColumnValue'. This long is used as the qualifier id
>> >> on a columnFamily.
>> >>
>> >> Now I need to add a prefix 'status' so that I can store another value
>> >> in the same family.
>> >>
>> >> How should I consider String vs long sorting?
>> >>
>> >> So right now:
>> >>
>> >> colFamily: id1_asLong = val1
>> >> colFamily: id2_asLong = val2
>> >> colFamily: id3_asLong = val3
>> >> colFamily: id4_asLong = val4
>> >>
>> >> and in addition
>> >>
>> >> colFamily: status_id1_asString = val5
>> >> colFamily: status_id2_asString = val6
>> >> colFamily: status_id3_asString = val7
>> >> colFamily: status_id4_asString = val8
>> >>
>> >> To make sure that 'id' values are sorted and accessed sequentially,
>> >> should I change my design so that the id1_asLong is stored as
>> >> id1_asString?
>> >> When I do my Get, I always get id1_asLong and status_id1_asString
>> >> together.
>> >>
>> >> Thanks.
>> >
>
