From: "tgh"
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase
Date: Thu, 6 Dec 2012 16:01:11 +0800

Meanwhile, we need Lucene to retrieve messages by keyword or by message content (after NLP parsing), without a timestamp or message ID; this is a time-critical operation.

We also read one hour of data at a time, not through Lucene but by table name. If we use an hour-granularity timestamp as the table name, e.g. 2012120612 for the table holding the data for 12:00 on Dec 6, 2012, each such table holds about 100 to 200 million messages; this operation is not very time-critical.

If we keep 365*24 tables for one year, does that work? Or, if we put one year of data in ONE table, will it be faster than many tables, and why? How does HBase manage ONE table, and how does it handle many tables?

I am really confused. Could you help me?

Thank you

------------------------------------
Tian Guanhua
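For concreteness, the hourly read described above can be expressed against a single table as a bounded key-range scan, assuming the timestamp-leading row key suggested below in this thread. This is a minimal sketch only; the class and method names are illustrative, and it assumes row keys begin with an 8-byte big-endian epoch-millis stamp:

    import java.text.SimpleDateFormat;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: read "one hour of data" (e.g. the hour 2012120612) from ONE
    // table by scanning a key range, instead of keeping a table per hour.
    public class HourScan {
        public static void scanHour(HTable table, String yyyyMMddHH) throws Exception {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHH");
            long start = fmt.parse(yyyyMMddHH).getTime();
            long end = start + 3600 * 1000L; // one hour later; stop row is exclusive
            Scan scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(end));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // process one message row
                }
            } finally {
                scanner.close();
            }
        }
    }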
-----Original Message-----
From: user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 6, 2012 15:27
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

Thank you for your reply.

In my case, we need to use the Lucene search engine to retrieve short messages from HBase, and this operation is time-critical.

We also need to access the last hour's data in HBase, that is, read one hour of data out of HBase. This operation is not very time-critical, and one hour of data is about 100 to 200 million messages.

Meanwhile, when Lucene retrieves data from HBase it may get 1K or 100K messages as results, and we need to guarantee this is fast enough.

In this case, if we use one table, then whenever Lucene fetches a message, HBase has to locate it among 100 billion messages; if we use 365*24 tables, or 365 tables, HBase has to search far fewer messages per table. I am really confused: why is ONE table more suitable than many tables?

Could you give me some help?

Thank you

-------------------------
Tian Guanhua

-----Original Message-----
From: user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian Varley
Sent: December 6, 2012 11:44
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these messages that incorporates (leads with) the timestamp; then have Lucene use that as the key when retrieving any given message. For example, the ID could consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll hot-spot one region server; see http://hbase.apache.org/book.html#timeseries for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a particular message by this (combined) unique ID. There are also types of UUIDs that work this way. But with that much data, you may want to tune it to get the smallest possible row key; depending on the granularity of your timestamp and how unique the "unique" part really needs to be, you might be able to get this down to < 16 bytes. (Consider that the smallest possible unique representation of 100B items is about 37 bits - that is, log base 2 of 100 billion; but because you also want time to be a part of the key, you probably can't get anywhere near that small.)
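A minimal sketch of such a composite key follows, assuming an 8-byte epoch-millis timestamp followed by an 8-byte unique id. The 16-byte layout and the class name are illustrative assumptions; as noted above, a tuned key could be smaller:

    import java.nio.ByteBuffer;

    // Sketch: a {timestamp} + {unique id} row key, 16 bytes total.
    // The leading timestamp keeps rows time-ordered (enabling range scans
    // over a time period); the trailing id disambiguates messages written
    // within the same millisecond.
    public class MessageKey {
        public static byte[] rowKey(long epochMillis, long uniqueId) {
            return ByteBuffer.allocate(16)
                    .putLong(epochMillis) // leads with time, as suggested above
                    .putLong(uniqueId)
                    .array();
        }
    }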
If you need to scan over LOTS of data (as opposed to just looking up single messages, or small sequential chunks of messages), consider just writing the data to a file in HDFS and using map/reduce to process it. Scanning all 100B of your records won't be possible in any short time frame (by my estimate that would take about 10 hours), but you could do that with map/reduce using an asynchronous model.

One table is still best for this; read up on what regions are and why they mean you don't need multiple tables for the same data: http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase: http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for this, it'd need its own storage (though there are indeed projects that run Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian

On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply.

I want to access the data with the Lucene search engine, that is, retrieve any message by key, and I also want to get one hour of data together, so I am thinking of splitting the data into one-hour tables. Or, if I store it in one big table, is that better than 365 tables or 365*24 tables? Which one is best for my access pattern? I am also confused about how to build a secondary index in HBase if I use a keyword search engine such as Lucene.

Could you help me?

Thank you

-------------
Tian Guanhua

-----Original Message-----
From: user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask the question: "How will I access it?" Perhaps you could reply with the sorts of queries you expect to be able to run over this data? For example: retrieve any single conversation between two people in < 10 ms; or show all conversations that happened in a single hour, regardless of participants. HBase only gives you fast GET/SCAN access along a single "primary" key (the row key), so you must choose it carefully, or else duplicate & denormalize your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B messages x 1K bytes per message on average comes out to 100TB. That, plus 3x replication in HDFS, means you need roughly 300TB of space. If you have 13 nodes (taking out 2 for redundant master services), that's a requirement of about 23TB of space per server. That's a lot, even these days. Did I get all that math right?

On your question about multiple tables: a table in HBase is only a namespace for rowkeys, and a container for a set of regions. If it's a homogeneous data set, there's no advantage to breaking it into multiple tables; that's what regions within the table are for.

Ian

ps - Please don't cross-post to both dev@ and user@.
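To illustrate the "regions, not tables" point, here is a minimal sketch of creating one pre-split messages table with the 0.94-era admin API. The table name, column family, and monthly split boundaries are assumptions for illustration only; pre-splitting shows how a single table is internally partitioned, though with a purely time-leading key, current writes still land in the newest region (the hot-spotting caveat above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: ONE "messages" table, pre-split into 12 regions at roughly
    // monthly timestamp boundaries -- regions, not separate tables, do
    // the partitioning within a year of data.
    public class CreateMessagesTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("messages");
            desc.addFamily(new HColumnDescriptor("m")); // one family for message fields
            byte[][] splits = new byte[11][];
            long monthMillis = 30L * 24 * 3600 * 1000; // crude ~30-day boundary
            long jan1Utc = 1325376000000L;             // 2012-01-01T00:00:00Z
            for (int i = 0; i < splits.length; i++) {
                splits[i] = Bytes.toBytes(jan1Utc + (i + 1) * monthMillis);
            }
            admin.createTable(desc, splits); // 11 split keys => 12 regions
            admin.close();
        }
    }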
On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi

I am trying to use HBase to store 100 billion short text messages. Each message has less than 1000 characters plus some other items; that is, each message has fewer than 10 items. The whole data set is a stream spanning about one year, and I want to create multiple tables to store it. I have two ideas: one is to store each hour's data in its own table, giving 365*24 tables for one year; the other is to store each day's data in its own table, giving 365 tables for one year.

I have about 15 computer nodes to handle this data, and I want to know how best to deal with it: the 365*24-table design, the 365-table design, or some better idea.

I am really confused about HBase; it is powerful yet a bit complex for me, isn't it?

Could you give me some advice on the HBase data schema and related matters? Could you help me?

Thank you

---------------------------------
Tian Guanhua
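As a sketch of how such a message might be laid out, one row per message, with its fewer-than-10 items as columns in a single family. The family and qualifier names here are invented for illustration, and the row key follows the {timestamp} + {unique id} form discussed above in the thread:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: store one short message as one row; each of its items
    // becomes a qualifier in the single "m" family.
    public class StoreMessage {
        static final byte[] FAMILY = Bytes.toBytes("m");

        public static void store(HTable table, byte[] rowKey,
                                 String text, String sender) throws Exception {
            Put put = new Put(rowKey);
            put.add(FAMILY, Bytes.toBytes("text"), Bytes.toBytes(text));     // < 1000 chars
            put.add(FAMILY, Bytes.toBytes("sender"), Bytes.toBytes(sender)); // one of the ~10 items
            // ...further qualifiers for the remaining items...
            table.put(put);
        }
    }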