From: "tgh"
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase
Date: Thu, 6 Dec 2012 16:01:11 +0800

Meanwhile, we need Lucene to retrieve messages by keyword or by message content (after NLP parsing), without a timestamp or message ID; this is a time-critical operation.

We also read one hour of data at a time, not through Lucene but by table name. If we use an hour-granularity timestamp as the table name, e.g. 2012120612 for the table holding the data for 12:00 on Dec 6, 2012, each such table holds about 100 to 200 million messages; this operation is not very time-critical.

If we keep 365*24 tables for one year, does that work? Or, if we put one year of data in ONE table, will it be faster than many tables, and why? How does HBase manage ONE table, and how does it handle many tables?

I am really confused. Could you help me?

Thank you

------------------------------------
Tian Guanhua
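For concreteness, the hourly read described above can be expressed against a single table as a bounded key-range scan, assuming the timestamp-leading row key suggested below in this thread. This is a minimal sketch only; the class and method names are illustrative, and it assumes row keys begin with an 8-byte big-endian epoch-millis stamp:

    import java.text.SimpleDateFormat;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: read "one hour of data" (e.g. the hour 2012120612) from ONE
    // table by scanning a key range, instead of keeping a table per hour.
    public class HourScan {
        public static void scanHour(HTable table, String yyyyMMddHH) throws Exception {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHH");
            long start = fmt.parse(yyyyMMddHH).getTime();
            long end = start + 3600 * 1000L; // one hour later; stop row is exclusive
            Scan scan = new Scan(Bytes.toBytes(start), Bytes.toBytes(end));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // process one message row
                }
            } finally {
                scanner.close();
            }
        }
    }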
-----Original Message-----
From: user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32260-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of tgh
Sent: December 6, 2012 15:27
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

Thank you for your reply.

In my case, we need to use the Lucene search engine to retrieve short messages from HBase, and this operation is time-critical.

We also need to access the last hour's data in HBase, that is, read one hour of data out of HBase. This operation is not very time-critical, and one hour of data is about 100 to 200 million messages.

Meanwhile, when Lucene retrieves data from HBase it may get 1K or 100K messages as results, and we need to guarantee this is fast enough.

In this case, if we use one table, then whenever Lucene fetches a message, HBase has to locate it among 100 billion messages; if we use 365*24 tables, or 365 tables, HBase has to search far fewer messages per table. I am really confused: why is ONE table more suitable than many tables?

Could you give me some help?

Thank you

-------------------------
Tian Guanhua

-----Original Message-----
From: user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32251-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian Varley
Sent: December 6, 2012 11:44
To: user@hbase.apache.org
Subject: Re: Re: how to store 100billion short text messages with hbase

In this case, your best bet may be to come up with an ID structure for these messages that incorporates (leads with) the timestamp; then have Lucene use that as the key when retrieving any given message. For example, the ID could consist of:

{timestamp} + {unique id}

(Beware: if you're going to load data with this schema in real time, you'll hot-spot one region server; see http://hbase.apache.org/book.html#timeseries for considerations related to this.)

Then, you can either scan over all data from one time period, or GET a particular message by this (combined) unique ID. There are also types of UUIDs that work this way. But with that much data, you may want to tune it to get the smallest possible row key; depending on the granularity of your timestamp and how unique the "unique" part really needs to be, you might be able to get this down to < 16 bytes. (Consider that the smallest possible unique representation of 100B items is about 37 bits - that is, log base 2 of 100 billion; but because you also want time to be a part of the key, you probably can't get anywhere near that small.)
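A minimal sketch of such a composite key follows, assuming an 8-byte epoch-millis timestamp followed by an 8-byte unique id. The 16-byte layout and the class name are illustrative assumptions; as noted above, a tuned key could be smaller:

    import java.nio.ByteBuffer;

    // Sketch: a {timestamp} + {unique id} row key, 16 bytes total.
    // The leading timestamp keeps rows time-ordered (enabling range scans
    // over a time period); the trailing id disambiguates messages written
    // within the same millisecond.
    public class MessageKey {
        public static byte[] rowKey(long epochMillis, long uniqueId) {
            return ByteBuffer.allocate(16)
                    .putLong(epochMillis) // leads with time, as suggested above
                    .putLong(uniqueId)
                    .array();
        }
    }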
If you need to scan over LOTS of data (as opposed to just looking up single messages, or small sequential chunks of messages), consider just writing the data to a file in HDFS and using map/reduce to process it. Scanning all 100B of your records won't be possible in any short time frame (by my estimate that would take about 10 hours), but you could do that with map/reduce using an asynchronous model.

One table is still best for this; read up on what regions are and why they mean you don't need multiple tables for the same data: http://hbase.apache.org/book.html#regions.arch

There are no secondary indexes in HBase: http://hbase.apache.org/book.html#secondary.indexes. If you use Lucene for this, it'd need its own storage (though there are indeed projects that run Lucene on top of HBase: http://www.infoq.com/articles/LuceneHbase).

Ian

On Dec 5, 2012, at 9:28 PM, tgh wrote:

Thank you for your reply.

I want to access the data with the Lucene search engine, that is, retrieve any message by key, and I also want to get one hour of data together, so I am thinking of splitting the data into one-hour tables. Or, if I store it in one big table, is that better than 365 tables or 365*24 tables? Which one is best for my access pattern? I am also confused about how to build a secondary index in HBase if I use a keyword search engine such as Lucene.

Could you help me?

Thank you

-------------
Tian Guanhua

-----Original Message-----
From: user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org [mailto:user-return-32247-guanhua.tian=ia.ac.cn@hbase.apache.org] On Behalf Of Ian Varley
Sent: December 6, 2012 11:01
To: user@hbase.apache.org
Subject: Re: how to store 100billion short text messages with hbase

Tian,

The best way to think about how to structure your data in HBase is to ask the question: "How will I access it?" Perhaps you could reply with the sorts of queries you expect to be able to run over this data? For example: retrieve any single conversation between two people in < 10 ms; or show all conversations that happened in a single hour, regardless of participants. HBase only gives you fast GET/SCAN access along a single "primary" key (the row key), so you must choose it carefully, or else duplicate & denormalize your data for fast access.

Your data size seems reasonable (but not overwhelming) for HBase. 100B messages x 1K bytes per message on average comes out to 100TB. That, plus 3x replication in HDFS, means you need roughly 300TB of space. If you have 13 nodes (taking out 2 for redundant master services), that's a requirement of about 23TB of space per server. That's a lot, even these days. Did I get all that math right?

On your question about multiple tables: a table in HBase is only a namespace for rowkeys, and a container for a set of regions. If it's a homogeneous data set, there's no advantage to breaking it into multiple tables; that's what regions within the table are for.

Ian

ps - Please don't cross-post to both dev@ and user@.
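To illustrate the "regions, not tables" point, here is a minimal sketch of creating one pre-split messages table with the 0.94-era admin API. The table name, column family, and monthly split boundaries are assumptions for illustration only; pre-splitting shows how a single table is internally partitioned, though with a purely time-leading key, current writes still land in the newest region (the hot-spotting caveat above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: ONE "messages" table, pre-split into 12 regions at roughly
    // monthly timestamp boundaries -- regions, not separate tables, do
    // the partitioning within a year of data.
    public class CreateMessagesTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("messages");
            desc.addFamily(new HColumnDescriptor("m")); // one family for message fields
            byte[][] splits = new byte[11][];
            long monthMillis = 30L * 24 * 3600 * 1000; // crude ~30-day boundary
            long jan1Utc = 1325376000000L;             // 2012-01-01T00:00:00Z
            for (int i = 0; i < splits.length; i++) {
                splits[i] = Bytes.toBytes(jan1Utc + (i + 1) * monthMillis);
            }
            admin.createTable(desc, splits); // 11 split keys => 12 regions
            admin.close();
        }
    }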
On Dec 5, 2012, at 8:51 PM, tgh wrote:

Hi

I am trying to use HBase to store 100 billion short text messages. Each message has less than 1000 characters plus some other items; that is, each message has fewer than 10 items. The whole data set is a stream spanning about one year, and I want to create multiple tables to store it. I have two ideas: one is to store each hour's data in its own table, giving 365*24 tables for one year; the other is to store each day's data in its own table, giving 365 tables for one year.

I have about 15 computer nodes to handle this data, and I want to know how best to deal with it: the 365*24-table design, the 365-table design, or some better idea.

I am really confused about HBase; it is powerful yet a bit complex for me, isn't it?

Could you give me some advice on the HBase data schema and related matters? Could you help me?

Thank you

---------------------------------
Tian Guanhua
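As a sketch of how such a message might be laid out, one row per message, with its fewer-than-10 items as columns in a single family. The family and qualifier names here are invented for illustration, and the row key follows the {timestamp} + {unique id} form discussed above in the thread:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: store one short message as one row; each of its items
    // becomes a qualifier in the single "m" family.
    public class StoreMessage {
        static final byte[] FAMILY = Bytes.toBytes("m");

        public static void store(HTable table, byte[] rowKey,
                                 String text, String sender) throws Exception {
            Put put = new Put(rowKey);
            put.add(FAMILY, Bytes.toBytes("text"), Bytes.toBytes(text));     // < 1000 chars
            put.add(FAMILY, Bytes.toBytes("sender"), Bytes.toBytes(sender)); // one of the ~10 items
            // ...further qualifiers for the remaining items...
            table.put(put);
        }
    }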