hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: Corresponding table in Hbase
Date Wed, 22 Feb 2012 15:32:05 GMT

HBase doesn't have the concept of a globally unique auto-incrementing "ID" column; that would
require that all PUTs to any region of a table first go through some central ID authority
to get a unique ID, and that sort of goes against the general HBase approach (in which operations
on regions are totally independent of each other, for unbounded horizontal scalability).

That said, there are a few ways to achieve what it seems like you want:
 - You could create a natural compound row key composed of (for example) a hash of the URL
plus a timestamp. That way, you would be guaranteed that two crawls of the same URL appear
as different rows (assuming they can't happen at the same millisecond).
 - You could alternately use a UUID of some sort as the row key, but the advantage of using
URL_hash + timestamp is that you can find all the rows for a particular URL just by knowing
the URL; you don't need any external index.
 - You could also "roll your own" global ID creation counter in HBase using a table with a
counter in it, and use the atomic increment function to get unique values. That would still
serialize all PUT operations, but it would be done in your code (not automatically in HBase).
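
To make the first and third options concrete, here is a minimal sketch in Python. The function
names, the choice of a truncated SHA-256 as the URL hash, and the counter row and column names
are all illustrative assumptions, not anything HBase prescribes. next_page_id() assumes a
happybase-style table object whose counter_inc() wraps HBase's atomic increment.

```python
import hashlib
import struct
import time


def url_hash(url):
    """Fixed-width 16-byte hash of the URL (truncated SHA-256, an arbitrary choice)."""
    return hashlib.sha256(url.encode("utf-8")).digest()[:16]


def crawl_row_key(url, timestamp_ms=None):
    """Compound row key: URL hash followed by an 8-byte big-endian millisecond timestamp.

    Big-endian packing makes byte order match numeric order, so all crawls of
    one URL sort together, oldest first.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return url_hash(url) + struct.pack(">Q", timestamp_ms)


def next_page_id(counter_table, row=b"page_id", column=b"c:next"):
    """Roll-your-own global ID: atomically bump a single counter cell.

    Assumes counter_table behaves like a happybase Table; the row and column
    names are hypothetical. Every writer contends on this one cell.
    """
    return counter_table.counter_inc(row, column)
```

Scanning with url_hash(url) as the row prefix would then return every crawl of that URL without
any external index, which is the advantage over a plain UUID key.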

Remember that HBase doesn't have secondary indexes like the three you've defined below. If
you want to look up the data in HBase by these fields, you must either make the field the
leading part of the row key (rows are stored sorted by row key), or else manually write the
information, denormalized, into "index-like" tables that you maintain yourself (noting that
there's no transactional protection on this operation like in a relational database, so you
must account for more failure scenarios). All of this is a reminder that unless your data
size is so massive that a relational database simply can't accommodate it, you're likely
giving yourself more problems by using HBase rather than an RDBMS.
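
As a rough illustration of a hand-maintained index table, here is a sketch in Python. It assumes
happybase-style table objects with a put(row, data) method; the function names, the "ref:row"
column, and the NUL separator are made up for the example.

```python
def index_row_key(field_value, data_row_key):
    """Index key: the indexed value, a NUL separator, then the data row key.

    Appending the data row key keeps index entries unique when several rows
    share the same field value.
    """
    return field_value.encode("utf-8") + b"\x00" + data_row_key


def put_with_index(data_table, index_table, row_key, heading, columns):
    """Write the data row, then the index entry that points back to it.

    The two puts are NOT atomic. If the second one fails, the data row exists
    without an index entry, so the caller has to retry or repair later.
    """
    data_table.put(row_key, columns)
    index_table.put(index_row_key(heading, row_key), {b"ref:row": row_key})
```

Writing the data row first means a failure leaves a row that is missing from the index rather
than an index entry pointing at nothing; which of the two is easier to live with depends on
your application.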

Also: you might look at the O'Reilly book, "HBase: The Definitive Guide" by the esteemed Mr. Lars
George; in it, he uses a running example of a URL shortener application that might give you
some ideas about your use case.


On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:

Dear all,

After some R&D on schema design in HBase, I am confused about how to design
the schema corresponding to a table in MySQL.

CREATE TABLE `page_content` (
 `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' COMMENT 'unique value for each url',
 `link_level` tinyint(4) DEFAULT NULL,
 `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
 `isvalid` tinyint(4) NOT NULL DEFAULT '1',
 `isanalyzed` tinyint(4) NOT NULL DEFAULT '0',
 `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the records are in analyzing phase',
 `content_language` varchar(10) DEFAULT NULL,
 `url_id` bigint(20) NOT NULL,
 `publishing_date` varchar(40) DEFAULT NULL,
 `heading` varchar(150) DEFAULT NULL,
 `category` varchar(150) DEFAULT NULL,
 `crawled_page_url` varchar(500) NOT NULL,
 `keywords` varchar(500) DEFAULT NULL,
 `content` mediumtext,
 PRIMARY KEY (`crawled_page_id`),
 KEY `idx_url` (`crawled_page_url`),
 KEY `idx_head` (`heading`),
 KEY `idx_dtstamp` (`dt_stamp`)
);

In all the examples I find, the reversed URL is the row key in HBase, but
in MySQL I create an auto-increment column that uniquely locates a document.
Can anyone suggest what the corresponding table in HBase would be?

