hbase-user mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject RE: HBase Sample Schemas
Date Fri, 28 Mar 2008 05:28:28 GMT
Hi Bryan,

Here is the sample schema I have (it looks close to an RDBMS design, I
know).

TABLE:  	 seed_list

DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl controller
             picks up the seeds from this table that have status=0 (Not
             Visited) or status=2 (Visited, but ready for re-crawl) and
             feeds them in batches to the crawl engines that it knows about.
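
To make the selection step concrete, here is a minimal sketch in plain Java of how the controller might pick and batch seeds. It works on an in-memory list rather than on HBase, and the names Seed, SeedSelector, selectDue and batch are hypothetical, made up for illustration only. The status codes 0 and 2 are the ones from the description; any other value is simply treated as "not due".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one seed_list row; not an HBase class.
final class Seed {
    static final int NOT_VISITED = 0;       // status=0 in the description
    static final int READY_FOR_RECRAWL = 2; // status=2 in the description

    final String url;
    final int status;

    Seed(String url, int status) {
        this.url = url;
        this.status = status;
    }
}

final class SeedSelector {
    // Keep only the seeds whose status marks them as due for crawling.
    static List<Seed> selectDue(List<Seed> seeds) {
        List<Seed> due = new ArrayList<>();
        for (Seed s : seeds) {
            if (s.status == Seed.NOT_VISITED || s.status == Seed.READY_FOR_RECRAWL) {
                due.add(s);
            }
        }
        return due;
    }

    // Split the due seeds into batches of at most batchSize seeds each.
    static List<List<Seed>> batch(List<Seed> due, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Seed>> batches = new ArrayList<>();
        for (int i = 0; i < due.size(); i += batchSize) {
            int end = Math.min(i + batchSize, due.size());
            batches.add(new ArrayList<>(due.subList(i, end)));
        }
        return batches;
    }
}
```

Each batch would then be handed to one crawl engine. How the engines are chosen is not covered here.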
             
SCHEMA:      Column families below

        {"referer_id:", "100"}, // the integer is the max length
        {"url:", "1500"},
        {"site:", "500"},
        {"last_crawl_date:", "1000"},
        {"next_crawl_date:", "1000"},
        {"create_date:", "100"},
        {"status:", "100"},
        {"strike:", "100"},
        {"language:", "150"},
        {"topic:", "500"},
        {"depth:", "100000"}

Common attributes are [max versions: 1,  compression: NONE, in memory:
false, block cache enabled: true, max length: 100, bloom filter: none]
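
For reference, the {name, max length} pairs above can be held as a plain String[][] and turned into typed values before any table is created. The sketch below is plain Java with no HBase dependency; FamilySpec, SeedListSchema and parse are hypothetical names chosen for illustration, and the code assumes that the per-family number is the one to use as the max length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value holder for one column family; this is not an HBase class.
final class FamilySpec {
    final String name;   // family name, including the trailing ':'
    final int maxLength; // per-family max length from the schema listing

    FamilySpec(String name, int maxLength) {
        this.name = name;
        this.maxLength = maxLength;
    }
}

final class SeedListSchema {
    // The {name, max length} pairs as listed for seed_list.
    static final String[][] FAMILIES = {
        {"referer_id:", "100"},
        {"url:", "1500"},
        {"site:", "500"},
        {"last_crawl_date:", "1000"},
        {"next_crawl_date:", "1000"},
        {"create_date:", "100"},
        {"status:", "100"},
        {"strike:", "100"},
        {"language:", "150"},
        {"topic:", "500"},
        {"depth:", "100000"}
    };

    // Convert the string pairs into typed specs, rejecting malformed entries.
    static List<FamilySpec> parse(String[][] pairs) {
        List<FamilySpec> specs = new ArrayList<>();
        for (String[] pair : pairs) {
            if (pair.length != 2 || !pair[0].endsWith(":")) {
                throw new IllegalArgumentException("expected {\"name:\", \"maxLength\"}");
            }
            specs.add(new FamilySpec(pair[0], Integer.parseInt(pair[1])));
        }
        return specs;
    }
}
```

The same parse method would also accept an entry written as Integer.MAX_VALUE + "", since that string parses back to Integer.MAX_VALUE.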


TABLE: 	 web_content

DESCRIPTION: Used to store information retrieved after crawling a URL.
             Each crawl engine provides information about the URL it
             crawled. This information is then stored in this table
             depending upon the profile settings (what should be stored?)
SCHEMA:      Column families below

        {"url:", "1500"},
        {"site:", "500"},
        {"content_type:", "100"},
        {"title:", "1000"},
        {"content:", Integer.MAX_VALUE + ""},
        {"parsed_text:", Integer.MAX_VALUE + ""},
        {"crawl_date:", "1000"},
        {"last_modified_date:", "100"},
        {"http_headers:", "10000"},
        {"content_length:", "11"},
        {"outlinks_count:", "100000"}

Common attributes are [max versions: 1,  compression: BLOCK, in memory:
false, block cache enabled: true, max length: 100, bloom filter: none]
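
On the "depending upon the profile settings" part of the web_content description, one possible reading is that a profile is just the set of column families that should be kept for a crawl result. The sketch below shows that reading in plain Java with no HBase dependency. ProfileFilter and applyProfile are hypothetical names, and representing a profile as a Set of family names is an assumption, not something the schema above defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class ProfileFilter {
    // Keep only the crawled values whose column family is enabled in the profile.
    // Assumption: a profile is modelled as the set of family names to store.
    static Map<String, String> applyProfile(Map<String, String> crawled,
                                            Set<String> enabledFamilies) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : crawled.entrySet()) {
            if (enabledFamilies.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

With a profile that enables only "url:" and "title:", a crawl result that also carries "content:" would be stored without the raw content.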

Please feel free to suggest modifications/enhancements for a
column-oriented design.

Thanks
-Ankur

-----Original Message-----
From: Bryan Duxbury [mailto:bryan@rapleaf.com] 
Sent: Friday, March 28, 2008 10:33 AM
To: hbase-user@hadoop.apache.org
Subject: HBase Sample Schemas

All,

One of the more common types of questions we get from people new to
HBase is about the differences in schema between HBase and
relational databases. So that we can generate some good examples of
RDBMS schemas and their counterparts as they might be represented in
HBase, could you guys post some small (1-5 entities) schemas that you
might be interested in using and a few sentences about how you'd like to
consume them? We can then discuss possible options and see how things
might look. This will also help Stack, Jim, and myself to notice
interesting access patterns we might want to support.

Thanks in advance,

Bryan
