From Fuad Efendi <f...@efendi.ca>
Subject HBase Sample Schema
Date Mon, 22 Sep 2008 18:30:25 GMT

I found this basic sample and I'd like to confirm my understanding of
the use cases and best practices (applicability) of HBase. Thanks!

Sample (Ankur Goel, 27-March-08,  
http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via  
hbase-user@hadoop.apache.org or Nabble):

DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl
             picks up the seeds from this table that have status=0 (Not
             Visited) or status=2 (Visited, but ready for re-crawl) and
             feeds these seeds in batch to the different crawl engines
             that it knows about.

SCHEMA:      Column families below

         {"referer_id:", "100"}, // Integer here is Max_Length
         {"last_crawl_date:", "1000"},
         {"next_crawl_date:", "1000"},
         {"strike:", "100"},

Modified Schema & Analysis (Fuad Efendi):

My understanding is that we need to scan the whole table in order to
find records where (for instance) "last_crawl_date" is earlier than a
specific point in time. Additionally, the crawler should be polite, so
the list of URLs to fetch should be evenly distributed across domains,
hosts, and IPs.
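
To make the two requirements above concrete, here is a minimal sketch
in plain Python (not HBase client code). The table is modelled as a
dict of url -> last_crawl_date, and the names due_urls and
interleave_by_host are hypothetical, introduced only for illustration.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def due_urls(rows, cutoff):
    """Full scan: visit every row and keep URLs crawled before `cutoff`.

    `rows` is an assumed in-memory stand-in for the table
    (url -> last_crawl_date); a real table would be scanned instead.
    """
    return [url for url, last_crawl in rows.items() if last_crawl < cutoff]


def interleave_by_host(urls):
    """Round-robin across hosts so no single host gets a burst of fetches."""
    by_host = {}
    for url in urls:
        # URLs in the sample have no scheme, so prefix "//" to parse the host.
        host = urlsplit("//" + url).netloc
        by_host.setdefault(host, []).append(url)
    rounds = zip_longest(*by_host.values())
    return [u for rnd in rounds for u in rnd if u is not None]
```

The cost of due_urls grows with the size of the whole table, not with
the number of URLs that are actually due, which is the problem the
rest of this message tries to avoid.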

A few ways to find records by "last_crawl_date" have been discussed
briefly in blogs, on the mailing list, etc.:
- use a scanner
- maintain an additional Lucene index
- run a MapReduce job (multithreaded, parallel) that outputs the list
  of URLs

My own possible solution; I need your feedback:

Simplified schema with two tables (non-transactional):

         {"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY (sorted row_id),

         {"url:","1500"}  PRIMARY KEY (sorted row_id),
         ... ... ...,

Table URL_TO_FETCH is initially seeded with root domain names and a
"dummy" last_crawl_date (a unique-per-host 'old-timestamp'):
00000000000000000001  www.website1.com
00000000000000000002  www.website2.com
00000000000000000003  www.website3.com
00000000000000000004  www.website4.com

After a successful fetch of the initial URLs:
00000000010000000001  www.website1.com/page1
00000000010000000002  www.website2.com/page1
00000000010000000003  www.website3.com/page1
00000000010000000004  www.website4.com/page1
00000000020000000001  www.website1.com/page2
00000000020000000002  www.website2.com/page2
00000000020000000003  www.website3.com/page2
00000000020000000004  www.website4.com/page2
00000000030000000001  www.website1.com/page3
00000000030000000002  www.website2.com/page3
00000000030000000003  www.website3.com/page3
00000000030000000004  www.website4.com/page3
0000000000xxxxxxxxxx  www.website1.com
0000000000xxxxxxxxxx  www.website2.com
0000000000xxxxxxxxxx  www.website3.com
0000000000xxxxxxxxxx  www.website4.com

(xxxxxxxxxx is the "current time in milliseconds", i.e. the timestamp
of a successful fetch)
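
A minimal sketch of how such a 20-character row key could be built,
assuming two zero-padded 10-digit fields as in the listing above. The
function name make_row_key is hypothetical. One assumption needs
flagging: an epoch timestamp in milliseconds has 13 digits, so the
sketch only accepts values that fit in 10 digits (for example epoch
seconds).

```python
FIELD_WIDTH = 10  # digits per field, matching the 10 + 10 layout above


def make_row_key(internal_link_id, last_crawl_date):
    """Concatenate two zero-padded integers into one sortable row key.

    Zero-padding makes lexicographic (byte) order agree with numeric
    order, which is what a sorted row_id needs.
    """
    limit = 10 ** FIELD_WIDTH
    for value in (internal_link_id, last_crawl_date):
        if not 0 <= value < limit:
            raise ValueError(f"{value} does not fit in {FIELD_WIDTH} digits")
    return f"{internal_link_id:0{FIELD_WIDTH}d}{last_crawl_date:0{FIELD_WIDTH}d}"
```

For example, make_row_key(1, 2) gives "00000000010000000002". Because
the key is compared left to right, the leading field dominates the
sort order: every row with internal_link_id 0 sorts before every row
with internal_link_id 1, whatever the timestamp part is.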

What we have: we don't need an additional Lucene index; we don't need
a MapReduce job to populate the list of items to be fetched (the way
it is done in Nutch); we don't need thousands of per-host scanners; we
have a mutable primary key; all new records are inserted at the
beginning of the table; fetched items are moved to the end of the
table.
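
One caveat on the "mutable primary key": HBase cannot edit a row key
in place, so "moving" a row means writing it under the new key and
deleting the old row. A minimal sketch of that step, with the table
modelled as a plain dict (row key -> URL); rekey is a hypothetical
helper, not an HBase API.

```python
def rekey(table, old_key, new_key):
    """Move a row to a new key: write the new row, then delete the old one.

    The two steps are not atomic. Writing first means a failure between
    them leaves a duplicate row rather than a lost URL.
    """
    if old_key not in table:
        raise KeyError(old_key)
    if new_key != old_key and new_key in table:
        raise ValueError(f"row key already in use: {new_key}")
    table[new_key] = table[old_key]
    if new_key != old_key:
        del table[old_key]
```

Since the schema is non-transactional, a crawler built this way would
have to tolerate the same URL briefly appearing under two keys.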

The second (helper) table is indexed by URL:
         {"url:","1500"}  PRIMARY KEY (sorted row_id),
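
A minimal sketch of how the helper table might be used, assuming it
maps each URL to that URL's current row key in URL_TO_FETCH. Both
tables are modelled as dicts, and add_if_new and current_key are
hypothetical names used only for illustration.

```python
def add_if_new(url_to_fetch, url_index, url, row_key):
    """Insert a newly discovered URL unless the helper table already has it.

    url_to_fetch: row key -> URL (stand-in for the main table)
    url_index:    URL -> row key (stand-in for the helper table)
    Returns True if the URL was added, False if it was already known.
    """
    if url in url_index:
        return False
    url_to_fetch[row_key] = url
    url_index[url] = row_key
    return True


def current_key(url_index, url):
    """Look up where a URL currently lives in the main table, or None."""
    return url_index.get(url)
```

Whenever a row in the main table is moved to a new key, the helper
table entry for that URL would need to be updated as well, otherwise
current_key returns a stale key.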

Am I right? It looks cool that with extremely low cost I can maintain  
specific "reordering" by mutable primary key following crawl-specific  


