hbase-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: HBase Sample Schema
Date Tue, 23 Sep 2008 00:56:58 GMT
Probably this is a mistake in the design: 

1. URL_TO_FETCH 
         {"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY   


It should be reversed: "last_crawl_date" + "per_host_link_counter" + "host" 

[0000 +  0002 + www.website1.com]: www.website1.com/page2 
[0000 +  0002 + www.website2.com]: www.website2.com/page2 
[0000 +  0002 + www.website3.com]: www.website3.com/page2 
... 
[0000 +  0003 + www.website1.com]: www.website1.com/page3 
[0000 +  0003 + www.website2.com]: www.website2.com/page3 
[0000 +  0003 + www.website3.com]: www.website3.com/page3 
... 
[XXXX +  0000 + www.website1.com]: www.website1.com 
[XXXX +  0000 + www.website2.com]: www.website2.com 
[XXXX +  0000 + www.website3.com]: www.website3.com 
[XXXX +  0001 + www.website1.com]: www.website1.com/page1 
[XXXX +  0001 + www.website2.com]: www.website2.com/page1 
[XXXX +  0001 + www.website3.com]: www.website3.com/page1 


where XXXX is the timestamp: last_crawl_date (of the last successful crawl) 

Doing "delete" with "insert" instead of modifying PK; although it does not
matter for HBase (?) 
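
Roughly, in Java client code, assembling the reversed key and doing the
delete-plus-insert "move" could look like the sketch below. This uses the newer
HBase Java client API (Put/Delete/Table), not the 0.2 API this thread is about;
the helper names, the "url" family with an empty qualifier, and the key widths
are only illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReKeySketch {

    // "last_crawl_date" + "per_host_link_counter" + "host", zero-padded so rows
    // sort oldest-crawl-first and, within the same date, interleave hosts by
    // their per-host link counter (as in the example rows above).
    static byte[] reversedKey(long lastCrawlDate, long perHostCounter, String host) {
        return Bytes.toBytes(String.format("%013d%010d%s", lastCrawlDate, perHostCounter, host));
    }

    // "Moving" a row after a successful fetch: insert under the new key,
    // then delete the old key.
    static void moveRow(Table urlToFetch, byte[] oldKey, byte[] newKey, byte[] url)
            throws IOException {
        urlToFetch.put(new Put(newKey)
                .addColumn(Bytes.toBytes("url"), Bytes.toBytes(""), url));
        urlToFetch.delete(new Delete(oldKey));
    }
}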


Thanks... Any thoughts?


http://www.linkedin.com/in/liferay
==================================
Tokenizer Inc.
Project Management, Software Development
Natural Language Processing, Search Engines
==================================
http://www.tokenizer.org


> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca] 
> Sent: Monday, September 22, 2008 2:30 PM
> To: hbase-user@hadoop.apache.org
> Subject: HBase Sample Schema
> 
> 
> Hi,
> 
> I found this basic sample and I'd like to confirm my understanding of
> the use cases and best practices (applicability) of HBase... Thanks!
> =============
> 
> 
> Sample (Ankur Goel, 27-March-08,  
> http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via  
> hbase-user@hadoop.apache.org or Nabble):
> =============
> 
> DESCRIPTION: Used to store seed URLs (both old and newly discovered).
>              Initially populated with some seed URLs. The crawl controller
>              picks up the seeds from this table that have status=0 (Not Visited)
>              or status=2 (Visited, but ready for re-crawl) and feeds these
>              seeds in batch to different crawl engines that it knows about.
> 
> SCHEMA:      Column families below
> 
>          {"referer_id:", "100"}, // Integer here is Max_Length
>          {"url:","1500"},
>          {"site:","500"},
>          {"last_crawl_date:", "1000"},
>          {"next_crawl_date:", "1000"},
>          {"create_date:","100"},
>          {"status:","100"},
>          {"strike:", "100"},
>          {"language:","150"},
>          {"topic:","500"},
>          {"depth:","100000"}
> 
> 
> ======================
> Modified Schema & Analysis (Fuad Efendi):
> 
> My understanding is that we need to scan the whole table in order to find
> records where (for instance) "last_crawl_date" is "less than a specific
> point in time"... Additionally, the crawler should be polite, and the list
> of URLs to fetch should be evenly distributed across domains/hosts/IPs.
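> 
> (For context: without a secondary index, that kind of query is a filtered
> full scan. A sketch with the newer Java client, assuming "last_crawl_date"
> is stored as a sortable string in a family of that name with an empty
> qualifier, "urlTable" is an open Table, and the usual
> org.apache.hadoop.hbase.* imports are in place; the cutoff value is made up:)
> 
> Scan scan = new Scan();
> scan.setFilter(new SingleColumnValueFilter(
>         Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""),
>         CompareOperator.LESS, Bytes.toBytes("20080901000000")));  // illustrative cutoff
> try (ResultScanner results = urlTable.getScanner(scan)) {
>     for (Result r : results) {
>         // candidate for re-crawl; note that every region is still read
>     }
> }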
> 
> A few solutions for finding such records by "last_crawl_date" have been
> discussed briefly in blogs, on the mailing list, etc.:
> - have a scanner
> - have an additional Lucene index
> - have a MapReduce job (multithreaded, parallel) outputting the list of URLs
> 
> 
> My own possible solution; I need your feedback:
> ====================
> 
> Simplified schema with two tables (non-transactional):
> 
> 1. URL_TO_FETCH
>          {"internal_link_id" + "last_crawl_date", "1000"} 
> PRIMARY KEY  
> (sorted row_id),
>          {"url:","1500"},
> 
> 2. URL_CONTENT
>          {"url:","1500"}  PRIMARY KEY (sorted row_id),
>          {"site:","500"},
>          ... ... ...,
>          {"language:","150"},
>          {"topic:","500"},
>          {"depth:","100000"}
> 
> 
> Table URL_TO_FETCH is initially seeded with root domain names and a
> "dummy" last_crawl_date (a unique-per-host 'old' timestamp):
> 00000000000000000001  www.website1.com
> 00000000000000000002  www.website2.com
> 00000000000000000003  www.website3.com
> 00000000000000000004  www.website4.com
> ...
> 
> 
> After successful fetch of initial URLs:
> 00000000010000000001  www.website1.com/page1
> 00000000010000000002  www.website2.com/page1
> 00000000010000000003  www.website3.com/page1
> 00000000010000000004  www.website4.com/page1
> ...
> 00000000020000000001  www.website1.com/page2
> 00000000020000000002  www.website2.com/page2
> 00000000020000000003  www.website3.com/page2
> 00000000020000000004  www.website4.com/page2
> ...
> 00000000030000000001  www.website1.com/page3
> 00000000030000000002  www.website2.com/page3
> 00000000030000000003  www.website3.com/page3
> 00000000030000000004  www.website4.com/page3
> ...
> ...
> ...
> 0000000000xxxxxxxxxx  www.website1.com
> 0000000000xxxxxxxxxx  www.website2.com
> 0000000000xxxxxxxxxx  www.website3.com
> 0000000000xxxxxxxxxx  www.website4.com
> ...
> 
> (xxxxxxxxxx is the "current time in milliseconds" timestamp, set after a
> successful fetch)
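> 
> (Seeding is then just one Put per root host; the 20-digit keys below mirror
> the sample keys above, without guessing exactly how the two 10-digit halves
> split. "urlToFetch" is an open Table handle, imports as before:)
> 
> String[] seeds = { "www.website1.com", "www.website2.com", "www.website3.com" };
> for (int i = 0; i < seeds.length; i++) {
>     String rowKey = String.format("%020d", i + 1);  // matches the sample keys above
>     urlToFetch.put(new Put(Bytes.toBytes(rowKey))
>             .addColumn(Bytes.toBytes("url"), Bytes.toBytes(""), Bytes.toBytes(seeds[i])));
> }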
> 
> What we have: we don't need an additional Lucene index; we don't need a
> MapReduce job to populate the list of items to be fetched (the way it's
> done in Nutch); we don't need thousands of per-host scanners; we have a
> mutable primary key; all new records are inserted at the beginning of the
> table; fetched items are moved to the end of the table.
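> 
> (In other words, "what do I fetch next?" becomes one short scan from the
> start of the table that stops after a batch, instead of a full scan, a
> Lucene lookup, or a MapReduce pass. Sketch, same assumptions as above:)
> 
> Scan scan = new Scan();  // no start row: begins at the oldest / never-fetched rows
> int handedOut = 0;
> try (ResultScanner results = urlToFetch.getScanner(scan)) {
>     for (Result r : results) {
>         byte[] url = r.getValue(Bytes.toBytes("url"), Bytes.toBytes(""));
>         // hand "url" to a crawl engine here...
>         if (++handedOut >= 1000) break;  // batch size is arbitrary
>     }
> }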
> 
> The second (helper) table is indexed by URL:
>          {"url:","1500"}  PRIMARY KEY (sorted row_id),
>          ...
> 
> 
> Am I right? It looks cool that, at extremely low cost, I can maintain a
> specific "reordering" via a mutable primary key, following crawl-specific
> requirements...
> 
> Thanks,
> Fuad
> 
> 
> 
> 

