hadoop-zookeeper-user mailing list archives

From Thomas Koch <tho...@koch.ro>
Subject feed queue fetcher with hadoop/zookeeper/gearman?
Date Mon, 12 Apr 2010 08:21:26 GMT
Hi,

I'd like to implement a feed loader with Hadoop and most likely HBase. I've got
around 1 million feeds that should be loaded and checked for new entries.
However, the feeds have different priorities, based on their average update
frequency in the past and on their relevance.
The feeds (url, last_fetched timestamp, priority) are stored in HBase. How
could I implement the fetch queue for the loaders?
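One idea I've been toying with (just a sketch; the table name "feeds", the
column family "f" and the row key layout are only assumptions): encode a
priority bucket and the due timestamp in the row key, so that a plain scan
returns the feeds that are due, highest-priority bucket first.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DueFeedScan {
  // Assumed row key layout: <priority bucket 0-9><zero-padded due timestamp><url>
  // A scan from "<bucket>0000000000000" to "<bucket><now>" then returns only
  // the feeds whose next fetch is due, bucket by bucket.
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "feeds");            // table name is an assumption
    long now = System.currentTimeMillis();
    for (int bucket = 0; bucket <= 9; bucket++) {        // 0 = highest priority
      byte[] start = Bytes.toBytes(String.format("%d%013d", bucket, 0L));
      byte[] stop  = Bytes.toBytes(String.format("%d%013d", bucket, now));
      Scan scan = new Scan(start, stop);
      scan.addFamily(Bytes.toBytes("f"));                // column family is an assumption
      ResultScanner scanner = table.getScanner(scan);
      for (Result r : scanner) {
        String url = Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("url")));
        // hand url over to a fetcher / put it on whatever queue is chosen below
      }
      scanner.close();
    }
    table.close();
  }
}

The zero-padded decimal timestamp keeps the lexicographic row order identical
to the chronological order, which is what makes the range scan work.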

- An hourly map-reduce job that produces a new queue for each node and saves it
on that node?
  - But how do I know which feeds have already been fetched in the last hour?
    (one possible answer is sketched below)
  - What happens if a fetch node dies?
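For the first question, each fetcher could write the fetch time back to HBase
right after a successful fetch, so the next hourly job sees an up-to-date
last_fetched. A minimal sketch, again assuming the "feeds"/"f" layout above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MarkFetched {
  // Record last_fetched immediately after a successful fetch, so the next
  // queue-building job can skip this feed even if it runs shortly afterwards.
  public static void markFetched(byte[] rowKey) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "feeds");            // assumed table name
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("f"), Bytes.toBytes("last_fetched"),
            Bytes.toBytes(System.currentTimeMillis()));
    table.put(put);
    table.close();
  }
}

That still leaves the second question open: a per-node queue file dies with
its node unless it is stored somewhere replicated.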

- Store a fetch queue in ZooKeeper and add to it with a map-reduce job each
hour?
  - Isn't that too much load for ZooKeeper? (I could put a whole batch of URLs
    into one znode; see the sketch right after this list.)
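The batching could follow the usual ZooKeeper queue recipe, just with one
sequential znode per batch of URLs instead of one znode per URL. A rough
sketch only: the /feed-queue path and the newline-joined batch format are made
up, and a batch taken by a fetcher that dies right afterwards would be lost,
which is exactly the fault-tolerance worry from above.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BatchedFeedQueue {
  private final ZooKeeper zk;

  public BatchedFeedQueue(String connectString) throws Exception {
    // no-op watcher; a real client would react to session events
    this.zk = new ZooKeeper(connectString, 30000, new Watcher() {
      public void process(WatchedEvent event) { }
    });
  }

  // Producer (the hourly job): one znode per batch keeps both the znode count
  // and the per-znode data size small.
  public void enqueueBatch(List<String> urls) throws Exception {
    byte[] data = String.join("\n", urls).getBytes("UTF-8");
    zk.create("/feed-queue/batch-", data,
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
  }

  // Consumer (a fetch node): take the lowest-numbered batch; the delete is the
  // "claim", so two fetchers cannot end up with the same batch.
  public List<String> dequeueBatch() throws Exception {
    while (true) {
      List<String> children = zk.getChildren("/feed-queue", false);
      if (children.isEmpty()) return Collections.emptyList();
      Collections.sort(children);
      String first = "/feed-queue/" + children.get(0);
      try {
        byte[] data = zk.getData(first, false, null);
        zk.delete(first, -1);               // fails if another fetcher got here first
        return java.util.Arrays.asList(new String(data, "UTF-8").split("\n"));
      } catch (KeeperException.NoNodeException raced) {
        // another fetcher claimed this batch in the meantime; try the next one
      }
    }
  }
}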

- Use Gearman [1] to store the fetch queue?
  - But the Gearman job server still seems to be a single point of failure (SPOF).

[1] http://gearman.org

Thank you!

Thomas Koch, http://www.koch.ro
