hbase-user mailing list archives

From Doug Meil <doug.m...@explorysmedical.com>
Subject Re: Basic -Hbase table question
Date Wed, 18 Apr 2012 19:38:19 GMT

Hi there-

Because your topic is webcrawling, you might want to read the BigTable
paper; its running example is a "webtable" built from crawled pages.
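
For context, the webtable example in the BigTable paper stores URLs with the hostname components reversed (e.g. `com.cnn.www`) so that pages from the same domain sort next to each other. A minimal Java sketch of that row-key idea; the class and method names here are mine, not from the paper or HBase:

```java
import java.util.Arrays;
import java.util.Collections;

public class RowKeys {
    // Reverse the hostname components of a URL so that pages from the
    // same domain sort adjacently in the table (BigTable webtable style).
    // e.g. "www.cnn.com/index.html" -> "com.cnn.www/index.html"
    static String reversedRowKey(String url) {
        int slash = url.indexOf('/');
        String host = slash < 0 ? url : url.substring(0, slash);
        String path = slash < 0 ? "" : url.substring(slash);
        String[] parts = host.split("\\.");
        // Arrays.asList is backed by the array, so reversing the list
        // reverses the array in place.
        Collections.reverse(Arrays.asList(parts));
        return String.join(".", parts) + path;
    }

    public static void main(String[] args) {
        System.out.println(reversedRowKey("www.cnn.com/index.html"));
    }
}
```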

You can find that, and other info, in the RefGuide...


On 4/18/12 2:08 PM, "petri koski" <monzuun@gmail.com> wrote:

>I am quite new to HBase, and here comes my question:
>I have a table. What I do with Hadoop is download webpages in the Map
>phase, extract the URLs found, and save them in the Reduce phase. I read from
>the table, and I save them (put) back to the same table to avoid duplicates etc.
>I will get millions of rows, all unique. Quite often timestamps are
>reset because duplicates are found.
>The question is: how should I keep running those M/R jobs?
>1. Somehow save the last Map's row position and pass that info to the next
>Map so it starts from there; this way I wouldn't have to reprocess rows
>already processed. Of course I have to spider sites all over again after
>they are finished, but this option would give me some control over when a
>site is finished.
>2. Every time start from row 0, proceed to the last one, and start all over
>again, going a little deeper into the site you are "spidering".
>Option 2 is good because many sites put their newest info first,
>so that way I could keep my own data updated from those sites; the
>flipside is that I don't know when a site is fully crawled.
>Option 1 seemed wise, but there is something un-HBase- and un-Hadoop-like
>about it: they are meant to take everything in at once, process it at once,
>and, in case you need more, chain M/R jobs. So my option 2 is more the
>Hadoop/HBase way. And like I said before, I will not just spider a site
>once and forget it; I will do it again after I have finished once etc.
>Which one is better?
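
If you do pursue option 1 from the quoted message, the usual trick for resuming an HBase scan after a saved row is to start the next scan at the smallest key strictly greater than the last row seen: the last row key with a 0x00 byte appended, passed to `Scan.setStartRow()`. A minimal Java sketch of just that key computation (the class and method names are mine, not an HBase API):

```java
import java.util.Arrays;

public class ScanResume {
    // Smallest possible row key strictly after lastRow in HBase's
    // lexicographic byte ordering: lastRow with a 0x00 byte appended.
    // Passing this to Scan.setStartRow() resumes a scan without
    // re-reading the last processed row.
    static byte[] nextStartRow(byte[] lastRow) {
        // Arrays.copyOf zero-fills the extra slot, which is exactly
        // the trailing 0x00 byte we need.
        return Arrays.copyOf(lastRow, lastRow.length + 1);
    }

    public static void main(String[] args) {
        byte[] next = nextStartRow("com.cnn.www".getBytes());
        System.out.println(next.length); // one byte longer than the input
    }
}
```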
