hbase-user mailing list archives

From Bryan Duxbury <br...@rapleaf.com>
Subject Re: Row-key in HBase
Date Mon, 28 Apr 2008 14:08:47 GMT
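Yes, MD5-hashing your URLs will scatter the rows across the key space,
so pages from the same domain will no longer be stored next to each
other. Do you need to access pages by the MD5 of the URL? If so, it's
unlikely that you also need to access them by domain.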
-Bryan

On Apr 28, 2008, at 4:01 AM, Goel, Ankur wrote:

> Hi folks,
> I am using an HBase table to store my crawled data, with the MD5
> signature of the canonicalized URL as the row key. The Bigtable paper
> suggests choosing row keys so that URLs from the same domain are
> stored close to each other and domain analysis can be carried out
> efficiently.
> For example, the page maps.google.com/index.html would be stored
> under the row key com.google.maps/index.html.
>
> My question is: will using the MD5 signature of the canonicalized URL
> hurt data locality for URLs from the same domain?
>
> Thanks
> -Ankur
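
For illustration, here is a minimal Python sketch of the two row-key
schemes discussed in this thread. It is not HBase client code: the
helper names reversed_domain_key and md5_key and the sample URLs are
assumptions made up for this example, and the canonicalization is
deliberately naive (it drops ports and query strings). It shows that
reversed-domain keys sort pages of one domain next to each other,
while MD5 keys carry no trace of the domain.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_domain_key(url):
    """Build a Bigtable-style row key: reversed hostname followed by the path."""
    # urlsplit only recognizes the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + (parts.path or "/")


def md5_key(url):
    """Build a row key from the hex MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Hypothetical sample URLs, used only to demonstrate the ordering.
urls = [
    "maps.google.com/index.html",
    "example.org/about.html",
    "mail.google.com/inbox",
    "www.google.com/",
]

# Row keys are kept in sorted order, so sorting mimics the on-disk layout.
by_domain = sorted(reversed_domain_key(u) for u in urls)
by_hash = sorted(md5_key(u) for u in urls)

# With reversed-domain keys, one prefix selects a whole domain.
google_rows = [k for k in by_domain if k.startswith("com.google.")]

for key in by_domain:
    print(key)
```

With the reversed-domain keys, all google.com pages form one contiguous
run that a prefix scan can read in order. With the MD5 keys, the same
pages land wherever their digests happen to fall, so a per-domain
analysis would have to read the whole table.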

