mahout-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: TU Berlin Winter of Code Project
Date Sun, 08 Nov 2009 00:23:06 GMT
Hi Max (& Ted),

On Nov 6, 2009, at 11:57am, Ted Dunning wrote:

> The question that I don't see addressed is whether you choose to use a
> fully streaming approach as is done in Bixo or whether you will use a
> document repository approach as is more common in most search engines.

I think the issue here isn't about streaming vs. document repository -
all systems have elements of both. It's just that...

a. Bixo exposes this more explicitly by focusing on the workflow
aspects of web mining.

But Nutch also has sequences of map-reduce tasks that are run during a  
crawl (e.g. filter URLs, group them, then fetch & parse).

b. Bixo doesn't have a baked-in URL database or file-system scheme
for saving content.

If you look at the example SimpleCrawlTool class in Bixo, you'll see
that it (like Nutch) uses a SequenceFile to store the URL state, and
sequence files in sub-directories for the fetched content & parse
results.

But Bixo just does the simple thing of propagating the URL state
forward into successive crawl directories, versus updating a single
URL database. Having a URL DB is what you'd want for large-scale web
crawling.
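
To make that concrete, here's a rough sketch (not actual Bixo code; the
field names and crawl-dir paths are made up) of reading URL state from
one loop's SequenceFile and carrying it forward into the next loop's
directory, using Cascading 1.x:

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.SequenceFile;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CrawlStateSketch {
    public static void main(String[] args) {
        Fields urlState = new Fields("url", "status", "lastFetched");

        // Read the URL state written by the previous crawl loop...
        Tap source = new Hfs(new SequenceFile(urlState), "crawl/loop-00001/urldb");

        // ...and write the (updated) state into the next loop's directory.
        Tap sink = new Hfs(new SequenceFile(urlState), "crawl/loop-00002/urldb", SinkMode.REPLACE);

        // A real workflow would splice fetch/parse/update pipes in here;
        // this identity pipe just carries the state forward.
        Pipe pipe = new Pipe("propagate-url-state");

        Flow flow = new FlowConnector(new Properties()).connect(source, sink, pipe);
        flow.complete();
    }
}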

If you wanted to configure Bixo to use HBase to store the URL state
and fetched/parsed content, you'd use an HBase tap (in
Cascading-speak) versus the Hfs tap.
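
Something like this, assuming the HBaseTap/HBaseScheme classes from the
cascading.hbase module (the exact constructors, table name, and column
layout here are guesses on my part):

import cascading.hbase.HBaseScheme;  // assumed: from the cascading.hbase module
import cascading.hbase.HBaseTap;     // assumed: from the cascading.hbase module
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class HBaseTapSketch {
    public static Tap makeUrlStateTap() {
        // Row key is the URL; a single column family holds the crawl state.
        // Table name, family name, and fields are illustrative only.
        Fields keyField = new Fields("url");
        Fields stateFields = new Fields("status", "lastFetched");
        return new HBaseTap("urldb", new HBaseScheme(keyField, "state", stateFields));
    }
}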

> HBase is reputedly ready enough to serve as a document repository.
> Using such an approach would be very helpful for the incremental
> nature of web crawls.

I'd gotten the same input from Andrew Purtell, who's been able to
stream lots of crawl data into HBase after a bit of fiddling with
configuration settings and some patching on the writer side of things.

As for pre-processing and feature extraction, both could be
implemented as Cascading operations (which wind up mapping to Hadoop
tasks).
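
E.g. a feature extractor is just a custom Cascading Function. This is a
toy sketch (the field names and the tokenizing "extraction" are purely
illustrative; real feature extraction would do much more):

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class ExtractFeatures extends BaseOperation implements Function {

    public ExtractFeatures() {
        // One incoming argument (the page text), one outgoing "feature" field.
        super(1, new Fields("feature"));
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        String content = functionCall.getArguments().getString(0);

        // Toy stand-in for real NLP: emit lower-cased word tokens as features.
        for (String token : content.toLowerCase().split("\\W+")) {
            if (token.length() > 3) {
                functionCall.getOutputCollector().add(new Tuple(token));
            }
        }
    }
}

You'd wire it into the flow with something like:

  new Each(parsePipe, new Fields("content"), new ExtractFeatures(), Fields.ALL)

where parsePipe is whatever pipe carries the parsed page text, and
Cascading plans the Each into Hadoop tasks for you.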

As Ted noted, actually doing the named entity extraction and feature  
extraction will be the real challenge.

See this talk for an example of doing web mining using Bixo - http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit

-- Ken


> On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> This is obviously only a first draft of what we think would be a
>> suited overall architecture
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




