hadoop-common-user mailing list archives

From Toby DiPasquale <codeslin...@gmail.com>
Subject Re: Possible hadoop application
Date Fri, 21 Dec 2007 17:09:37 GMT
You might want to look at CouchDB for this. It is stronger on the  
query side of things right now and has a similar model.

--
Toby DiPasquale
Software Assassin

On Dec 21, 2007, at 10:50, Kevin Corby <kcorby@pf-cvl.net> wrote:

> Hello,
>
> I am just looking into Hadoop for a possible application and was  
> hoping to get some feedback about whether it is a good fit and how  
> to structure it. Basically my application works like this:
> 1. Documents arrive, maybe as part of a web crawl or something like  
> that.
> 2. Documents are indexed for searching.
> 3. Documents have special fields extracted and stored, for instance  
> all country names might be extracted as a COUNTRY field, dates as a  
> DATE field, IP addresses as an IP field, etc.
> 4. Users run queries against the index to find matching documents.
> 5. Users run jobs that process some combination of the extracted  
> field values and query terms for a (possibly large) number of  
> documents to find patterns, relationships, etc.
>
> An example of #5 might be:
> Find all business-country relationships that exist in this set of  
> document IDs where the previously extracted country name is within  
> 20 terms of a term matching a query of business names (not  
> previously extracted or tagged):  (McDonalds OR "Burger King" OR  
> "Taco Bell" OR "Wal Mart" ...)
>
> The output would be something like:
> McDonald's - Mexico => Documents 5, 76, 100
> Wal Mart - Mexico => Documents 5, 22
> Wal Mart - United States => Documents 22, 43, 100, 101
>
> I work on an existing application that functions similarly to this.  
> We are currently using Lucene for the search index and it functions  
> fairly well, but it is difficult to scale #5 to a large number of  
> users or documents and have it run in a reasonably responsive way.
>
> It seems that Hadoop might be a nice fit for this in a few places:
> 1) Indexing
> 2) Extraction of field values
> 3) Running of jobs to process field values / query terms
>
> I am especially interested in #3, but I'm not quite sure how it  
> would work. How would the extracted values be stored for quick  
> lookup by document ID and processing? Given that HDFS files are  
> write-once, would I be forced to accumulate many small files as new  
> documents arrive and are processed, or can new extractions somehow  
> be merged with the old ones on the distributed file system?
>
> And would it be possible to use hadoop to dig the matching query  
> terms out of the documents, since that can also be slow?
>
> Thanks for any feedback.
>
> - Kevin
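
The step-5 job above maps naturally onto MapReduce: a mapper scans each
document for business-name matches within the proximity window of an
extracted COUNTRY value and emits (business, country) -> doc-id pairs, and a
reducer collects the doc ids per pair. Here is a minimal local sketch in
Python of that logic; the names BUSINESSES, COUNTRIES, and WINDOW are
illustrative assumptions, and the single-token matching is a simplification
(multi-word names like "Burger King" would need phrase matching). This is not
Hadoop's actual Java API, just the shape of the map and reduce steps:

```python
from collections import defaultdict

# Illustrative inputs, not part of any Hadoop API:
BUSINESSES = {"mcdonalds"}            # terms from the business-name query
COUNTRIES = {"mexico"}                # previously extracted COUNTRY values
WINDOW = 20                           # "within 20 terms"

def map_doc(doc_id, tokens):
    """Mapper: emit ((business, country), doc_id) for each proximity match."""
    biz_pos = [(i, t) for i, t in enumerate(tokens) if t in BUSINESSES]
    ctry_pos = [(i, t) for i, t in enumerate(tokens) if t in COUNTRIES]
    for bi, b in biz_pos:
        for ci, c in ctry_pos:
            if abs(bi - ci) <= WINDOW:
                yield (b, c), doc_id

def reduce_pairs(emitted):
    """Reducer: group doc ids under each (business, country) key."""
    out = defaultdict(set)
    for key, doc_id in emitted:
        out[key].add(doc_id)
    return {k: sorted(v) for k, v in out.items()}
```

On a real cluster the two functions would become the map and reduce phases of
a Hadoop job, with the extracted field values read from files keyed by
document ID rather than hardcoded, and the reducer output would be exactly
the "business - country => documents" lines in the example.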
