From edward yoon <webmas...@udanax.org>
Subject RE: Possible hadoop application
Date Fri, 21 Dec 2007 21:14:53 GMT

>> ...Documents are indexed for searching.
>> query terms for ...

I assume an inverted index will be used for your data mining application.
In that case, I would recommend surveying map/reduce. (The Hadoop examples are a great place to start.)
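
As a rough illustration of the map/reduce idea, here is a minimal in-process sketch in Python of building an inverted index with term positions. It is not Hadoop code (a real job would be written against the Hadoop API). The names map_doc, reduce_term and build_inverted_index and the sample documents are made up for illustration, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit (term, (doc_id, position)) for every term in one document."""
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)

def reduce_term(term, postings):
    """Reduce step: return the sorted posting list for one term."""
    return term, sorted(postings)

def build_inverted_index(docs):
    """Simulate the shuffle in-process: group mapper output by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_term(term, postings) for term, postings in grouped.items())

# Hypothetical sample documents, keyed by document ID.
docs = {5: "McDonalds opens in Mexico", 22: "Wal Mart expands in Mexico"}
index = build_inverted_index(docs)
print(index["mexico"])
```

Keeping positions in the posting list is a design choice here: it is what would later allow "within N terms" checks without rereading the documents.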

Further topics to look into:
data mining, document classification/categorization, social network analysis, etc.


Best regards,

Edward yoon @ NHN, corp.
Home : http://www.udanax.org

> Date: Fri, 21 Dec 2007 10:50:42 -0500
> From: kcorby@pf-cvl.net
> To: hadoop-user@lucene.apache.org
> Subject: Possible hadoop application
> Hello,
> I am just looking into Hadoop for a possible application and was hoping
> to get some feedback about whether it is a good fit and how to structure
> it. Basically my application works like this:
> 1. Documents arrive, maybe as part of a web crawl or something like that.
> 2. Documents are indexed for searching.
> 3. Documents have special fields extracted and stored, for instance all
> country names might be extracted as a COUNTRY field, dates as a DATE
> field, IP addresses as an IP field, etc.
> 4. Users run queries against the index to find matching documents.
> 5. Users run jobs that process some combination of the extracted field
> values and query terms for a (possibly large) number of documents to
> find patterns, relationships, etc.
> An example of #5 might be:
> Find all business-country relationships that exist in this set of
> document IDs where the previously extracted country name is within 20
> terms of a term matching a query of business names (not previously
> extracted or tagged): (McDonalds OR "Burger King" OR "Taco Bell" OR
> "Wal Mart" ...)
> The output would be something like:
> McDonald's - Mexico => Documents 5, 76, 100
> Wal Mart - Mexico => Documents 5, 22
> Wal Mart - United States => Documents 22, 43, 100, 101
> I work on an existing application that functions similarly to this. We
> are currently using Lucene for the search index and it functions fairly
> well, but it is difficult to scale #5 to a large number of users or
> documents and have it run in a reasonably responsive way.
> It seems that Hadoop might be a nice fit for this in a few places:
> 1) Indexing
> 2) Extraction of field values
> 3) Running of jobs to process field values / query terms
> I am especially interested in #3, but I'm not quite sure how it would
> work. How would the extracted values be stored for quick lookup by
> document ID and processing? Given that hadoop is read only, would I be
> forced to have many small files as new documents are added and
> processed, or can the new extractions be somehow combined with the old
> ones on the distributed file system?
> And would it be possible to use hadoop to dig the matching query terms
> out of the documents, since that can also be slow?
> Thanks for any feedback.
> - Kevin
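
The job described in #5 of the quoted message could be phrased as a map step that emits (business, country) pairs per document and a reduce step that collects document IDs. Below is a minimal in-process Python sketch under several assumptions: the record layout (a "text" field plus a "countries" list of (name, term position) pairs) is hypothetical, distance is measured between the starting positions of the two matches, and tokenization is whitespace-only with no punctuation handling. The function names are made up for illustration and this is not Hadoop code.

```python
from collections import defaultdict

WINDOW = 20  # maximum distance in terms, taken from the example in the question

def find_phrase(tokens, phrase):
    """Return the start positions where the words of the phrase occur in order."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def map_pairs(doc_id, record, businesses):
    """Map step: emit ((business, country), doc_id) for each pair within WINDOW terms."""
    tokens = record["text"].lower().split()
    for business in businesses:
        for b_pos in find_phrase(tokens, business):
            for country, c_pos in record["countries"]:
                if abs(b_pos - c_pos) <= WINDOW:
                    yield (business, country), doc_id

def run_job(docs, businesses):
    """Group mapper output by (business, country), then reduce to sorted unique doc IDs."""
    grouped = defaultdict(set)
    for doc_id, record in docs.items():
        for key, value in map_pairs(doc_id, record, businesses):
            grouped[key].add(value)
    return {key: sorted(ids) for key, ids in grouped.items()}

# Hypothetical records: text plus previously extracted COUNTRY values with positions.
docs = {
    5: {"text": "McDonalds and Wal Mart both opened stores in Mexico",
        "countries": [("Mexico", 8)]},
    22: {"text": "Wal Mart reported growth in Mexico and the United States",
         "countries": [("Mexico", 5), ("United States", 8)]},
}
result = run_job(docs, ["McDonalds", "Wal Mart"])
for (business, country), ids in sorted(result.items()):
    print(business, "-", country, "=> Documents", ids)
```

Because each document is processed independently in the map step, this shape of job should spread across many machines; how the extracted values are stored and fed to the mappers is left open here.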

