hadoop-common-user mailing list archives

From Oliver Krohne <oliver.kro...@yieldkit.com>
Subject execute millions of "grep"
Date Thu, 03 Nov 2011 10:46:42 GMT

I am evaluating different solutions for massive phrase query execution. I need to execute
millions of greps, or more precisely phrase queries consisting of 1-4 terms, against millions
of documents. I saw the Hadoop grep example, but it executes grep with only one regex.

I also saw the "Side data distribution" / "Distributed Cache" feature of Hadoop. So I
could pass the queries to the mapper and execute each query against the input line. The input
line would be the entire text of a document (usually 50-500 words).
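
The mapper-side matching described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the Hadoop grep example and not Hadoop API code. The names `tokenize`, `load_queries`, and `match_phrases` are hypothetical, and the sketch assumes whitespace tokenization with case-insensitive matching. Instead of running each query against the line in turn, it keeps the queries in a set and looks up every 1-4 term window of the document, so the work per document grows with the document length rather than with the number of queries.

```python
def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def load_queries(lines):
    """Build a set of phrase queries, each stored as a tuple of terms.

    In a real job these lines would come from the side data file
    shipped to each mapper; here they are just an iterable of strings.
    """
    queries = set()
    for line in lines:
        terms = tuple(tokenize(line))
        if terms:
            queries.add(terms)
    return queries


def match_phrases(document, queries, max_terms=4):
    """Return the sorted phrase queries that occur in the document."""
    tokens = tokenize(document)
    found = set()
    for start in range(len(tokens)):
        for length in range(1, max_terms + 1):
            window = tuple(tokens[start:start + length])
            if len(window) < length:
                break  # ran past the end of the document
            if window in queries:
                found.add(" ".join(window))
    return sorted(found)
```

With documents of 50-500 words this is at most a few thousand set lookups per document. Punctuation handling and stemming are left out and would need to match however the queries are normalized.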

As I am aiming to have this information almost in real time, another question arises about
ad hoc map/reduce jobs. Is there a limit on running a lot of jobs in parallel, let's say if
I fired a new job each time a new document arrives? That job would process only that particular
document. Alternatively, I could batch 100-1000 documents and then fire the job.
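
The batching alternative could look something like the sketch below. It is only an illustration under stated assumptions: `DocumentBatcher` is a hypothetical helper, and `submit_job` is a placeholder callback standing in for whatever actually launches the job, not a Hadoop API.

```python
class DocumentBatcher:
    """Buffer incoming documents and hand them off in fixed-size batches."""

    def __init__(self, batch_size, submit_job):
        self.batch_size = batch_size
        self.submit_job = submit_job  # hypothetical callback that launches one job
        self.pending = []

    def add(self, document):
        """Queue a document; submit a batch once enough have accumulated."""
        self.pending.append(document)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Submit whatever is queued, e.g. on a timer to bound latency."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.submit_job(batch)
```

Calling `flush()` on a timer as well as on the size threshold would keep a slow trickle of documents from waiting indefinitely for a full batch.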

Can anyone advise on an approach for doing this with Hadoop?

Thanks in advance,
