hadoop-common-user mailing list archives

From Eugene Kirpichov <ekirpic...@gmail.com>
Subject Re: execute millions of "grep"
Date Thu, 03 Nov 2011 10:52:38 GMT
If you really need to do millions of exact text queries against millions of
documents in real time, a simple grep is not going to be sufficient for you.
You'll need smarter data structures and algorithms.
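
For exact 1-4 term phrases, one option is to index the queries instead of the
documents: keep the normalized phrases in a hash set (or a trie / Aho-Corasick
automaton) and look up every 1-4 token shingle of each document, so the
per-document cost no longer depends on the number of queries. A minimal Java
sketch of that idea (class name and normalization are illustrative, not taken
from any particular library):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Matches a document against a large set of exact 1-4 term phrase queries.
 * Instead of scanning the document once per query (as grep would), the
 * document's own 1-4 token shingles are looked up in a hash set of queries,
 * so the cost per document is independent of the number of queries.
 */
public class PhraseMatcher {

    private static final int MAX_PHRASE_LEN = 4;

    private final Set<String> queries;   // normalized phrase queries

    public PhraseMatcher(Iterable<String> phraseQueries) {
        this.queries = new HashSet<String>();
        for (String q : phraseQueries) {
            queries.add(normalize(q));
        }
    }

    /** Returns the subset of queries that occur verbatim in the document. */
    public Set<String> match(String document) {
        Set<String> hits = new HashSet<String>();
        String[] tokens = normalize(document).split(" ");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder shingle = new StringBuilder();
            for (int len = 1; len <= MAX_PHRASE_LEN && start + len <= tokens.length; len++) {
                if (len > 1) {
                    shingle.append(' ');
                }
                shingle.append(tokens[start + len - 1]);
                String candidate = shingle.toString();
                if (queries.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    // Very crude normalization: lower-case and collapse non-alphanumerics to spaces.
    private static String normalize(String text) {
        return text.toLowerCase().replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
    }

    public static void main(String[] args) {
        List<String> queries = Arrays.asList("side data distribution", "distributed cache");
        PhraseMatcher matcher = new PhraseMatcher(queries);
        System.out.println(matcher.match("I also saw the Distributed Cache feature of Hadoop."));
        // prints [distributed cache]
    }
}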

Please specify how frequently the set of *queries* changes and what you
consider "real time".

On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <oliver.krohne@yieldkit.com> wrote:

> Hi,
>
> I'm evaluating different solutions for massive phrase query execution. I
> need to execute millions of greps, or more precisely phrase queries
> consisting of 1-4 terms, against millions of documents. I saw the Hadoop
> grep example, but it executes grep with a single regex.
>
> I also saw Hadoop's "side data distribution" / "distributed cache"
> feature, so I could pass the queries to the mapper and execute each query
> against the input line. The input line would be the entire text of a
> document (usually 50-500 words).
>
> As I am aiming to have this information almost in real time, another
> question arises about ad-hoc map/reduce jobs. Is there a limit on running
> many jobs in parallel, say if I fired a new job whenever a new document
> arrives? That job would only process that particular document.
> Alternatively, I could batch 100-1000 documents and then fire the job.
>
> Can anyone advise on an approach to doing this with Hadoop?
>
> Thanks in advance,
> Oliver
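
Regarding the distributed-cache idea in the quoted message: a rough sketch of
what such a mapper could look like against the Hadoop 0.20/1.x mapreduce API,
reusing the PhraseMatcher sketch above. The phrase file in the cache, the
one-document-per-line input format, and using the byte offset as a document id
are assumptions for illustration only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Loads the phrase queries from the distributed cache once per task (setup),
 * then matches every input document against all of them. Assumes one document
 * per input line and emits (matched phrase, byte offset standing in for a
 * document id).
 */
public class PhraseMatchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private PhraseMatcher matcher;   // the in-memory phrase set from the sketch above

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The driver is assumed to have registered the phrase file with
        // DistributedCache.addCacheFile(...); here we read the first cached file.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        List<String> phrases = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    phrases.add(line.trim());
                }
            }
        } finally {
            reader.close();
        }
        matcher = new PhraseMatcher(phrases);
    }

    @Override
    protected void map(LongWritable offset, Text document, Context context)
            throws IOException, InterruptedException {
        for (String phrase : matcher.match(document.toString())) {
            context.write(new Text(phrase), offset);
        }
    }
}

Note that each MapReduce job carries noticeable startup overhead, so firing a
separate job per incoming document is unlikely to feel like real time;
batching documents into fewer, larger jobs is the more common pattern.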


-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
