hadoop-common-user mailing list archives

From Oliver Krohne <oliver.kro...@yieldkit.com>
Subject Re: execute millions of "grep"
Date Thu, 03 Nov 2011 11:20:43 GMT
Hi Eugene,

thanks for the quick reply.

Not only the queries are changing, but also the documents:

a) The long-term goal is to be able to process a document, or a batch of documents, after a certain
event, which is what I mean by "realtime". There will be different types of documents, updated
at different intervals: one set of documents will be updated every 1-2 hours, other documents
once a week. An event means that the queries have to run against the set of documents that
was updated. So I will probably need to run the queries every hour against one subset, and
once a week the remaining documents need to be searched.

b) The queries change a couple of times a day. At this stage we have around 100k queries,
but the number will grow every day. A large share of the queries remains the same, but I expect
that after 3 months about 10% of the queries will change every day: either they are deleted or
new ones are added.

One of the main questions I am still thinking about is what to search in: search for the
queries in the documents, or search for the document terms in the queries.
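One common answer to the "search the queries in the documents" direction is a multi-pattern string matcher such as Aho-Corasick, which scans each document once no matter how many queries are loaded. Below is a minimal sketch, not anything from this thread: the phrase set is invented, and matching is character-level, so real use would still need tokenization and word-boundary checks for 1-4 term phrase queries.

```python
from collections import deque

def build_automaton(phrases):
    """Build a minimal Aho-Corasick automaton over the given phrases."""
    trie = [{"next": {}, "fail": 0, "out": []}]  # node 0 is the root
    for p in phrases:
        node = 0
        for ch in p:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(p)
    # BFS from the root to set failure links and merge outputs.
    q = deque(trie[0]["next"].values())
    while q:
        u = q.popleft()
        for ch, v in trie[u]["next"].items():
            q.append(v)
            f = trie[u]["fail"]
            while f != 0 and ch not in trie[f]["next"]:
                f = trie[f]["fail"]
            trie[v]["fail"] = trie[f]["next"].get(ch, 0) if u != 0 else 0
            trie[v]["out"] += trie[trie[v]["fail"]]["out"]
    return trie

def find_phrases(trie, text):
    """Return the set of phrases that occur anywhere in text (one pass)."""
    hits, node = set(), 0
    for ch in text:
        while node != 0 and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        hits.update(trie[node]["out"])
    return hits
```

With this shape, adding or removing 10% of the queries per day only means rebuilding the automaton, while each document is still scanned in a single pass.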

I will at first need exact matching of a phrase query, and in a next step phrase matching given a

Thanks for your help,

On 03.11.2011 at 11:52, Eugene Kirpichov wrote:

> If you really need to run millions of exact text queries against millions of
> documents in realtime, a simple grep is not going to be sufficient for you.
> You'll need smarter data structures and algorithms.
> Please specify how frequently the set of *queries* changes and what you
> consider "real time".
> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <oliver.krohne@yieldkit.com>wrote:
>> Hi,
>> I am evaluating different solutions for massive phrase query execution. I
>> need to execute millions of greps, or more precisely, phrase queries consisting
>> of 1-4 terms, against millions of documents. I saw the Hadoop grep example,
>> but it executes grep with a single regex.
>> I also saw the "side data distribution" / "distributed cache" facility of
>> Hadoop. I could use it to pass the queries to the mapper and execute each query
>> against the input line. The input line would be the entire text of a document
>> (usually 50-500 words).
>> As I am aiming to have this information almost in realtime, another
>> question arises about ad hoc map/reduce jobs. Is there a limit on running many
>> jobs in parallel, say, if I fired a new job whenever a new document arrives?
>> Such a job would only process that particular document. Alternatively, I
>> could batch 100-1000 documents and then fire the job.
>> Can anyone advise an approach of doing it with hadoop?
>> Thanks in advance,
>> Oliver
> -- 
> Eugene Kirpichov
> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
> Editor, http://fprog.ru/
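The distributed-cache idea from the quoted mail, i.e. shipping the query list to every mapper and checking each input line against it, can be sketched as a Hadoop Streaming mapper. This is a hypothetical sketch only: the file name `queries.txt` (assumed to be shipped to each task via the distributed cache / `-files`), the `doc_id<TAB>text` input format, and the naive per-query substring scan are all assumptions, not code from this thread.

```python
#!/usr/bin/env python
"""Sketch of a Hadoop Streaming mapper that matches phrase queries.

Assumes ``queries.txt`` is available in the task's working directory
(shipped via the distributed cache) and that every input line on stdin
has the form "doc_id<TAB>full document text".
"""
import sys

def load_queries(path="queries.txt"):
    """Read one lowercased phrase query per line, skipping blanks."""
    with open(path) as f:
        return [q.strip().lower() for q in f if q.strip()]

def match_line(line, queries):
    """Return (query, doc_id) pairs for every query found in the line."""
    doc_id, _, text = line.rstrip("\n").partition("\t")
    text = text.lower()
    # Naive O(num_queries * text_length) scan; fine for a sketch, but a
    # multi-pattern matcher scales far better at 100k+ queries.
    return [(q, doc_id) for q in queries if q in text]

if __name__ == "__main__":
    queries = load_queries()
    for line in sys.stdin:
        for q, doc_id in match_line(line, queries):
            print("%s\t%s" % (q, doc_id))  # emit query -> matching doc_id
```

The reducer would then collect the doc_ids per query. Note the inner loop is the part Eugene's reply warns about: scanning every query against every document is exactly what a smarter data structure would replace.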


Oliver Krohne
Founder & CTO

YieldKit UG (haftungsbeschränkt)
Mittelweg 161
20148 Hamburg

T +49 40 209 349 771
F +49 40 209 349 779
E oliver.krohne@yieldkit.com


Sitz der Gesellschaft: Hamburg
Geschäftsführer: Sandra Tiemann, Oliver Krohne
Handelsregister: Amtsgericht Hamburg HRB 109104
