hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eugene Kirpichov <ekirpic...@gmail.com>
Subject Re: execute millions of "grep"
Date Thu, 03 Nov 2011 11:37:10 GMT
Hi Oliver,
I have solved a similar problem before, and it seems to me that the best solution is to build
an automaton (with whole words, not letters on edges) on the set of queries (basically, a
"trie") that will allow you to find all queries matching a document by a single pass over
the document.
For 100k queries, the true will be very fast and very easy to build, and it will not consume
much memory.
The document scan is as follows for each word:
 - add root of trie to "active nodes" set
 - for each node in the active set, replace the node with its child corresponding to the scanned
word.

Then whenever you add a leaf node to the active set, you are seeing a match of some query
that ends at the current word.

Please tell if this description needs more detail.

03.11.2011, в 15:20, Oliver Krohne <oliver.krohne@yieldkit.com> написал(а):

> Hi Eugene,
> 
> thanks for the quick reply.
> 
> Not only the queries are changing but also the documents:
> 
> a)The long time goal would be able to process a document or a batch of documents after
a certain event which a mean with "realtime". There will be different types of documents which
are updated differently. One set documents will be updated every 1-2 hours. Other documents
will be updated once a week. So an event means that the queries have to run against that set
documents which are updated. So probably I need to run the queries every hour against against
a subset and one a week the other documents need to be searched.
> 
> b)The queries are changing a couple of times a day. At this stage we have around 100k
queries but the will grow every day. A huge set of the queries remain the same but I expect
that after 3 month 10% of the queries will change every day. Either they are deleted or new
ones are added. 
> 
> One of the main question am still thinking of is where to search in. Search the queries
in the documents or search the document terms in the queries. 
> 
> I will need at first exact match of a phrase query and in a next step phrase match given
a precision/slop.
> 
> Thanks for your help,
> Oliver
> 
> 
> Am 03.11.2011 um 11:52 schrieb Eugene Kirpichov:
> 
>> If you really need to do millions of exact text queries against millions of
>> documents in realtime, a simple grep is not going to be sufficient for you.
>> You'll need smarter datastructures and algorithms.
>> 
>> Please specify how frequently the set of *queries* changes and what you
>> consider "real time".
>> 
>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <oliver.krohne@yieldkit.com>wrote:
>> 
>>> Hi,
>>> 
>>> I' am evaluating different solutions for massive phrase query execution. I
>>> need to execute millions of greps or more precise phrase queries consisting
>>> of 1-4 terms against millions of documents. I saw the hadoop grep example
>>> but this is executing grep with one regex.
>>> 
>>> I also saw the "Side data distribution" / "Distributed Cache" possibility
>>> of hadoop. So I could pass them to the mapper and execute each query agains
>>> the input line. The input line would be the entire text of an document
>>> (usually 50-500 words).
>>> 
>>> As I am aiming to  have these information almost in realtime another
>>> questions arises about adhoc map/reduce jobs. Is there a limit of running a
>>> lot of jobs in parallel, lets say if I would fire a new job once a new
>>> document arises. This job would only process that particular document. Or I
>>> would batch 100-1000 documents and then fire the job.
>>> 
>>> Can anyone advise an approach of doing it with hadoop?
>>> 
>>> Thanks in advance,
>>> Oliver
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Eugene Kirpichov
>> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
>> Editor, http://fprog.ru/
> 
> ---
> 
> Oliver Krohne
> Founder & CTO
> 
> YieldKit UG (haftungsbeschränkt)
> Mittelweg 161
> 20148 Hamburg
> 
> T +49 40 209 349 771
> F +49 40 209 349 779
> E oliver.krohne@yieldkit.com
> 
> http://www.yieldkit.com
> 
> _____________________________________
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Sandra Tiemann, Oliver Krohne
> Handelsregister: Amtsgericht Hamburg HRB 109104
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 

Mime
View raw message