lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Melanie Langlois" <Melanie.Langl...@tradingscreen.com>
Subject RE: Reverse search
Date Fri, 23 Mar 2007 08:57:24 GMT
Well, I though to use the PerFieldAnalyzerWrapper which contains as basic the snowballAnalyzer
with English stopwords and use snowballAnalyzer with language specific keywords for the fields
which will be in different languages. But I'm seeing that in your MemoryIndexTest you commented
the use of SnowballAnalyzer, is it because it's too slow. In this case, I think I could use
the StandardAnalyzer... what do you think?

Mélanie 
  
-----Original Message-----
From: karl wettin [mailto:karl.wettin@gmail.com] 
Sent: Friday, March 23, 2007 12:46 PM
To: java-user@lucene.apache.org
Subject: Re: Reverse search


23 mar 2007 kl. 03.07 skrev Melanie Langlois:

> Thanks Karl, the performances graph is really amazing!
> I have to say that it would not have think this way around would be  
> faster, but sounds nice if I can use this, make everything easier  
> to manage. I'm just wondering what did you consider when you build  
> your graph, only the time to run the queries? Because, I should add  
> the time for creating the index anytime a new document comes in (or  
> a subset of documents if several comes in same time), and the  
> indexing of these documents. The documents should not be big,  
> around 2KB. Did you measure this part ?

Adding a document to a MemoryIndex or InstantiatedIndex takes more or  
less the same time it would take to add it to an empty RAMDirectory.  
How many clock ticks is spent really depends on what analysers you use.

-- 
karl

>
> Mélanie
>
> -----Original Message-----
> From: karl wettin [mailto:karl.wettin@gmail.com]
> Sent: Friday, March 23, 2007 10:35 AM
> To: java-user@lucene.apache.org
> Subject: Re: Reverse search
>
>
> 23 mar 2007 kl. 02.12 skrev Melanie Langlois:
>
>> I want to manage user subscriptions to specific documents. So I
>> would like to store the subscription (query) into the lucene
>> directory, and whenever I receive a new document, I will search all
>> the matching subscriptions to send the documents to all subcribers.
>> For instance if a user subscribes to all documents with text
>> containing (WORD1 and WORD2) or WORD3, how can I match the incoming
>> document based on stored subscriptions? I was thinking to have two
>> subfields for each field of the subscription: the AND conditions
>> and the OR conditions.
>>
>> -OR. I will tokenized the document field content and insert OR
>> between each of them, and run the query against OR condition of
>> subscription
>>
>> -It's for the AND that I will have an issue, because if the
>> incoming text may contains more words than the sequence I want to
>> search.
>>
>> For instance, if I subscribe for documents contents lucene and java
>> for instance , if the incoming document contents is lucene is a
>> great API which has been developed in java, once I removed
>> stopwords my query would look like lucene and great and API and
>> developed and java.
>>
>> As query is composed of more words than the stored subscription I
>> will fail to retrieve the subscription. But if I put only or words,
>> the results will not be accurate, as I can obtain subscription only
>> for java for instance.
>>
>
> I wrote such a thing way back, where I used the new document as the
> query and the user subscriptions as the index. Similar to what you
> describe, I had an AND, OR and NOT field. This really limited the
> type of queries users could store. It does however work, particullary
> well on systems with /huge/ amounts of subscriptions (many millions).
>
> Today I would have used something else. If you insert one document at
> the time to your index, take a look at MemoryIndex in contrib. If you
> insert documents in batches larger than one document at the time,
> take a look at LUCENE-550 in the Jira. Add new documents to such an
> index and place the subscribed queries on it. Depening on the
> queries, the speed should be some 20-100 times faster than using a
> RAMDirectory. One million queries should take some 20 seconds to
> assemble and place on a 25 document index on my laptop. See <https://
> issues.apache.org/jira/secure/attachment/
> 12353601/12353601_HitCollectionBench.jpg> for performace of  
> LUCENE-550.
>
> -- 
> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message