lucene-java-user mailing list archives

From Arjen van der Meijden <acmmail...@tweakers.net>
Subject Re: NewBie To Lucene || Perfect configuration on a 64 bit server
Date Mon, 26 May 2014 20:50:46 GMT
You don't need to worry about the 1024 maxBooleanClauses limit; just use 
a TermsFilter.

https://lucene.apache.org/core/4_8_0/queries/org/apache/lucene/queries/TermsFilter.html
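
If you did want to stay with a plain BooleanQuery of ID clauses instead, 
that 1024 limit is only a default and can be raised. A minimal, purely 
illustrative sketch (the 10240 value is arbitrary):

import org.apache.lucene.search.BooleanQuery;

public class RaiseClauseLimit {
    public static void main(String[] args) {
        // The global default is 1024; raise it before building large
        // OR-queries of IDs. The value 10240 is arbitrary.
        BooleanQuery.setMaxClauseCount(10240);
    }
}

A TermsFilter sidesteps that limit entirely, though, which is why I'd 
just use it.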

I use it for a similar scenario, where we have a data structure outside 
Lucene that determines a subset of our 1.5 million documents. To make 
the search (much) faster, I convert a list of IDs (primary keys in the 
database) into a bunch of 'id:X' terms.
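
For reference, here is a minimal sketch of that approach against the 
Lucene 4.8 API. The "id" field name and the searchSubset() helper are 
just illustrative, and the IDs are assumed to come from whatever selects 
the subset outside Lucene (in your case the DB stored procedure):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class SubsetSearch {

    // Searches only the documents whose "id" field matches one of the
    // IDs selected outside Lucene (e.g. by the DB stored procedure).
    public static TopDocs searchSubset(IndexSearcher searcher, Query query,
            List<String> idsFromDatabase) throws IOException {
        // Build one 'id:X' term per selected document.
        List<Term> terms = new ArrayList<Term>(idsFromDatabase.size());
        for (String id : idsFromDatabase) {
            terms.add(new Term("id", id));
        }
        Filter subsetFilter = new TermsFilter(terms);

        // Run the text query restricted to the subset. A Filter is not a
        // BooleanQuery, so the maxBooleanClauses limit does not apply.
        return searcher.search(query, subsetFilter, 500);
    }
}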

If you have other criteria besides those IDs (say a category ID or some 
other grouped selection), you could index those alongside your documents 
and use TermsFilters (and/or a BooleanFilter combining several other 
filters) to make a pretty fast subset selection.
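
If you go that way, combining the filters could look roughly like this; 
again a sketch against the 4.8 API, and the "category" field with its 
values is made up purely for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.BooleanFilter;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Filter;

public class CombinedSubsetFilter {

    // Restricts results to the ID subset AND to one of the given
    // categories. The "category" field and its values are made up.
    public static Filter build(Filter idSubsetFilter) {
        BooleanFilter combined = new BooleanFilter();
        combined.add(idSubsetFilter, BooleanClause.Occur.MUST);
        combined.add(new TermsFilter(new Term("category", "17"),
                                     new Term("category", "42")),
                     BooleanClause.Occur.MUST);
        return combined;
    }
}

The two Occur.MUST clauses mean a document has to satisfy both filters, 
i.e. an AND of the two conditions.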

It won't be faster than having a dedicated 500-document index, but if 
the alternative is recreating that index on the fly, I'd expect this 
approach to easily beat the total time of that procedure by a few orders 
of magnitude.

Best regards,

Arjen

On 26-5-2014 18:15 Erick Erickson wrote:
> bq: We don’t want to search on the complete document store
>
> Why not? Alexandre's comment is spot on. For 500 docs you could easily
> form a filter query like &fq=id1 OR id2 OR id3.... (Solr-style, but
> easily done in Lucene). You get these IDs from the DB search. This will
> still be MUCH faster than indexing on the fly.
>
> The default maxBooleanClauses of 1024 is just a configuration setting;
> I've seen it set to 10 times that.
>
> And you could cache the filter if you wanted to and it fits your use case.
>
> Unless you _really_ can show that this solution is untenable, I think
> you're making this problem far too hard for yourself.
>
> If you insist on indexing these docs on the fly, you'll have to live
> with the performance hit. There's no real magic bullet to make your
> indexing sub-second. As others have said, indexing 500 docs seems like
> it shouldn't take as long as you're reporting. I personally suspect
> that your problem is somewhere in the acquisition phase. What happens
> if you just comment out all the code that actually does anything with
> Lucene and just go through the motions of getting the doc from the
> system-of-record in your code? My bet is that if you comment out the
> indexing part, you'll find you still spend 18 of your 20 seconds
> there (SWAG).
>
> If my bet is correct, then there's _nothing_ you can do to make this
> case work as far as Lucene is concerned; Lucene has nothing to do with
> the speed issue, it's acquiring the docs in the first place that's slow.
>
> And if I'm wrong, then there's also virtually nothing you can do.
> Lucene is fast, very fast. You're apparently indexing things that are
> big/complex/whatever.
>
> Really, please explain why indexing all the docs and using a filter of
> the IDs from the DB won't work. This really, really smells like an XY
> problem, where you have a flawed approach that is best scrapped.
>
> Best,
> Erick
>
>
> On Mon, May 26, 2014 at 6:08 AM, Alexandre Patry
> <alexandre.patry@keatext.com> wrote:
>> On 26/05/2014 05:40, Shruthi wrote:
>>>
>>> Hi All,
>>>
>>> Thanks for the suggestions. But there is a slight difference in the
>>> requirements.
>>> 1. We don't index/search 10 million documents for a keyword; instead we
>>> do it on only 500 documents, because we are supposed to get the final
>>> result only from that set of 500 documents.
>>> 2. We have already filtered 500 documents from the 10M+ documents based
>>> on a DB stored procedure which has nothing to do with any kind of search
>>> keywords.
>>> 3. Our search algorithm plays a vital role on this new set of 500
>>> documents.
>>> 4. We can't avoid on-the-fly indexing because the document set to be
>>> indexed is random and ever changing.
>>> Although we could index the existing 10M+ docs beforehand and keep the
>>> indexes ready, we don't want to search on the complete document store.
>>> Instead we only want to search on the 500 documents obtained above.
>>>
>>> Is there any better alternative for this requirement?
>>
>> You could index all 10 million documents and use a custom filter[1] with
>> your queries to specify which 500 documents to look at.
>>
>> Hope this helps,
>>
>> Alexandre
>>
>> [1]
>> http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Filter.html
>>>
>>>
>>> Thanks,
>>>
>>> Shruthi Sethi
>>> SR. SOFTWARE ENGINEER
>>> iMedX
>>> OFFICE:
>>> 033-4001-5789 ext. N/A
>>> MOBILE:
>>> 91-9903957546
>>> EMAIL:
>>> ssethi@imedx.com
>>> WEB:
>>> www.imedx.com
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: shashi.mit@gmail.com [mailto:shashi.mit@gmail.com] On Behalf Of
>>> Shashi Kant
>>> Sent: Saturday, May 24, 2014 5:55 AM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit server
>>>
>>> To second Vitaly's suggestion: you should consider using Apache Solr
>>> instead - it handles such issues OOTB.
>>>
>>>
>>> On Fri, May 23, 2014 at 7:52 PM, Vitaly Funstein <vfunstein@gmail.com>
>>> wrote:
>>>>
>>>> At the risk of sounding overly critical here, I would say you need to
>>>> scrap your entire approach of building one small index per request,
>>>> and just build your entire searchable data store in Lucene/Solr. This
>>>> is the simplest and probably most maintainable and scalable solution.
>>>> Even if your index contains 10M+ documents, returning at most 500
>>>> search results should be lightning fast compared to the latencies
>>>> you're seeing right now. To facilitate data export from the DB, take
>>>> a look at this:
>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>
>>>>
>>>> On Tue, May 20, 2014 at 7:36 AM, Shruthi <ssethi@imedx.com> wrote:
>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Toke Eskildsen [mailto:te@statsbiblioteket.dk]
>>>>> Sent: Tuesday, May 20, 2014 3:48 PM
>>>>> To: java-user@lucene.apache.org
>>>>> Subject: Re: NewBie To Lucene || Perfect configuration on a 64 bit
>>>>> server
>>>>>
>>>>> On Tue, 2014-05-20 at 11:56 +0200, Shruthi wrote:
>>>>>
>>>>> Toke:
>>>>>>
>>>>>> Is 20 seconds an acceptable response time for your users?
>>>>>>
>>>>>> Shruthi: It's definitely not acceptable. PFA the piece of code that
>>>>>> we are using... It's taking 20 seconds. That's why I drafted this
>>>>>> ticket to see where I was going wrong.
>>>>>
>>>>> Indexing 1000 documents/sec in Lucene is quite common, so even taking
>>>>> into account large documents, 20 seconds sounds like quite a bit.
>>>>> Shruthi: I had attached the code snippet in the previous mail. Do you
>>>>> suspect foul play there?
>>>>>
>>>>>> Shruthi: Well, it's a two-stage process: the client is looking at
>>>>>> historical data based on parameters like names, dates, MRN, fields,
>>>>>> etc. So the query actually gets the data set fulfilling the
>>>>>> requirements.
>>>>>>
>>>>>> If the client is interested in doing a text search, then he would
>>>>>> pass the search phrase on the result set.
>>>>>
>>>>> So it is not possible for a client to perform a broad phrase search to
>>>>> start with. And it sounds like your DB queries are all simple matching?
>>>>> No complex joins and such? If so, this calls even more for a full
>>>>> Lucene-index solution, which handles all aspects of the search process.
>>>>> Shruthi: We call a DB stored procedure to get us the result set to
>>>>> work with.
>>>>> We will be using the highlighter API, and I don't think MemoryIndex
>>>>> can be used with the highlighter.
>>>>>
>>>>> - Toke Eskildsen, State and University Library, Denmark
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>>
>> --
>> Alexandre Patry, Ph.D
>> Chercheur / Researcher
>> http://KeaText.com
>>
>>
>>
>>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

