lucene-java-user mailing list archives

From kumarlimbu <kumarli...@gmail.com>
Subject Re: Serving remote lucene client - RMI vs HTTP
Date Tue, 17 Jul 2007 04:51:32 GMT

Thank you, everyone, for your responses.

Our index is around 10GB. Response times are similar across queries. The
only exceptions are while we are updating the index and just after we have
switched from the old index to the new one.

Yesterday Grant and Hossman pointed us to Solr, and we will definitely look
at it. Solr looks like a very good implementation of searching using lucene
clusters. One of our concerns about implementing search with Solr is that
our current application is highly customized, and we don't know how
customizable Solr really is (or rather, how easily we can customize it).
Also, Solr doesn't support QueryFilter out of the box (Hossman: there's
nothing to stop a solr request handler from using QueryFilter's if they
want). How much extra work would that be?

We are, however, very interested in how Solr implements index distribution
over more than one machine. If you can suggest other alternatives or offer
advice, please let me know. :)
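
One requirement from earlier in this thread, taking a searcher out of
rotation while its index is being updated, could be handled on the web
application side if we go with plain HTTP. Below is a minimal sketch in
Java, assuming each search machine is identified by a base URL and that
something outside this class (an admin call or a health check, for example)
tells the web application when an update starts and ends. The names
SearchNodeSelector, markUpdating, markAvailable and pick are hypothetical
and are not part of Lucene, Solr or cluster4spring; the code also relies on
Java 8 library classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical client-side selector: round-robin over search machines,
 * skipping any machine currently flagged as updating its index.
 */
class SearchNodeSelector {
    // Base URLs of the search machines (assumed identifiers).
    private final List<String> nodes;
    // Machines that must not receive queries right now.
    private final Set<String> updating = ConcurrentHashMap.newKeySet();
    // Shared counter used to rotate through the machines.
    private final AtomicInteger next = new AtomicInteger();

    SearchNodeSelector(List<String> nodes) {
        this.nodes = new ArrayList<>(nodes);
    }

    /** Call when a machine starts updating or switching its index. */
    void markUpdating(String node) {
        updating.add(node);
    }

    /** Call when the machine has finished and can serve queries again. */
    void markAvailable(String node) {
        updating.remove(node);
    }

    /** Returns the next available machine, or empty if all are updating. */
    Optional<String> pick() {
        int n = nodes.size();
        for (int i = 0; i < n; i++) {
            // floorMod keeps the index non-negative if the counter overflows.
            int idx = Math.floorMod(next.getAndIncrement(), n);
            String candidate = nodes.get(idx);
            if (!updating.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

The web application would call pick() before each search request and send
the query to the returned URL. How a search machine announces that it is
updating is left open here; that part depends on the deployment.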




Erick Erickson wrote:
> 
> Another question would be what queries take the longest. Are your
> response times pretty constant on a per-query basis or are there
> outliers that could perhaps point to a different solution?
> 
> Finally, what is the size of your index? The total number of documents
> is certainly useful, but so is the finished index size. I can put a
> million documents in a 100M index or a 100G index, and size does
> matter <G>...
> 
> Best
> Erick
> 
> On 7/16/07, Ard Schrijvers <a.schrijvers@hippo.nl> wrote:
>>
>> Hello,
>>
>> > Hi everyone,
>> >
>> > Thank you all for your replies.
>> >
>> > In reply to your questions, Grant:
>> > We have more than 3 million documents in our index.
>> > We get more than 150,000 searches (queries) per day. We
>> > expect this number to go up.
>>
>> Just curious, but suppose those 150,000 searches are done during 8 hours
>> a day; that means around 5 searches a second... that is not too much, is
>> it?
>>
>> Perhaps too trivial a question, but do you only re-open the index reader
>> when something has actually changed in the index (or are you crawling
>> all day? Perhaps you should only crawl during the night)? Reopening the
>> index over and over again when it is not needed is inefficient. Also,
>> there are some caching mechanisms around.
>>
>> Furthermore, if you use sorting in your query, you might gain much speed
>> by indexing the sort fields differently (sorting on a large field is
>> very expensive).
>>
>> Then of course, do you use range searches?
>>
>> Also, did you try raising the maximum memory of your JVM?
>>
>> One more suggestion: try opening your index with "luke" and see whether
>> "luke" is faster at doing the searches. If so, you can certainly find
>> some simple ways to improve performance.
>>
>> Ard
>>
>> > This count only includes actual user searches and doesn't include
>> > robot searches and other internal searches (for sending emails to
>> > users, etc.).
>> > And the final queries we send to lucene are quite complicated. This is
>> > because we need to conform to a lot of criteria (some are set by users
>> > and some are internal logic). I don't think we can simplify our
>> > queries.
>> >
>> > And importantly, we are looking to scale to a larger volume of
>> > traffic. Our current performance in terms of speed is ok.
>> >
>> > We are not sure about implementing this with Solr because we crawl
>> > only specific types of sites, and our crawling mechanism has proven
>> > to be quite stable. We are only looking to scale to a larger number of
>> > users, and doing a new implementation using Solr might be overkill, as
>> > time is a huge factor.
>> >
>> > Best Regards,
>> > Kumar
>> >
>> >
>> > Grant Ingersoll-6 wrote:
>> > >
>> > > Hi Kumar,
>> > >
>> > > I am curious about where the bulk of time was spent in Lucene.  I
>> > > am not doubting your numbers or analysis, just want to know if
>> > > there was anything that could be improved in Lucene and make sure
>> > > you are solidly in need of going to multiple machines as it adds an
>> > > extra level of complexity.  Can you provide info about number of
>> > > documents, # of updates, # of queries etc. ?  What kind of
>> > > analysis, etc. are you doing?  That being said, putting the web app
>> > > on one machine and the search application on the other makes sense
>> > > much in the same way it makes sense in most cases to do this for a
>> > > database.
>> > >
>> > > As for POST/GET vs. RMI, have a look at Solr, esp. its replication
>> > > capabilities for load balancing search servers.  I think it answers
>> > > the question in favor of the POST/GET approach.  In fact, you may
>> > > be able to drop in Solr to your situation w/o too much work.
>> > >
>> > > Cheers,
>> > > Grant
>> > >
>> > >
>> > > On Jul 15, 2007, at 10:10 PM, kumarlimbu wrote:
>> > >
>> > >>
>> > >> Hi Everyone,
>> > >>
>> > >> We are using lucene, nutch and the spring framework to create a
>> > >> specialized search engine. Due to growing traffic we are trying to
>> > >> scale. By doing some tests we found out that the bottleneck was the
>> > >> lucene search. We used some heavy traffic simulation and logged
>> > >> the time taken by each portion of the server response and found
>> > >> that the bulk of the time was spent searching the lucene index.
>> > >>
>> > >> In order to accommodate higher traffic we are planning on
>> > >> splitting our application into 2 portions:
>> > >> 1. Web application (on 1 machine)
>> > >> 2. Search application (on more than 1 machine)
>> > >>
>> > >> Each of the applications will reside on a (possibly) separate
>> > >> machine. We are looking to scale by adding more than 1 machine
>> > >> dedicated to searching, as lucene search seems to be the
>> > >> bottleneck.
>> > >>
>> > >> The web application will provide the front-end to the user. All
>> > >> static pages, images and the style information will reside on this
>> > >> machine. It will also serve dynamic pages, but all the searching
>> > >> will take place on the search application. The web application
>> > >> will send search parameters to the search application, which will
>> > >> run the search and send the results back to the web application,
>> > >> which will format them and display them to the user.
>> > >>
>> > >> What we are unable to decide is whether to use RMI (cluster4spring,
>> > >> http://www.soft-amis.com/index.html?return=http://www.soft-amis.com/cluster4spring/index.html)
>> > >> or simple HTTP POST/GET requests with the response in XML format.
>> > >> We did some research and found out that RMI (with clustering
>> > >> support) might be more suitable for our needs. Unfortunately our
>> > >> team is not familiar with RMI, so we don't know if there will be
>> > >> any issues with it during implementation.
>> > >> The advantage of using simple GET/POST is that we have more control
>> > >> over which searcher app to use and when. An important criterion
>> > >> for us is to disable searching on any searcher app whose index is
>> > >> being updated. This is very important for us. Is there any way in
>> > >> RMI to inform the web (client) application that a particular
>> > >> server is unavailable (due to its index being updated)?
>> >>
>> > >> We would also like to know if anybody has implemented lucene
>> > >> searching in a similar fashion. Thanks for your help.
>> > >>
>> > >> Kumar Limbu
>> > >>
>> > >> --
>> > >> View this message in context:
>> > >> http://www.nabble.com/Serving-remote-lucene-client---RMI-vs-HTTP-tf4084167.html#a11608209
>> > >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >
>> > --------------------------
>> > Grant Ingersoll
>> > Center for Natural Language Processing
>> > http://www.cnlp.org/tech/lucene.asp
>> >
>> > Read the Lucene Java FAQ at
>> > http://wiki.apache.org/lucene-java/LuceneFAQ
>> >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Serving-remote-lucene-client---RMI-vs-HTTP-tf4084167.html#a11609007
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Serving-remote-lucene-client---RMI-vs-HTTP-tf4084167.html#a11643027
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

