lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Askar Zaidi" <askar.za...@gmail.com>
Subject Re: Fine Tuning Lucene implementation
Date Wed, 25 Jul 2007 00:00:54 GMT
Shall I setMergeFactor = 2 ?

Slow indexing is not a bother.


On 7/24/07, Askar Zaidi <askar.zaidi@gmail.com> wrote:
>
> I ran some tests and it seems that the slowness is from Lucene calls when
> I do "doBodySearch", if I remove that call, Lucene gives me results in 5
> seconds. otherwise it takes about 50 seconds.
>
> But I need to do Body search and that field contains lots of text. The
> field is <contents>. How can I optimize that ?
>
> thanks,
> Askar
>
>
>
> On 7/24/07, Grant Ingersoll <gsingers@apache.org> wrote:
> >
> > Sorry, I mistyped. I don't mean the getXXXX methods, I mean the
> > doTagSearch, doTitleSearch, etc.
> >
> > As for the stop watch, not really sure what to make of that...  Try
> > System.currentTimeMillis()...
> >
> > You can get just the fields you want when loading a Document by using
> > the FieldSelector API on IndexReader, etc.
> >
> > Perhaps, you can also use some Filters and cache them.
> >
> > Its really hard to give suggestions when it is not at all obvious
> > where the slowness is.  Please try to isolate the Lucene calls from
> > the DB calls and look at the timings for both.
> >
> > On Jul 24, 2007, at 5:28 PM, Askar Zaidi wrote:
> >
> > > Thanks for the reply.
> > >
> > > I am timing the entire search process with a stop watch, a bit
> > > ghetto style.
> > > My getXXX methods are:
> > >
> > > Document doc = hits.doc(i);
> > > String str = doc.get("item");
> > >
> > > So you can see that I am retrieving the entire document in a search
> > > query.
> > > Ideally , I'd like to just retrieve the Field object that I want to
> > > run the
> > > search on. I know this will give me a boost as one of my Fields is
> > > really
> > > huge.
> > >
> > > My query is selecting the entire user data-set in the database. I'd
> > > like to
> > > do some SQL based search in the query too so that I pick only those
> > > items
> > > where the phrase matches.
> > >
> > > Index contains about 650MB of data. Index file size is 14478869 bytes.
> > >
> > > thanks,
> > > AZ
> > >
> > >
> > > On 7/24/07, Grant Ingersoll <gsingers@apache.org > wrote:
> > >>
> > >> Where are you getting your numbers from?  That is, where are your
> > >> timers?  Are you timing the rs.next() loop, or the individual calls
> > >> to Lucene?  What do the getXXXXX methods look like?  How big are your
> >
> > >> queries?  How big is your index?
> > >>
> > >> Essentially, we need more info to really help you.  From what I can
> > >> tell, you are generating 3 different Lucene queries for each record
> > >> in the database.  Frankly, I surprised your slowdown is only linear.
> > >>
> > >> On Jul 24, 2007, at 4:31 PM, Askar Zaidi wrote:
> > >>
> > >>> I have 512MB RAM allocated to JVM Heap. If I double my system RAM
> > >>> from 768MB
> > >>> to say 2GB or so, and give JVM 1.5GB Heap space, will I get quicker
> > >>> results
> > >>> ?
> > >>>
> > >>> Can I expect results which take 1 minute to be returned in 30
> > >>> seconds with
> > >>> more RAM ? Should I also get a more powerful CPU ? A real server
> > >>> class
> > >>> machine ?
> > >>>
> > >>> I have also done some of the optimizations that are mentioned on
> > >>> the Lucene
> > >>> website.
> > >>>
> > >>> thanks,
> > >>> AZ
> > >>>
> > >>> On 7/24/07, Askar Zaidi <askar.zaidi@gmail.com > wrote:
> > >>>>
> > >>>> Hey Guys,
> > >>>>
> > >>>> I just finished up using Lucene in my application. I have data
in a
> > >>>> database , so while indexing I extract this data from the database
> > >>>> and pump
> > >>>> it into the index. Specifically , I have the following data in
the
> > >>>> index:
> > >>>>
> > >>>> <itemID> <tags> <title> <summary> <contents>
> > >>>>
> > >>>> where itemID is just a number (primary key in the DB)
> > >>>> tags : text
> > >>>> titie: text
> > >>>> summary: text
> > >>>> contents: Huge text (text extracted from files: pdfs, docs etc).
> > >>>>
> > >>>> Now while running a search query I realized that the response time
> > >>>> increases in a linear fashion as the number of <itemID> increase
> > >>>> in the DB.
> > >>>>
> > >>>> If I have 50 items, its 8 seconds
> > >>>> 100 items, its 17 seconds.
> > >>>> 300+ items, its 60 seconds and maybe more.
> > >>>>
> > >>>> In a perfect world, I'd like to search on 300+ items within 10-15
> > >>>> seconds.
> > >>>> Can anyone give me tips to fine tune lucene ?
> > >>>>
> > >>>> Heres a code snippet:
> > >>>>
> > >>>> sql query = "SELECT itemID from items where creator = 'askar' ;
> > >>>>
> > >>>> --execute query--
> > >>>>
> > >>>> while(rs.next()){
> > >>>>
> > >>>> score = doTagSearch(askar,text,itemID);
> > >>>> scoreTitle = doTitleSearch(askar,text,itemID);
> > >>>> scoreSummary = doSummarySearch(askar,text,itemID);
> > >>>>
> > >>>> ----
> > >>>>
> > >>>> }
> > >>>>
> > >>>> So this code asks Lucene to search for the "text" in the itemID
> > >>>> passed.
> > >>>> itemID is already indexed. The while loop will run 300 times if
> > >>>> there are
> > >>>> 300 items....that gets slow...what can I do here ??
> > >>>>
> > >>>> thanks for the replies,
> > >>>>
> > >>>> AZ
> > >>>>
> > >>
> > >> --------------------------
> > >> Grant Ingersoll
> > >> Center for Natural Language Processing
> > >> http://www.cnlp.org/tech/lucene.asp
> > >>
> > >> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/
> > >> LuceneFAQ
> > >>
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >>
> >
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message