lucene-java-user mailing list archives

From Barry Coughlan <b.coughl...@gmail.com>
Subject Re: Order docIds to reduce disk seeks
Date Wed, 19 Nov 2014 10:59:32 GMT
Hi Vijay,

Could you just bypass Lucene altogether and send the documents to Carrot
from the same place that Lucene got them?

If for some reason you cannot do that, here are some suggestions (note:
I'm not a Lucene expert):

1. If you have other stored fields in your index, ensure you are only
retrieving the text field: is.doc(scoreDoc.doc,
Collections.singleton("doc_text")).
2. Re-use the IndexSearcher object instead of re-opening the index for
different queries. I'm not sure from your code sample if you do this
already.
3. Time your code to ensure that retrieving the stored field is the
bottleneck in your case. If it turns out that searching is slow then you
could store your UUID using DocValues and look up the document IDs in
memory.
4. If you are querying for most of the fields in the index, then it might be
more efficient to iterate through all of the stored fields. I'm not sure
how to do this with the API however.
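
A rough sketch of the "visit hits in docID order" idea discussed in this thread, in plain Java so it runs without Lucene on the classpath. The class name DocIdOrderedFetch, the method fetchInDocIdOrder and the loader callback are made up for illustration; with Lucene the loader would presumably wrap the is.doc(...) call from point 1. It uses Java 8 lambdas, which is an assumption about the runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntFunction;

public class DocIdOrderedFetch {

    /**
     * Loads the text for each hit in ascending docID order, so that a reader
     * backed by the loader only ever moves forward through the stored fields.
     * The loader is a hypothetical stand-in; with Lucene it might be
     * id -> is.doc(id, Collections.singleton("doc_text")).get("doc_text").
     */
    public static List<String> fetchInDocIdOrder(int[] docIds, IntFunction<String> loader) {
        // Sort a copy so the caller's array (e.g. in score order) is left untouched.
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);

        List<String> texts = new ArrayList<>(sorted.length);
        for (int docId : sorted) {
            texts.add(loader.apply(docId));
        }
        return texts;
    }

    public static void main(String[] args) {
        // docIDs as they might come back from a search, in score order.
        int[] hits = {42, 7, 19};
        // Fake loader that just echoes the docID it was asked for.
        List<String> texts = fetchInDocIdOrder(hits, id -> "text-" + id);
        System.out.println(texts); // prints [text-7, text-19, text-42]
    }
}
```

Whether this actually helps depends on point 3: time it first, since the sort only reduces seeking and does not avoid loading each stored document.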

Barry

On Tue, Nov 18, 2014 at 8:11 PM, Vijay B <vijay.nipuna@gmail.com> wrote:

> Hi Barry,
>
> Here is our use case. We fetch doc text from Lucene and feed it to the
> http://carrotsearch.com/ library for generating document clusters as a text
> processing step. The Carrotsearch API needs to be fed with a list of
> org.carrot2.core.Document
> <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html>
> objects constructed out of document title and complete text.
> <http://download.carrot2.org/stable/javadoc/org/carrot2/core/Document.html#Document(java.lang.String, java.lang.String, java.lang.String)>
>
>
>
>
> On Tue, Nov 18, 2014 at 2:53 PM, Barry Coughlan <b.coughlan2@gmail.com>
> wrote:
>
> > Hi Vijay,
> >
> > I'm guessing Michael means that perhaps your text processing step could be
> > better solved by using Lucene features. The use case of Lucene you describe
> > in your post is better suited to a key value store or a relational
> > database.
> >
> > Can you give more details on what your text processing step does?
> >
> > Barry
> >
> > On Nov 18, 2014 7:41 PM, "Vijay B" <vijay.nipuna@gmail.com> wrote:
> > >
> > > > Hi Mike, could you provide some pointers on using the inverted index?
> > > > Any examples, or which API classes to use to accomplish this?
> > >
> > > On Tue, Nov 18, 2014 at 12:40 PM, Michael McCandless <
> > > lucene@mikemccandless.com> wrote:
> > >
> > > > > Even if you sort all hits by docID it's likely too slow to visit every
> > > > > single one and load the stored document ...
> > > > >
> > > > > Try to find another way to solve your problem, making use of the inverted
> > > > > index?
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > >
> > > > > On Mon, Nov 17, 2014 at 6:05 PM, Rose, Stuart J <Stuart.Rose@pnnl.gov> wrote:
> > > > > Hi Vijay,
> > > > >
> > > > > > ...sorting the documents you need to retrieve by docID order first...
> > > > > >
> > > > > > means sorting them by their 'document number' which is the value in the
> > > > > > 'scoreDoc.doc' field and is the value that the reader takes to 'retrieve'
> > > > > > the document from the index. If you write a comparator to sort the elements
> > > > > > in the ScoreDoc[] by their doc field then that will put them in 'docID
> > > > > > order' and the reader will always be skipping forward to the next doc which
> > > > > > will probably reduce its seek time.
> > > > >
> > > > > Regards,
> > > > > Stuart
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Vijay B [mailto:vijay.nipuna@gmail.com]
> > > > > Sent: Monday, November 17, 2014 9:16 AM
> > > > > To: java-user@lucene.apache.org
> > > > > Subject: Order docIds to reduce disk seeks
> > > > >
> > > > > > Could someone point me to how to order docIds as per
> > > > > > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > > > > >
> > > > > > "Limit usage of stored fields and term vectors. Retrieving these from
> > > > > > the index is quite costly. Typically you should only retrieve these for the
> > > > > > current "page" the user will see, not for all documents in the full result
> > > > > > set. For each document retrieved, Lucene must seek to a different location
> > > > > > in various files. Try sorting the documents you need to retrieve by docID
> > > > > > order first."
> > > > >
> > > > > > To give some background:
> > > > > >
> > > > > > We are using plain vanilla Lucene (version 4.2.1) for our application.
> > > > > > We index our documents using stored fields. We add two fields related to
> > > > > > our documents: UUID, a 9-digit number representing the internal id, and
> > > > > > doc_text, the document text (approx. 7K to 20K in size). In our search
> > > > > > code, we use a BooleanQuery to retrieve by UUID and fetch the document
> > > > > > text to use it for other processing. We are noticing slow response times
> > > > > > with the searches. I understand that stored field retrieval is slower and
> > > > > > should be limited, but this is mandatory for our app.
> > > > >
> > > > >
> > > > > > Current code:
> > > > > >
> > > > > > TopScoreDocCollector collector =
> > > > > >     TopScoreDocCollector.create(BooleanQuery.getMaxClauseCount(), true);
> > > > > >
> > > > > > dirReader = DirectoryReader.open(FSDirectory.open(......));
> > > > > > IndexSearcher indexSearcher = new IndexSearcher(dirReader);
> > > > > > indexSearcher.search(query, collector);
> > > > > > ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
> > > > > >
> > > > > > for (ScoreDoc scoreDoc : scoreDocs) {
> > > > > >     Document luceneDoc = indexSearcher.doc(scoreDoc.doc);
> > > > > >     String text = luceneDoc.get("doc_text"); // these calls take a lot of time
> > > > > >
> > > > > >     // process text
> > > > > > }
> > > > >
> > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> >
>
