lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jochen Frey" <jochen_f...@yahoo.com>
Subject RE: Indexing Speed: Documents vs. Sentences
Date Wed, 17 Dec 2003 21:33:59 GMT
Right.

However, even if I do that, my problem #3 below remains unsolved: I do not
wish to match phrases across sentence boundaries.

Anyone have a neat solution (or pointers to one)?

Thanks again!
Jochen

> -----Original Message-----
> From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
> Sent: Wednesday, December 17, 2003 1:29 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
> 
> Yeah.  I'd suggest parsing the page, unfortunately. :)
> 
> -----Original Message-----
> From: Jochen Frey [mailto:jochen_frey@yahoo.com]
> Sent: Wednesday, December 17, 2003 4:26 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
> 
> 
> Hi!
> 
> In essence:
> 1) I don't care about the whole page
> 
> 2) I only care about the actual sentence that matches the query.
> 
> 3) I want the matching for the query only to happen within one sentence
> and
> not over sentence boundaries (even when I do a PhraseQuery with some
> slop).
> 
> The query: "i like the beach"~20
> should not match: "And we go to the restaurant and i really like it. the
> beach was wonderful as well".
> 
> 4) I would much prefer not to parse the actual page to find the sentence
> that matches the query (though I obviously will, if I have to).
> 
> Does that answer your question?
> 
> Thanks!
> Jochen
> 
> > -----Original Message-----
> > From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
> > Sent: Wednesday, December 17, 2003 1:19 PM
> > To: 'Lucene Users List'
> > Subject: RE: Indexing Speed: Documents vs. Sentences
> >
> > I'm confused about something - what's the point of creating a document
> for
> > every sentence?
> >
> > -----Original Message-----
> > From: Jochen Frey [mailto:jochen_frey@yahoo.com]
> > Sent: Wednesday, December 17, 2003 4:17 PM
> > To: 'Lucene Users List'
> > Subject: Indexing Speed: Documents vs. Sentences
> >
> >
> > Hi,
> >
> > I am using Lucene to index a large number of web pages (a few 100GB) and
> > the
> > indexing speed is great.
> >
> > Lately I have been trying to index on a sentence level, not the document
> > level. My problem is that the indexing speed has gone down dramatically
> > and
> > I am wondering if there is any way for me to improve on that.
> >
> > Indexing on a sentence level the overall amount of data stays the same
> > while
> > the number of records increases substantially (since there is usually
> many
> > sentences to one web page).
> >
> > It seems to me like the indexing speed (everything else being the same)
> > depends largely on the number of Documents inserted into the index, and
> > not
> > so much on the size of the data within the documents (correct?).
> >
> > I have played with the merge factor, using RAMDirectory, etc and I am
> > quite
> > comfortable with our overall configuration, so my guess is that that is
> > not
> > the issue (and I am QUITE happy with the indexing speed as long as I use
> > complete pages and not sentences).
> >
> > Maybe there is a different way of attacking this? My goal is to be able
> to
> > execute a query and get the sentences that match the query in the most
> > efficient way while maintaining good/great indexing speed. I would
> prefer
> > not having to search the complete document for the sentence in question.
> >
> > My current solution is to have one Lucene Document for each page
> > (containing
> > the URL and other information I require) that does NOT contain the text
> of
> > the page. Then I have one Lucene Document for each sentence within that
> > document, which contains the text of this particular sentence in
> addition
> > to
> > some identifying information that references the entry of the page
> itself.
> >
> > Any and all suggestions are welcome.
> >
> > Thanks!
> > Jochen
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message