Hi!
In essence:
1) I don't care about the whole page
2) I only care about the actual sentence that matches the query.
3) I want the matching for the query only to happen within one sentence and
not over sentence boundaries (even when I do a PhraseQuery with some slop).
The query: "i like the beach"~20
should not match: "And we go to the restaurant and i really like it. the
beach was wonderful as well".
4) I would much prefer not to parse the actual page to find the sentence
that matches the query (though I obviously will, if I have to).
Does that answer your question?
Thanks!
Jochen
> -----Original Message-----
> From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
> Sent: Wednesday, December 17, 2003 1:19 PM
> To: 'Lucene Users List'
> Subject: RE: Indexing Speed: Documents vs. Sentences
>
> I'm confused about something - what's the point of creating a document for
> every sentence?
>
> -----Original Message-----
> From: Jochen Frey [mailto:jochen_frey@yahoo.com]
> Sent: Wednesday, December 17, 2003 4:17 PM
> To: 'Lucene Users List'
> Subject: Indexing Speed: Documents vs. Sentences
>
>
> Hi,
>
> I am using Lucene to index a large number of web pages (a few 100GB) and
> the
> indexing speed is great.
>
> Lately I have been trying to index on a sentence level, not the document
> level. My problem is that the indexing speed has gone down dramatically
> and
> I am wondering if there is any way for me to improve on that.
>
> Indexing on a sentence level the overall amount of data stays the same
> while
> the number of records increases substantially (since there is usually many
> sentences to one web page).
>
> It seems to me like the indexing speed (everything else being the same)
> depends largely on the number of Documents inserted into the index, and
> not
> so much on the size of the data within the documents (correct?).
>
> I have played with the merge factor, using RAMDirectory, etc and I am
> quite
> comfortable with our overall configuration, so my guess is that that is
> not
> the issue (and I am QUITE happy with the indexing speed as long as I use
> complete pages and not sentences).
>
> Maybe there is a different way of attacking this? My goal is to be able to
> execute a query and get the sentences that match the query in the most
> efficient way while maintaining good/great indexing speed. I would prefer
> not having to search the complete document for the sentence in question.
>
> My current solution is to have one Lucene Document for each page
> (containing
> the URL and other information I require) that does NOT contain the text of
> the page. Then I have one Lucene Document for each sentence within that
> document, which contains the text of this particular sentence in addition
> to
> some identifying information that references the entry of the page itself.
>
> Any and all suggestions are welcome.
>
> Thanks!
> Jochen
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
|