lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "none none" <kor...@lycos.com>
Subject RE: Indexing Speed: Documents vs. Sentences
Date Thu, 18 Dec 2003 02:51:44 GMT
Nothing magic :(
I think the only way possible is to build a DocumentParser, parse the file content to split
into sentences using some punctuation criteria, like:
. new sentence
, same sentence ?
; new sentence
: same ?
( new )end new same for  {[ - ]}
? ! new sentence
etc.
You could use the JavaCC package.
As soon as a sentence is identified you will create a Sentence class, Sentence code should
look like:

import org.apache.lucene.document.*;
...

public static Document Document(String sentence, String doc_id , [any other fields you need])

{
  Document doc = new Document();
  Reader reader = new StringReader(sentence);
  doc.add(Field.Text("sentence", reader));
  doc.add(Field.Keyword("document_id",doc_id);
  // etc...
}

}

this should look familiar, what we do is use a sentence as a "Document", of course it will
be slow, but you can improve the speed of indexing by setting maxFieldLenght to some few hudrend
instead of 10000 (default value), because usually a sentence is shorter than that.
Another improvement could be skip stop words, if you can this will reduce the size of your
index dramatically.
hope this helps.
ciao.

---
KorfuT

--------- Original Message ---------

DATE: Wed, 17 Dec 2003 13:33:59
From: "Jochen Frey" <jochen_frey@yahoo.com>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Cc: 

>Right.
>
>However, even if I do that, my problem #3 below remains unsolved: I do not
>wish to match phrases across sentence boundaries.
>
>Anyone have a neat solution (or pointers to one)?
>
>Thanks again!
>Jochen
>
>> -----Original Message-----
>> From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
>> Sent: Wednesday, December 17, 2003 1:29 PM
>> To: 'Lucene Users List'
>> Subject: RE: Indexing Speed: Documents vs. Sentences
>> 
>> Yeah.  I'd suggest parsing the page, unfortunately. :)
>> 
>> -----Original Message-----
>> From: Jochen Frey [mailto:jochen_frey@yahoo.com]
>> Sent: Wednesday, December 17, 2003 4:26 PM
>> To: 'Lucene Users List'
>> Subject: RE: Indexing Speed: Documents vs. Sentences
>> 
>> 
>> Hi!
>> 
>> In essence:
>> 1) I don't care about the whole page
>> 
>> 2) I only care about the actual sentence that matches the query.
>> 
>> 3) I want the matching for the query only to happen within one sentence
>> and
>> not over sentence boundaries (even when I do a PhraseQuery with some
>> slop).
>> 
>> The query: "i like the beach"~20
>> should not match: "And we go to the restaurant and i really like it. the
>> beach was wonderful as well".
>> 
>> 4) I would much prefer not to parse the actual page to find the sentence
>> that matches the query (though I obviously will, if I have to).
>> 
>> Does that answer your question?
>> 
>> Thanks!
>> Jochen
>> 
>> > -----Original Message-----
>> > From: Dan Quaroni [mailto:dquaroni@OPENRATINGS.com]
>> > Sent: Wednesday, December 17, 2003 1:19 PM
>> > To: 'Lucene Users List'
>> > Subject: RE: Indexing Speed: Documents vs. Sentences
>> >
>> > I'm confused about something - what's the point of creating a document
>> for
>> > every sentence?
>> >
>> > -----Original Message-----
>> > From: Jochen Frey [mailto:jochen_frey@yahoo.com]
>> > Sent: Wednesday, December 17, 2003 4:17 PM
>> > To: 'Lucene Users List'
>> > Subject: Indexing Speed: Documents vs. Sentences
>> >
>> >
>> > Hi,
>> >
>> > I am using Lucene to index a large number of web pages (a few 100GB) and
>> > the
>> > indexing speed is great.
>> >
>> > Lately I have been trying to index on a sentence level, not the document
>> > level. My problem is that the indexing speed has gone down dramatically
>> > and
>> > I am wondering if there is any way for me to improve on that.
>> >
>> > Indexing on a sentence level the overall amount of data stays the same
>> > while
>> > the number of records increases substantially (since there is usually
>> many
>> > sentences to one web page).
>> >
>> > It seems to me like the indexing speed (everything else being the same)
>> > depends largely on the number of Documents inserted into the index, and
>> > not
>> > so much on the size of the data within the documents (correct?).
>> >
>> > I have played with the merge factor, using RAMDirectory, etc and I am
>> > quite
>> > comfortable with our overall configuration, so my guess is that that is
>> > not
>> > the issue (and I am QUITE happy with the indexing speed as long as I use
>> > complete pages and not sentences).
>> >
>> > Maybe there is a different way of attacking this? My goal is to be able
>> to
>> > execute a query and get the sentences that match the query in the most
>> > efficient way while maintaining good/great indexing speed. I would
>> prefer
>> > not having to search the complete document for the sentence in question.
>> >
>> > My current solution is to have one Lucene Document for each page
>> > (containing
>> > the URL and other information I require) that does NOT contain the text
>> of
>> > the page. Then I have one Lucene Document for each sentence within that
>> > document, which contains the text of this particular sentence in
>> addition
>> > to
>> > some identifying information that references the entry of the page
>> itself.
>> >
>> > Any and all suggestions are welcome.
>> >
>> > Thanks!
>> > Jochen
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>



____________________________________________________________
Free Poetry Contest. Win $10,000. Submit your poem @ Poetry.com!
http://ad.doubleclick.net/clk;6750922;3807821;l?http://www.poetry.com/contest/contest.asp?Suite=A59101

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message