lucene-dev mailing list archives

From "Reuven Ivgi" <Reuv...@intellinx-sw.com>
Subject RE: Define end-of-paragraph
Date Tue, 03 Oct 2006 09:27:32 GMT
Hello,
To be more precise, the basic entity I am using is a document, and each
document may contain up to a few thousand paragraphs. I need proximity
search within a paragraph, but I also want the paragraph number returned
as part of the search result. Maybe defining each paragraph as a
separate field is the best way.
What do you think?
Thanks in advance

Reuven Ivgi

-----Original Message-----
From: Chuck Williams [mailto:chuck@manawiz.com] 
Sent: Tuesday, October 03, 2006 10:58 AM
To: java-dev@lucene.apache.org
Subject: Re: Define end-of-paragraph


Reuven Ivgi wrote on 10/02/2006 09:32 PM:
> I want to divide a document to paragraphs, still having proximity search
> within each paragraph
>
> How can I do that?
>   

Is your issue that you want the paragraphs to be in a single document,
but you want to limit proximity search to find matches only within a
single paragraph?  If so, you could parse your document into paragraphs
and, when generating tokens for it, place large position gaps at the
paragraph boundaries.  Proximity matching uses token positions, so the
gap is created with Token.setPositionIncrement() as you generate Tokens
inside TokenStream.next() for the TokenStream returned by your Analyzer
(a Token's startOffset and endOffset are character offsets and do not
affect proximity).  Those classes and methods are all in
org.apache.lucene.analysis.  Alternatively, you could make each
paragraph a separate value of the same field and override
Analyzer.getPositionIncrementGap() to achieve essentially the same thing
(except that your Documents could get unwieldy if you have many
paragraphs).
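
As a rough illustration of the gap idea, here is a minimal sketch in
plain Java. It is not Lucene code: the class ParagraphGapSketch, the
constant PARAGRAPH_GAP and the methods assignPositions() and near() are
all hypothetical names made up for this example. It only imitates how
positions would be numbered if a large increment were applied at each
paragraph boundary, and shows that a small-slop proximity check can then
no longer match across paragraphs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of paragraph position gaps; not Lucene API code. */
class ParagraphGapSketch {
    // Hypothetical gap size; it must be larger than any slop used in queries.
    static final int PARAGRAPH_GAP = 1000;

    /** Maps each term to its positions, skipping PARAGRAPH_GAP at paragraph boundaries. */
    static Map<String, List<Integer>> assignPositions(String[] paragraphs) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = -1;
        for (int p = 0; p < paragraphs.length; p++) {
            boolean firstInParagraph = true;
            for (String term : paragraphs[p].toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                // The first token of every later paragraph gets the large increment.
                position += (p > 0 && firstInParagraph) ? PARAGRAPH_GAP + 1 : 1;
                firstInParagraph = false;
                positions.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
            }
        }
        return positions;
    }

    /** True if some occurrence of a lies within slop positions of some occurrence of b. */
    static boolean near(Map<String, List<Integer>> positions, String a, String b, int slop) {
        for (int pa : positions.getOrDefault(a, Collections.<Integer>emptyList())) {
            for (int pb : positions.getOrDefault(b, Collections.<Integer>emptyList())) {
                if (Math.abs(pa - pb) <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] paragraphs = {
            "Lucene indexes terms with positions",
            "Proximity queries compare them"
        };
        Map<String, List<Integer>> positions = assignPositions(paragraphs);
        // Both terms are in the first paragraph.
        System.out.println(near(positions, "lucene", "terms", 5));
        // Adjacent words in the text, but separated by the paragraph gap.
        System.out.println(near(positions, "positions", "proximity", 5));
    }
}
```

In a real index the same effect would come from the position increment
on the first Token of each paragraph, or from the gap returned by
Analyzer.getPositionIncrementGap() between field values; the gap value
of 1000 here is only an assumption and should be chosen to exceed the
largest slop you expect to query with.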

If this is not what you are trying to do, then please explain your
objectives precisely.

Good luck,

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org