lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Is it possible to search for a paragraph in Lucene?
Date Mon, 12 Sep 2016 15:12:04 GMT
First, _you_ have to define a "paragraph". It's one of those
tricky concepts that's totally obvious to a human
but surprisingly hard to implement in code. What's
a paragraph in Chinese? Hebrew? Even in English
it's tricky. How does a PDF signal a paragraph? Is
that consistent with Word? Open Office? How
about an HTML page? The <p> tag isn't used
consistently.

So no, Lucene has no notion of a paragraph; there's
nothing built in that even tries to detect such an abstract
concept. As Ahmet suggests, there are tools out there
that will attempt to detect where the paragraphs are in
your documents. From there, I'd suggest that you index
the paragraphs with a large position gap before the first
word of each one; then you can search for phrases with a
"slop" smaller than that gap, so a match can never span
two paragraphs.
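
To make the gap idea concrete, here is a minimal sketch in plain Java. It is
a toy model, not Lucene code: the class name, the PARAGRAPH_GAP value and the
near() check are all assumptions made up for illustration, and near() is a
simple position-distance test rather than Lucene's actual slop computation.
In Lucene itself the gap would typically come from overriding
Analyzer.getPositionIncrementGap and the slop would be set on a PhraseQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of token positions; all names here are hypothetical.
class ParagraphGapSketch {
    // Assumed gap; it only needs to be larger than any distance used at query time.
    static final int PARAGRAPH_GAP = 1000;

    // Record the position of every token, jumping ahead by
    // PARAGRAPH_GAP after each paragraph.
    static Map<String, List<Integer>> indexParagraphs(List<String> paragraphs) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String paragraph : paragraphs) {
            for (String token : paragraph.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += PARAGRAPH_GAP; // the "large position offset" between paragraphs
        }
        return positions;
    }

    // Simplified proximity test: true if some occurrence of a and some
    // occurrence of b are at most maxDistance positions apart.
    static boolean near(Map<String, List<Integer>> positions,
                        String a, String b, int maxDistance) {
        List<Integer> none = Collections.emptyList();
        for (int pa : positions.getOrDefault(a, none)) {
            for (int pb : positions.getOrDefault(b, none)) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = indexParagraphs(
                Arrays.asList("The quick brown fox.", "Lazy dogs sleep."));
        System.out.println(near(idx, "quick", "fox", 5)); // same paragraph: true
        System.out.println(near(idx, "fox", "lazy", 5));  // across the gap: false
    }
}
```

As long as the distance allowed at query time stays below the gap, two words
from different paragraphs can never satisfy the proximity test, which is the
whole point of the trick.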

Best,
Erick



On Mon, Sep 12, 2016 at 7:25 AM, szzoli <reg9szabo@freemail.hu> wrote:
> Hi,
>
> thanks for the hint.
>
> My exact question is:
>
> Can I use a paragraph of a document as a term to search for in the index?
> Does Lucene create an index only at the word level, or can it be set to index
> at the phrase or paragraph level? Is searching for several words a matter of
> indexing or of searching?
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Is-it-possible-to-search-for-a-paragraph-in-Lucene-tp4295705p4295779.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


