lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Leonid Maslov" <leoni...@gmail.com>
Subject Re: Newbie question: using Lucene to index hierarchical information.
Date Mon, 01 Sep 2008 13:55:51 GMT
Any comments, suggestions? Maybe I should rephrase my original message or
describe it in detail?
I really would like to get any response if possible.

Thanks a lot in advance!

On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <leonidms@gmail.com> wrote:

> Hi all,
>
> First of all, sorry for my poor English. It's not my native language.
>
> I'm trying to use Lucene to index hierarchical kind of information: I have
> structured html and pdf/word documents and I want to index them in ways to
> perform search in titles, text, paragraphs or tables only, or any
> combinations of items mentioned above. At the moment I see 3 possible
> solutions:
>
>    - Create the set of all possible fields, like: contents, title,
>    heading, table etc... And index the data in all them accordingly. Possible
>    impacts:
>    - a big count of fields
>       - data duplication (because I need to make search looking in the
>       paragraphs to look inside all the inner elements, so every outer element
>       indexed will contain all the inner element content as well)
>    - Create the hierarchy of the fields, like "title", "paragraph/title",
>    "paragraph/title/subparagraph/table". Possible impacts:
>       - count of fields remains the same
>       - soft set of fields (not consistent)
>       - I'm not sure about the ways I could process required information
>       and perform search.
>       - Performance issues?
>       - Use one field for content and just add location prefix to content.
>    For example "contents:*paragraph/heading:*token1 token2". *
>    paragraph/heading:* here is used as additional information prefix. So,
>    I (possibly?) could reuse PrefixQuery functionality or smth. Impacts:
>       - Strong set of index fields (small)
>       - Additional information processing - all the queries I'll use will
>       have to work as PrefixQuery
>       - Performance issues?
>
>
> So, have anyone tried to make things work like that? Or am I trying to use
> wrench to hammer in nails? I assume Lucene wasn't thought to be used like
> that, but it's worth trying (at least asking).
> Any results / suggestions are welcome!
>
> --
> Bests regards,
> Leonid Maslov!
> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
>



-- 
Bests regards,
Leonid Maslov!
Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message