lucene-java-user mailing list archives

From "Leonid Maslov" <leoni...@gmail.com>
Subject Re: Newbie question: using Lucene to index hierarchical information.
Date Thu, 04 Sep 2008 19:30:07 GMT
Hi all,
Thanks a lot for such a quick reply.

Both scenarios sound good to me. I would like to try implementing one of
them as a proof of concept and then incrementally improve, retest,
investigate and rewrite it :)

So, from the soap opera to the question part then:

   - How can I implement those scenarios (a and b) on top of the Lucene and
   Lucene contrib codebase?
      - I looked at
      http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
      and didn't like it (too big, too heavy, a ready-to-use solution instead
      of a toolkit). I also didn't understand how to implement the "Normal"
      scenario on top of it.
   - Any suggestions on how I could begin implementing these things, gently
   moving from the "Normal" scenario to the more advanced "Complex" one? What
   should I be afraid of, and what are the possible impacts, if any?
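
For what it's worth, the "Normal" scenario quoted below can be modelled in a
few lines of plain Java before touching any Lucene API. This is only a rough
sketch: it keeps one entry per section with a section type and its text,
filters by section type, and groups the matching sections by source file so
that each file is reported once. All names here (SectionIndexSketch, Section,
fileId, the sample file names) are hypothetical; it is not XTF's
implementation and it uses no Lucene classes. In a real index, fileId would
presumably become a third field next to "text" and "sectionType".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of "one index entry per section"; no Lucene classes are used. */
public class SectionIndexSketch {

    /** One indexed unit: a single section of a source file (hypothetical layout). */
    static final class Section {
        final String fileId;       // which html/pdf file the section came from
        final String sectionType;  // e.g. "title", "paragraph", "table"
        final String text;         // the section's own text

        Section(String fileId, String sectionType, String text) {
            this.fileId = fileId;
            this.sectionType = sectionType;
            this.text = text;
        }
    }

    private final List<Section> sections = new ArrayList<>();

    void add(String fileId, String sectionType, String text) {
        sections.add(new Section(fileId, sectionType, text));
    }

    /** Returns matching sections grouped by file, so each file appears once. */
    Map<String, List<Section>> search(String word, Set<String> sectionTypes) {
        String needle = word.toLowerCase(Locale.ROOT);
        Map<String, List<Section>> hitsByFile = new LinkedHashMap<>();
        for (Section s : sections) {
            if (!sectionTypes.contains(s.sectionType)) {
                continue; // wrong kind of section
            }
            if (!containsWord(s.text, needle)) {
                continue; // word not present in this section
            }
            hitsByFile.computeIfAbsent(s.fileId, k -> new ArrayList<>()).add(s);
        }
        return hitsByFile;
    }

    /** Naive whole-word match on lower-cased tokens (stand-in for an analyzer). */
    private static boolean containsWord(String text, String needle) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SectionIndexSketch index = new SectionIndexSketch();
        index.add("a.html", "title", "Lucene in practice");
        index.add("a.html", "table", "Lucene version matrix");
        index.add("b.html", "paragraph", "Lucene is mentioned here only in running text");

        Set<String> types = new LinkedHashSet<>();
        types.add("title");
        types.add("table");

        // Two sections of a.html match, but the file is listed only once.
        Map<String, List<Section>> hits = index.search("lucene", types);
        System.out.println(hits.keySet()); // prints [a.html]
    }
}
```

Searching "in titles or tables only" is then just a different set of section
types, and the grouping step is where several section hits collapse into one
hit per file.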

Has anybody tried to use Lucene to analyse things like that? What would be
possible solutions for storing the indexed data and performing queries on it?
If Lucene isn't the right tool for this job, maybe some other toolkit would
be more useful (possibly one built on top of Lucene)?

Thanks in advance for any suggestions and comments. I would appreciate any
ideas and directions to look into.


On Tue, Sep 2, 2008 at 11:46 AM, Karsten F.
<karsten-lucene@fiz-technik.de> wrote:

>
> Hi Leonid,
>
> what kind of query is your use case?
>
> Complex scenario:
> You need all the hierarchical structure information in one query. This
> means you want to search with XPath in a real XML database (like: all
> documents with a subtitle XY which contain, directly after this subtitle,
> a table with the same column like ...).
>
> Normal scenario:
> You want to search for only one part of your hierarchical information like
> 'Document with word xy in title' and 'Documents with word xy in table'.
>
> I am not familiar with the use of Lucene in XML databases, but I can
> advise on the "normal scenario":
>
> Take a look at the XML-aware search in XTF
> (http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7).
> The idea is to use one Lucene document for each section, with only two
> fields: "text" and "sectionType",
> and then to collect all hits belonging to one hierarchical unit (e.g. one
> HTML file) and collapse them into one representative hit in Lucene.
>
> Best regards
>  Karsten
>
>
> leonardinius wrote:
> >
> > Any comments, suggestions? Maybe I should rephrase my original message or
> > describe it in detail?
> > I really would like to get any response if possible.
> >
> > Thanks a lot in advance!
> >
> > On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <leonidms@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> First of all, sorry for my poor English. It's not my native language.
> >>
> >> I'm trying to use Lucene to index hierarchical information: I have
> >> structured HTML and PDF/Word documents, and I want to index them so that
> >> I can search in titles, text, paragraphs or tables only, or in any
> >> combination of these. At the moment I see 3 possible solutions:
> >>
> >>    - Create the set of all possible fields, like: contents, title,
> >>    heading, table etc., and index the data in all of them accordingly.
> >>    Possible impacts:
> >>       - a large number of fields
> >>       - data duplication (a search in paragraphs must also look inside
> >>       all the inner elements, so every outer element indexed will contain
> >>       the content of all its inner elements as well)
> >>    - Create a hierarchy of fields, like "title", "paragraph/title",
> >>    "paragraph/title/subparagraph/table". Possible impacts:
> >>       - the number of fields remains the same
> >>       - a soft (inconsistent) set of fields
> >>       - I'm not sure how I could process the required information and
> >>       perform searches.
> >>       - Performance issues?
> >>    - Use one field for content and just add a location prefix to the
> >>    content. For example "contents:*paragraph/heading:*token1 token2".
> >>    *paragraph/heading:* is used here as an additional information prefix.
> >>    So I (possibly?) could reuse PrefixQuery functionality or something
> >>    similar. Impacts:
> >>       - a strong (small) set of index fields
> >>       - additional information processing: all the queries I use will
> >>       have to work as PrefixQuery
> >>       - Performance issues?
> >>
> >>
> >> So, has anyone tried to make things work like that? Or am I trying to
> >> use a wrench to hammer in nails? I assume Lucene wasn't designed to be
> >> used like that, but it's worth trying (or at least asking).
> >> Any results / suggestions are welcome!
> >>
> >> --
> >> Bests regards,
> >> Leonid Maslov!
> >> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
> >>
> >
> >
> >
> > --
> > Bests regards,
> > Leonid Maslov!
> > Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
> >
> >
>


-- 
Bests regards,
Leonid Maslov!
Princess Margaret  - "I have as much privacy as a goldfish in a bowl."
