lucene-java-user mailing list archives

From "Leonid M." <leoni...@gmail.com>
Subject Re: Newbie question: using Lucene to index hierarchical information.
Date Wed, 10 Sep 2008 09:12:01 GMT
Hi Karsten,
Thanks a lot. I finally got your idea.

OK, I think it's time to do the real work now :) Thanks for the advice;
I finally understand the directions I could take.

> do you really need the "Complex scenario"?
> what kind of query is your use case?

My query use case is something like this: find documents whose paragraphs
are similar to the paragraphs of a given document, to a single paragraph, or
to part of one (using n-grams or similar/modified tokenizers, and stemming-
or NLP-based similarity).
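
To make the use case concrete, here is a minimal sketch of what I mean by
n-gram similarity between paragraphs. It is plain Python with no Lucene
calls; the function names, the choice of character trigrams, and the Jaccard
measure are all assumptions made for illustration, not something xtf or
Lucene prescribes.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased,
    whitespace-normalised text (hypothetical helper)."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < n:
        # Texts shorter than n contribute themselves as a single gram.
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def paragraph_similarity(p1, p2, n=3):
    """Similarity in [0, 1] between two paragraphs, based on shared n-grams."""
    return jaccard(char_ngrams(p1, n), char_ngrams(p2, n))
```

Identical paragraphs score 1.0, paragraphs with no n-gram in common score
0.0, and near-duplicates (e.g. differing only in word endings) fall somewhere
in between, which is roughly the behaviour I am after.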

I finally understood the idea behind the XML-based approach. I think it
isn't suitable for me anyway, for two reasons:

   - DB support (MSSQL and Oracle or some Java ad-hoc solutions)
   - Speed of XPath-like queries on big datasets.

So I assume the variant you recommend suits me best.
However, it's hard to understand what xtf does just by opening its source
code, being a newbie in Lucene. But what has to be done has to be done;
no one will do my job for me anyway. :))
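
To check my understanding of the scheme quoted below, here is a minimal
sketch. It is plain Python that models the index as a list, with the list
position playing the role of the Lucene document number; it is not xtf's
code and uses no Lucene API. All function names and the "representative"
flag are hypothetical, and requiring every query word to match is a
simplification.

```python
def add_structured_document(index, key, sections, metadata):
    """Append one entry per section, then one terminal 'representative'
    entry holding the metadata of the whole structured document."""
    for section_type, text in sections:
        index.append({"sectionType": section_type, "text": text,
                      "documentKey": key})
    index.append({"representative": True, "documentKey": key, **metadata})


def search_sections(index, section_type, words):
    """Return document numbers of sections of the given type whose text
    contains every query word."""
    wanted = [w.lower() for w in words]
    hits = []
    for doc_no, doc in enumerate(index):
        if doc.get("representative"):
            continue  # representatives carry metadata, not section text
        tokens = doc["text"].lower().split()
        if doc["sectionType"] == section_type and all(w in tokens for w in wanted):
            hits.append(doc_no)
    return hits


def representative_of(index, doc_no):
    """Increment the document number until the representative is reached."""
    while not index[doc_no].get("representative"):
        doc_no += 1
    return doc_no


def search_documents(index, section_type, words):
    """Collapse section hits into one representative per structured document."""
    seen = []
    for hit in search_sections(index, section_type, words):
        rep = representative_of(index, hit)
        if rep not in seen:
            seen.append(rep)
    return [index[rep] for rep in seen]
```

This only works if the sections of one structured document and its
representative are added contiguously and in that order. The stored
"documentKey" field is the alternative: with it, the representative could be
looked up by key instead of by walking document numbers.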

I'll try to make some time to dig into the xtf code. If something is unclear
or questionable, I assume the xtf mailing list would be the right place to
ask, rather than this particular one (java-lucene-user)?

Thanks a lot for pointing out possible directions and solutions. I really
appreciate your help and the time you spent providing such helpful
descriptions. God bless the OSS community!

On Tue, Sep 9, 2008 at 12:26 AM, Karsten F.
<karsten-lucene@fiz-technik.de> wrote:

>
> Hi Leonid,
>
> do you really need the "Complex scenario"?
> what kind of query is your use case?
>
> If you really need xpath please look for xml-Databases.
>
> Otherwise you can possibly use xtf out of the box, because "indexing of
> large structured documents" is exactly the use case for which xtf was
> developed (TEI documents, but html is less complex than TEI).
> Again the main idea:
> 1. Use xml-Elements (with their descendants) to divide the structured
> document into sections.
> 2. index each section as a lucene document (field "text") with an extra
> field "section type"
> 3. after all sections of one structured document, insert one (terminal)
> lucene document with the other metadata of the structured document (e.g.
> creation date, title, ..)
>
> the document from point 3 is the representative of the structured document
> (and the representative is the unit of retrieval, because the user searches
> for a document, not for a section)
> If you search e.g. for
> sectionType:table text:words inside section
> you have hits with the lucene documents belonging to the sections.
>
> Possibly for your use case it would be enough to insert a stored lucene
> field "document key".
> In xtf the lucene document-number of each hit is incremented until the
> representative is reached.
>
> This is a rough description, but source code of xtf is very readable.
>
> best regards
>
>  Karsten
>
>
>
> leonardinius wrote:
> >
> > Hi all,
> > Thanks a lot for such a quick reply.
> >
> > Both scenarios sound very good to me. I would like to do my best and try
> > to implement one of them (as a proof of concept) and then incrementally
> > improve, retest, investigate and rewrite :)
> >
> > So, from the soap opera to the question part then:
> >
> >    - How to implement those things (a and b) on the Lucene and Lucene
> >    contribs codebase?
> >       - I looked at
> >       http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
> >       and didn't like it (too big, too heavy, a ready-to-use solution
> >       instead of a toolkit). And I didn't understand how to implement
> >       the "Normal scenario" on top of that.
> >    - Any suggestions on how I could begin implementing these things,
> >    gently moving from the "Normal" scenario to the more advanced
> >    "Complex" one? What should I be afraid of, and what are the possible
> >    impacts, if any?
> >
> > Has anybody tried to use Lucene to analyse things like that? What would
> > be possible solutions to store indexed data and perform queries on that?
> > If Lucene isn't the right tool for this job, maybe some other toolkit
> > would be more useful (possibly on top of Lucene)?
> >
> > Thanks in advance for any suggestions and comments. I would appreciate
> > any ideas and directions to look into.
> >
> >
> > On Tue, Sep 2, 2008 at 11:46 AM, Karsten F.
> > <karsten-lucene@fiz-technik.de> wrote:
> >
> >> Take a look at the xml-aware search in xtf
> >> (http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7).
> >> The idea is to use one lucene-document for each section with only two
> >> fields: "text" and "sectionType".
> >> Then collect all hits belonging to one hierarchical unit (e.g. one
> >> html-File) and compress them into one representative hit in lucene.
> >>
> >> Best regards
> >>  Karsten
> >>
> >
>


-- 
Best regards,
Leonid Maslov!
Personal blog: http://leonardinius.blogspot.com/

Random thought:
Marcel Marceau  - "Never get a mime talking. He won't stop."
