Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of lists@nabble.com designates
 216.139.236.158 as permitted sender)
Message-ID: <19381593.post@talk.nabble.com>
Date: Mon, 8 Sep 2008 14:26:08 -0700 (PDT)
From: "Karsten F." <karsten-lucene@fiz-technik.de>
To: java-user@lucene.apache.org
Subject: Re: Newbie question: using Lucene to index hierarchical
 information.
In-Reply-To: <dcd377ca0809041230i5b1be63fl7c3b8e8118e2c9db@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
References: <dcd377ca0809010025s1281eff7r5716c444de2fb671@mail.gmail.com>
 <dcd377ca0809010655y2a9a1bc0i3fe9e2a81d34f1e8@mail.gmail.com>
 <19266355.post@talk.nabble.com>
 <dcd377ca0809041230i5b1be63fl7c3b8e8118e2c9db@mail.gmail.com>


Hi Leonid,

do you really need the "Complex scenario"?
what kind of query is your use case?

If you really need xpath please look for xml-Databases.

Otherwise you can possible use xtf out of the box, because "indexing of
large structured documents" is exactly the use case for which xtf was
developed (TEI documents, but html is less complex then TEI). 
Again the main idea:
1. Use xml-Elements (with its descendants) to divide the structured document
into sections.
2. index each section as lucene document (field "text") with an extra field
"section type"
3. after all sections of one structured document insert one (terminal)
lucene document with the other metadata of the structured document (e.g.
creation date, title, ..)

the document from point 3 is the representative of the structured document
(and the representative is the unit of retrieval, because the user search
for a document not for a section)
If you search e.g. for 
sectionType:table text:words inside section
you have hits with the lucene documents belonging to the sections.

Possible for your use case it would be enough to insert a stored lucene
field "document key".
In xtf the lucene document-number of each hit is incremented until the
representative is reached.

This is a rough description, but source code of xtf is very readable.

best regards

  Karsten


leonardinius wrote:
> 
> Hi all,
> Thanks a lot for such a quick reply.
> 
> Both scenario sounds very well for me. I would like to do my best and try
> to
> implement any of them (as the proof of the concept) and then incrementally
> improve, retest, investigate and rewrite then :)
> 
> So, from the soap opera to the question part then:
> 
>    - How to implement those things (a and b) on the Lucene and Lucene
>    contribs codebase?
>       - I looked at the
>      
> http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
> and
>       didn't like that (too big, to heavy, ready-to use solution instead
> of
>       toolkit). And I didn't understood how to implement "Normal
> scenario" on top
>       of that?
>    - Any suggestions how could I begin implementing these things? Gently
>    moving from "Normal" scenario to some more advanced "Complex"? What
> should I
>    afraid off and possible impacts if any?
> 
> Have anybody tried to use Lucene to analyse things like that? What would
> be
> possible solutions to store indexed data and perform queries on that? If
> Lucene isn't the right tool for this job, maybe some other toolkit would
> more useful(possibly on top of the Lucene)
> 
> Thanks in advance for any suggestions and comments. I would appreciate any
> ideas and directions to look into.
> 
> 
> On Tue, Sep 2, 2008 at 11:46 AM, Karsten F.
> <karsten-lucene@fiz-technik.de>wrote:
> 
>> Take a look to the xml-aware search in xtf (
>>
>> http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
>> ).
>> The idea is to use one lucene-document for each section with only two
>> fields: "text" and "sectionType".
>> But to collect all hits belonging to one hierarchical information (e.g.
>> one
>> html-File) and compress this to one representative hit in lucene.
>>
>> Best regards
>>  Karsten
>>
> 

-- 
View this message in context: http://www.nabble.com/Newbie-question%3A-using-Lucene-to-index-hierarchical-information.-tp19250038p19381593.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org