Message-ID: <19381593.post@talk.nabble.com>
Date: Mon, 8 Sep 2008 14:26:08 -0700 (PDT)
From: "Karsten F."
To: java-user@lucene.apache.org
Subject: Re: Newbie question: using Lucene to index hierarchical information.
References: <19266355.post@talk.nabble.com>

Hi Leonid,

do you really need the "Complex scenario"? What kind of query does your use case require?
If you really need XPath, please look at XML databases.
Otherwise you can possibly use xtf out of the box, because "indexing of large structured documents" is exactly the use case for which xtf was developed (TEI documents, but HTML is less complex than TEI).

Again, the main idea:
1. Use XML elements (with their descendants) to divide the structured document into sections.
2. Index each section as a Lucene document (field "text") with an extra field "section type".
3. After all sections of one structured document, insert one (terminal) Lucene document with the other metadata of the structured document (e.g. creation date, title, ...).

The document from point 3 is the representative of the structured document (and the representative is the unit of retrieval, because the user searches for a document, not for a section).

If you search e.g. for
  sectionType:table text:words inside section
you get hits for the Lucene documents belonging to the sections.
Possibly for your use case it would be enough to insert a stored Lucene field "document key".
In xtf the Lucene document number of each hit is incremented until the representative is reached.

This is a rough description, but the source code of xtf is very readable.

Best regards
Karsten


leonardinius wrote:
>
> Hi all,
> Thanks a lot for such a quick reply.
>
> Both scenarios sound very good to me. I would like to do my best and try to
> implement either of them (as a proof of concept) and then incrementally
> improve, retest, investigate and rewrite :)
>
> So, from the soap opera to the question part then:
>
> - How to implement those things (a and b) on the Lucene and Lucene
>   contribs codebase?
> - I looked at
>   http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
>   and didn't like it (too big, too heavy, a ready-to-use solution instead of
>   a toolkit). And I didn't understand how to implement the "Normal scenario"
>   on top of that.
> - Any suggestions on how I could begin implementing these things? Gently
>   moving from the "Normal" scenario to the more advanced "Complex" one? What
>   should I be afraid of, and what are the possible impacts, if any?
>
> Has anybody tried to use Lucene to analyse things like that? What would be
> possible solutions to store indexed data and perform queries on it? If
> Lucene isn't the right tool for this job, maybe some other toolkit would be
> more useful (possibly on top of Lucene).
>
> Thanks in advance for any suggestions and comments. I would appreciate any
> ideas and directions to look into.
>
>
> On Tue, Sep 2, 2008 at 11:46 AM, Karsten F.
> wrote:
>
>> Take a look at the xml-aware search in xtf (
>> http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
>> ).
>> The idea is to use one lucene-document for each section with only two
>> fields: "text" and "sectionType",
>> and then to collect all hits belonging to one piece of hierarchical
>> information (e.g. one html-File) and compress them into one representative
>> hit in lucene.
>>
>> Best regards
>> Karsten
>>
>

--
View this message in context: http://www.nabble.com/Newbie-question%3A-using-Lucene-to-index-hierarchical-information.-tp19250038p19381593.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
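
For illustration, the three indexing steps and the "increment the document number until the representative is reached" lookup described in the reply can be modelled without any Lucene dependency. The sketch below is only an in-memory stand-in under stated assumptions. It is not xtf's code and does not use the Lucene API. The class SectionIndexSketch, its method names, the field name "documentKey" and the plain substring matching are all hypothetical, and a position in the list plays the role of the Lucene document number.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model: each Map stands in for one Lucene document,
// and its index in the list stands in for the Lucene document number.
class SectionIndexSketch {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Steps 1 and 2: one entry per section, with fields "sectionType" and "text".
    void addSection(String sectionType, String text) {
        docs.add(Map.of("sectionType", sectionType, "text", text));
    }

    // Step 3: after all sections, one terminal entry carrying the metadata.
    void addRepresentative(String documentKey, String title) {
        docs.add(Map.of("documentKey", documentKey, "title", title));
    }

    // Match sections by type and by a (simplified) substring test on the text,
    // then advance the document number until a representative is reached.
    Set<String> search(String sectionType, String word) {
        Set<String> keys = new LinkedHashSet<>();
        for (int i = 0; i < docs.size(); i++) {
            Map<String, String> d = docs.get(i);
            if (!sectionType.equals(d.get("sectionType"))) continue;
            if (!d.get("text").toLowerCase().contains(word.toLowerCase())) continue;
            int rep = i;
            while (rep < docs.size() && !docs.get(rep).containsKey("documentKey")) rep++;
            // Several section hits of one document collapse into a single key.
            if (rep < docs.size()) keys.add(docs.get(rep).get("documentKey"));
        }
        return keys;
    }

    public static void main(String[] args) {
        SectionIndexSketch index = new SectionIndexSketch();
        index.addSection("paragraph", "some words here");
        index.addSection("table", "words inside section");
        index.addRepresentative("a.html", "Report A");
        System.out.println(index.search("table", "words"));
    }
}
```

The walk-forward step relies on sections and their representative being added contiguously and in order, which is the same ordering assumption the reply makes for the Lucene document numbers. With a stored "document key" field on every section document, as suggested in the reply, that walk would not be needed.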