Date: Wed, 10 Sep 2008 08:21:51 -0300
From: "Marcelo Ochoa"
To: java-user@lucene.apache.org
Subject: Re: Newbie question: using Lucene to index hierarchical information.

Hi Leonid,

If you are not familiar with Oracle XMLDB schema mappings, here is an
example of how to store Wikipedia XML dumps in an Oracle database using an
XML-to-relational model:

http://marceloochoa.blogspot.com/2007/12/uploading-wikipedia-dumps-to-oracle.html

The structure of the Wikipedia dumps seems to be similar to your data model,
so if you are using Oracle you can use this example as a jump start for
mapping XML efficiently inside Oracle.

There is also an example of how to index it with Lucene running as a new
Domain Index for Oracle databases, to get the best of both worlds :)
Lucene for efficient free-text searching, and the relational DB for quickly
sorting and filtering relational data.

Best regards, Marcelo.

On Mon, Sep 1, 2008 at 4:25 AM, Leonid Maslov wrote:
> Hi all,
>
> First of all, sorry for my poor English. It's not my native language.
>
> I'm trying to use Lucene to index hierarchical information: I have
> structured HTML and PDF/Word documents, and I want to index them so that I
> can search in titles, text, paragraphs or tables only, or in any
> combination of the above. At the moment I see 3 possible solutions:
>
>    - Create the set of all possible fields, like: contents, title, heading,
>    table etc., and index the data in all of them accordingly. Possible impacts:
>       - a large number of fields
>       - data duplication (because a search in paragraphs must also look
>       inside all the inner elements, every outer element indexed will
>       contain all the inner element content as well)
>    - Create a hierarchy of fields, like "title", "paragraph/title",
>    "paragraph/title/subparagraph/table". Possible impacts:
>       - count of fields remains the same
>       - soft set of fields (not consistent)
>       - I'm not sure how I could process the required information and
>       perform searches.
>       - Performance issues?
>    - Use one field for content and just add a location prefix to the content.
>    For example "contents:paragraph/heading:token1 token2", where
>    "paragraph/heading:" is used as an additional information prefix. So I
>    (possibly?) could reuse PrefixQuery functionality or something similar.
>    Impacts:
>       - Strong (small) set of index fields
>       - Additional information processing: all the queries I use will
>       have to work as PrefixQuery
>       - Performance issues?
>
>
> So, has anyone tried to make things work like that? Or am I trying to use
> a wrench to hammer in nails? I assume Lucene wasn't designed to be used like
> that, but it's worth trying (at least asking).
> Any results / suggestions are welcome!
>
> --
> Best regards,
> Leonid Maslov!
> Adrienne Gusoff - "Opportunity knocked. My doorman threw him out."
>

--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism?
Look @ DB Prism's Web Site http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java & Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/
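
A note on the third option in the quoted question (one content field, with a
location prefix attached to each token): the idea can be tried out without
Lucene at all. What follows is only a toy sketch in plain Python, not Lucene
code and not anything from the blog post above. The class name
PathPrefixedIndex, the "|" separator and the sample documents are all
hypothetical. One deliberate departure from the "paragraph/heading:token"
layout in the question: the token is placed first and the path second, on the
assumption that a prefix scan over "token|path" terms is then enough to find
a token in an element and in everything nested inside it.

```python
from collections import defaultdict

SEP = "|"  # hypothetical separator; tokens are assumed not to contain it


class PathPrefixedIndex:
    """Toy inverted index with a single field whose terms are 'token|path'."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, path, text):
        # Store every token together with the element path it occurred in.
        for token in text.lower().split():
            self.postings[token + SEP + path].add(doc_id)

    def search(self, token, scope=""):
        """Return sorted ids of documents with token in scope or nested below it."""
        prefix = token.lower() + SEP
        hits = set()
        for term, doc_ids in self.postings.items():
            if not term.startswith(prefix):
                continue
            path = term[len(prefix):]
            # Empty scope means "anywhere". Otherwise accept the element itself
            # or a descendant, so "paragraph" does not match "paragraphs".
            if scope == "" or path == scope or path.startswith(scope + "/"):
                hits |= doc_ids
        return sorted(hits)


index = PathPrefixedIndex()
index.add(1, "title", "Lucene basics")
index.add(1, "paragraph/heading", "Indexing hierarchical data")
index.add(2, "paragraph/table", "Lucene field counts")

print(index.search("lucene"))               # search everywhere
print(index.search("lucene", "paragraph"))  # paragraphs and nested elements
print(index.search("lucene", "title"))      # titles only
```

The sketch scans every term linearly, which is fine for a demonstration but
says nothing about the performance questions raised in the original post. A
real index keeps its terms sorted, so a prefix lookup would not need to visit
all of them.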