lucene-java-user mailing list archives

From "Marcelo Ochoa" <marcelo.oc...@gmail.com>
Subject Re: Newbie question: using Lucene to index hierarchical information.
Date Wed, 10 Sep 2008 11:21:51 GMT
Hi Leonid
   If you are not familiar with Oracle XMLDB schema mappings, here is an
example of how to store Wikipedia XML dumps in an Oracle database
using an XML-to-relational model:
http://marceloochoa.blogspot.com/2007/12/uploading-wikipedia-dumps-to-oracle.html
   The structure of Wikipedia dumps seems to be similar to your data
model, so if you are using Oracle you can use this example as a jump
start for efficiently mapping XML inside Oracle.
   There is also an example of how to index it with Lucene running as
a new Domain Index for Oracle databases, to get the best of both
worlds :) Lucene for efficient free-text searching, and the
relational DB for quickly sorting and filtering relational data.
   Best regards, Marcelo.
On Mon, Sep 1, 2008 at 4:25 AM, Leonid Maslov <leonidms@gmail.com> wrote:
> Hi all,
>
> First of all, sorry for my poor English. It's not my native language.
>
> I'm trying to use Lucene to index a hierarchical kind of information: I have
> structured html and pdf/word documents and I want to index them in a way that
> lets me search in titles, text, paragraphs or tables only, or any
> combination of the items mentioned above. At the moment I see 3 possible
> solutions:
>
>   - Create the set of all possible fields, like: contents, title, heading,
>   table etc., and index the data in all of them accordingly. Possible impacts:
>      - a large number of fields
>      - data duplication (because a search in paragraphs must also look
>      inside all the inner elements, every outer element indexed will
>      contain all the inner element content as well)
>   - Create a hierarchy of fields, like "title", "paragraph/title",
>   "paragraph/title/subparagraph/table". Possible impacts:
>      - count of fields remains the same
>      - soft set of fields (not consistent)
>      - I'm not sure how I could process the required information and
>      perform search.
>      - Performance issues?
>   - Use one field for content and just add a location prefix to the content.
>   For example "contents:paragraph/heading:token1 token2". Here
>   "paragraph/heading:" is used as an additional information prefix. So I
>   (possibly?) could reuse PrefixQuery functionality or something similar. Impacts:
>      - Strong (small) set of index fields
>      - Additional information processing: all the queries I'll use will
>      have to work as PrefixQuery
>      - Performance issues?
>
>
> So, has anyone tried to make things work like that? Or am I trying to use
> a wrench to hammer in nails? I assume Lucene wasn't designed to be used like
> that, but it's worth trying (at least asking).
> Any results / suggestions are welcome!
>
> --
> Best regards,
> Leonid Maslov!
> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
>
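
To make the third option in the quoted message more concrete, here is a minimal sketch in plain Python. It is not Lucene code and does not use PrefixQuery; the class `PathIndex` and its methods are hypothetical names chosen only for illustration. It assumes each token is stored once, together with the path of the element it appeared in, so a search in an outer element matches every inner element by path prefix without duplicating content.

```python
from collections import defaultdict


class PathIndex:
    """Toy inverted index keyed by (element path, token). Illustration only."""

    def __init__(self):
        # (path, token) -> set of document ids containing that token at that path
        self.postings = defaultdict(set)

    def add(self, doc_id, path, text):
        """Index each whitespace-separated token of `text` under `path`."""
        for token in text.lower().split():
            self.postings[(path, token)].add(doc_id)

    def search(self, path_prefix, token):
        """Return ids of documents with `token` at `path_prefix` or below it.

        An empty prefix matches every path.
        """
        token = token.lower()
        hits = set()
        for (path, term), docs in self.postings.items():
            if term != token:
                continue
            # Match the element itself or any element nested inside it.
            if (not path_prefix
                    or path == path_prefix
                    or path.startswith(path_prefix + "/")):
                hits |= docs
        return hits


index = PathIndex()
index.add(1, "paragraph/heading", "Lucene basics")
index.add(1, "paragraph/table", "field count")
index.add(2, "title", "Lucene in Oracle")

# Search inside paragraphs, including all inner elements.
print(index.search("paragraph", "lucene"))
# Search everywhere.
print(index.search("", "lucene"))
```

The linear scan over all postings is only for brevity. A real index would keep terms sorted so that a prefix lookup touches only the matching range, which is the cost the "Performance issues?" bullets above are asking about.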



-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/


