lucene-java-user mailing list archives

From "Karsten F." <karsten-luc...@fiz-technik.de>
Subject Re: Newbie question: using Lucene to index hierarchical information.
Date Tue, 02 Sep 2008 08:46:36 GMT

Hi Leonid,

What kind of query does your use case need?

Complex scenario:
You need all the hierarchical structure information in one query. This means
you want to search with XPath in a real XML database (like: all documents
with a subtitle XY that is directly followed by a table with the same
column like ...)

Normal scenario:
You want to search in only one part of your hierarchical information at a
time, like 'documents with word xy in title' or 'documents with word xy in
table'.

I am not familiar with using Lucene in XML databases, but I can advise on
the "normal scenario":

Take a look at the XML-aware search in XTF (
http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7
).
The idea is to use one Lucene document for each section, with only two
fields: "text" and "sectionType".
Then collect all hits belonging to one hierarchical unit (e.g. one
HTML file) and compress them into one representative hit in Lucene.
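
To make the idea concrete, here is a minimal sketch in plain Java, without
any Lucene calls, so it says nothing about how XTF or Lucene actually
implement this. Each Section stands in for one Lucene document with the
fields "text" and "sectionType". The extra "sourceFile" key used for
grouping, the class and method names, and the sample data are all
assumptions made for illustration, and the whitespace split is only a
stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SectionSearchSketch {

    /** One indexed unit per section, standing in for one Lucene document. */
    static final class Section {
        final String sourceFile;  // assumed grouping key, e.g. the html file
        final String sectionType; // e.g. "title" or "table"
        final String text;        // the text of this section only

        Section(String sourceFile, String sectionType, String text) {
            this.sourceFile = sourceFile;
            this.sectionType = sectionType;
            this.text = text;
        }
    }

    /**
     * Finds sections of the given type that contain the word, then collapses
     * them to one entry per source file (value = number of matching sections).
     */
    static Map<String, Integer> search(List<Section> index, String sectionType, String word) {
        Map<String, Integer> hitsPerFile = new LinkedHashMap<>();
        String needle = word.toLowerCase(Locale.ROOT);
        for (Section section : index) {
            if (!section.sectionType.equals(sectionType)) {
                continue; // restrict the search to one part of the hierarchy
            }
            // Naive whitespace tokenization, a stand-in for a real analyzer.
            for (String token : section.text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.equals(needle)) {
                    hitsPerFile.merge(section.sourceFile, 1, Integer::sum);
                    break; // count each section at most once
                }
            }
        }
        return hitsPerFile;
    }

    /** Made-up sample data: two files, split into sections. */
    static List<Section> sampleIndex() {
        List<Section> index = new ArrayList<>();
        index.add(new Section("a.html", "title", "Lucene in practice"));
        index.add(new Section("a.html", "table", "lucene version table"));
        index.add(new Section("a.html", "title", "More Lucene notes"));
        index.add(new Section("b.html", "title", "Indexing basics"));
        index.add(new Section("b.html", "table", "Lucene field types"));
        return index;
    }

    public static void main(String[] args) {
        // Two matching title sections in a.html collapse into one hit.
        System.out.println(search(sampleIndex(), "title", "lucene")); // prints {a.html=2}
    }
}
```

Because each section is indexed on its own, the text is stored once per
section, and "word xy in table" becomes an ordinary filter on sectionType.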

Best regards
  Karsten


leonardinius wrote:
> 
> Any comments, suggestions? Maybe I should rephrase my original message or
> describe it in detail?
> I really would like to get any response if possible.
> 
> Thanks a lot in advance!
> 
> On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <leonidms@gmail.com> wrote:
> 
>> Hi all,
>>
>> First of all, sorry for my poor English. It's not my native language.
>>
>> I'm trying to use Lucene to index hierarchical kind of information: I
>> have structured html and pdf/word documents and I want to index them in
>> ways to perform search in titles, text, paragraphs or tables only, or any
>> combinations of items mentioned above. At the moment I see 3 possible
>> solutions:
>>
>>    - Create the set of all possible fields, like: contents, title,
>>    heading, table etc... And index the data in all of them accordingly.
>>    Possible impacts:
>>       - a big count of fields
>>       - data duplication (because I need to make search looking in the
>>       paragraphs to look inside all the inner elements, so every outer
>>       element indexed will contain all the inner element content as well)
>>    - Create the hierarchy of the fields, like "title", "paragraph/title",
>>    "paragraph/title/subparagraph/table". Possible impacts:
>>       - count of fields remains the same
>>       - soft set of fields (not consistent)
>>       - I'm not sure about the ways I could process required information
>>       and perform search.
>>       - Performance issues?
>>    - Use one field for content and just add location prefix to content.
>>    For example "contents:*paragraph/heading:*token1 token2".
>>    *paragraph/heading:* here is used as additional information prefix. So,
>>    I (possibly?) could reuse PrefixQuery functionality or smth. Impacts:
>>       - Strong set of index fields (small)
>>       - Additional information processing - all the queries I'll use will
>>       have to work as PrefixQuery
>>       - Performance issues?
>>
>>
>> So, has anyone tried to make things work like that? Or am I trying to
>> use a wrench to hammer in nails? I assume Lucene wasn't meant to be used
>> like that, but it's worth trying (at least asking).
>> Any results / suggestions are welcome!
>>
>> --
>> Bests regards,
>> Leonid Maslov!
>> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
>>
> 
> 
> 
> -- 
> Bests regards,
> Leonid Maslov!
> Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."
> 
> 
