lucene-java-user mailing list archives

From "Steven A Rowe" <sar...@syr.edu>
Subject RE: Specialized XML handling in Lucene
Date Tue, 11 Mar 2008 15:48:24 GMT
Hi Eran, see my comments below inline:

On 03/11/2008 at 9:23 AM, Eran Sevi wrote:
> I would like to ask for suggestions of the best design for
> the following scenario:
> 
> I have a very large number of XML files (around 1M).
> Each file contains several sections. Each section contains
> many elements (about 1000-5000).
> Each element has a value and some attributes describing the
> value (like metadata), for example:
> 
> <Section1>
>     <Element1  id="0"  type="A"  meta1="val11"
>                meta2="val21">value1</Element1>
>     <Element1  id="1"  type="B"  meta1="val12" 
>                meta2="val21">value2</Element1>
> ...
> </Section1>
> <Section2>
>     <Element2 id="0"  type="D"  meta1="val11"
>               meta3="val31">value3</Element2>
>     <Element2 id="1"  type="B"  meta1="val13"
>               meta3="val34">value1</Element2>
> ...
> </Section2>
> ...
> 
> As you can see, each attribute can have any value, and
> attribute names can be the same in different sections.
> 
> I would like to index the XML in such a way so I can perform
> queries like:
> 
> element1=value1 AND type=A AND meta2=val21
> 
> and also more complicated queries that include positions
> between elements, and even range queries on attribute values.
> 
> Indexing each element as a different document might not be
> possible because of the large number of documents it might
> create (more than 5 billion docs), and might also make it
> difficult to parse results - I still want to know how
> many original XML documents contain the searched terms.

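5 billion docs is more than a single Lucene index can hold, since document numbers are Java
ints (a ceiling of roughly 2.1 billion), but it should be workable if you split the data
across several indexes.  I think you should try doc = element and see how well it works.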

In order to know which original documents your hits come from, add an "xml_doc_id" field,
and collect the hits' xml_doc_id values in a set, then take the set's cardinality.
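
A minimal sketch of that counting step, in plain Java with no Lucene dependency. The Hit
class and the distinctXmlDocs method are hypothetical names used only for illustration; in
a real index the xml_doc_id would be read from a stored field on each matching document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class XmlDocCounter {
    // Hypothetical stand-in for one search hit (one indexed element).
    static final class Hit {
        final String xmlDocId;
        final String xpath;

        Hit(String xmlDocId, String xpath) {
            this.xmlDocId = xmlDocId;
            this.xpath = xpath;
        }
    }

    // Collect the distinct xml_doc_id values seen across all hits.
    static Set<String> distinctXmlDocs(List<Hit> hits) {
        Set<String> ids = new HashSet<>();
        for (Hit hit : hits) {
            ids.add(hit.xmlDocId);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("1", "/Section1/Element1[1]"),
                new Hit("1", "/Section1/Element1[2]"),
                new Hit("2", "/Section1/Element1[1]"));
        // The set's size is the number of original XML documents that matched.
        System.out.println(distinctXmlDocs(hits).size());
    }
}
```

The set grows with the number of distinct XML files that match, not with the number of
element-level hits, so it should stay manageable even for broad queries.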

> Indexing each attribute as a different field is also
> difficult because I then need the positional information
> of the found terms and check that they were all found in
> the same place (and thus "belong" to the same element).

You could use an XPath(-ish, depending on requirements) field that represents the element
location, e.g.:

<Section1>
  <Element1 id="0" type="A" meta1="val11" meta2="val21">value1</Element1>
  <Element1 id="1" type="B" meta1="val12" meta2="val21">value2</Element1>
  ...
</Section1>

==> 

Lucene Document field-name:value

 doc #1
       xml_doc_id:1
            xpath:/Section1/Element1[1]
               id:0
             type:A
            meta1:val11
            meta2:val21
            value:value1

 doc #2
       xml_doc_id:1
            xpath:/Section1/Element1[2]
               id:1
             type:B
            meta1:val12
            meta2:val21
            value:value2
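
The mapping above could be produced roughly as follows. This is only a sketch using the
JDK's DOM parser: the class and method names (ElementFlattener, flatten) are made up for
illustration, it assumes one section is passed in as a well-formed XML string, and it
returns plain field-name/value maps rather than Lucene Documents, so adding the fields to
an actual index is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ElementFlattener {
    // Turn each child element of one section into a field-name -> value map.
    static List<Map<String, String>> flatten(String xmlDocId, String sectionXml)
            throws Exception {
        Element section = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(sectionXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();

        List<Map<String, String>> docs = new ArrayList<>();
        // Per-tag-name counter, so positions follow XPath's 1-based name[n] convention.
        Map<String, Integer> seen = new HashMap<>();

        NodeList children = section.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes and comments
            }
            String name = node.getNodeName();
            int position = seen.merge(name, 1, Integer::sum);

            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("xml_doc_id", xmlDocId);
            fields.put("xpath", "/" + section.getNodeName() + "/" + name + "[" + position + "]");
            // Every attribute becomes its own field.
            NamedNodeMap attrs = node.getAttributes();
            for (int a = 0; a < attrs.getLength(); a++) {
                Node attr = attrs.item(a);
                fields.put(attr.getNodeName(), attr.getNodeValue());
            }
            fields.put("value", node.getTextContent());
            docs.add(fields);
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Section1>"
                + "<Element1 id=\"0\" type=\"A\" meta1=\"val11\" meta2=\"val21\">value1</Element1>"
                + "<Element1 id=\"1\" type=\"B\" meta1=\"val12\" meta2=\"val21\">value2</Element1>"
                + "</Section1>";
        for (Map<String, String> doc : flatten("1", xml)) {
            System.out.println(doc);
        }
    }
}
```

With one map per element, a query like value=value1 AND type=A AND meta2=val21 becomes a
plain conjunction over fields of the same document, so no positional check is needed to
confirm that the terms belong to the same element.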

Hope it helps,
Steve
