lucene-java-user mailing list archives

From "Brandon Jockman" <brand...@isogen.com>
Subject Re: Search on XML files
Date Mon, 13 May 2002 15:31:28 GMT
Fanny,

The current implementation allows for searching on:

a.. the entire PCDATA content of an XML document.
b.. the PCDATA content within specific elements.
c.. processing instructions by name and content.
d.. attributes of elements by both name and value.
e.. elements/PIs with specific parent element types.
f.. elements/PIs at specific child locations within a parent element.
g.. elements/PIs with specific ancestor element types.
h.. elements/PIs with specifically ordered ancestor element types.

The original need we had for XML contextual searching was to find a specific
document that contained a particular element with particular content, in
particular relationships to other element types.

Currently, searching for a document based on the content of two separate
elements with a logical AND relationship is not supported. However, the OR
relationship should work just fine.

There is a stored field that contains all text content for the document, but
that probably isn't enough for what you need.

Each Lucene document created from the same XML document has a 'docid' field
identifying that XML document.


You have two real options:

1. Write a query parser that inherits from the Lucene one, detects the AND
relationship, and performs more than one search, grouping results based on
document id.

Searching for X and Y would become:
1. Search for X -> Hits_X
2. Search for Y -> Hits_Y
3. Merge Hits_X and Hits_Y based on docid.
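
The merge in step 3 can be sketched in plain Java. This is only an
illustration: it assumes each hit has already been reduced to the value of
its 'docid' field (how that value is read from a Lucene hit is left out), and
the class and method names are hypothetical, not part of Lucene or the ISOGEN
sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of Lucene or the ISOGEN sample.
final class DocIdMerge {

    // Returns the docids found in both hit lists, in Hits_X order, without repeats.
    static List<String> intersectByDocId(List<String> hitsX, List<String> hitsY) {
        Set<String> inY = new HashSet<>(hitsY);
        Set<String> merged = new LinkedHashSet<>();
        for (String docId : hitsX) {
            if (inY.contains(docId)) {
                merged.add(docId); // several element hits may share one docid
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> hitsX = Arrays.asList("doc1", "doc2", "doc3", "doc2");
        List<String> hitsY = Arrays.asList("doc3", "doc2", "doc9");
        System.out.println(intersectByDocId(hitsX, hitsY)); // prints [doc2, doc3]
    }
}
```

Keeping the order of Hits_X is an arbitrary choice here; a real merge would
probably want to combine the scores of the two hits in some way.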

-=-

2. Write a query parser that inherits from the Lucene one, detects that you
are searching for a document based on several elements, as opposed to a
single one, and converts the search from:

X AND Y

to:

(X AND docid:docidentifier) OR (Y AND docid:docidentifier)

...and then merges the results based on docid.
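
One way to read the final merge step, sketched in plain Java: the OR query
returns element-level hits, which are grouped by docid, and a document is kept
only if every required element type matched. It assumes each hit exposes its
'docid' and 'tagname' field values; the ElementHit class and the method name
are hypothetical stand-ins, not Lucene or ISOGEN API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of Lucene or the ISOGEN sample.
final class GroupHitsByDocId {

    // Stand-in for one element-level hit: its 'docid' and 'tagname' field values.
    static final class ElementHit {
        final String docId;
        final String tagName;

        ElementHit(String docId, String tagName) {
            this.docId = docId;
            this.tagName = tagName;
        }
    }

    // Groups hits by docid and keeps the docids for which every required tag matched.
    static List<String> docsMatchingAll(List<ElementHit> hits, Set<String> requiredTags) {
        Map<String, Set<String>> tagsByDoc = new LinkedHashMap<>();
        for (ElementHit hit : hits) {
            Set<String> tags = tagsByDoc.get(hit.docId);
            if (tags == null) {
                tags = new HashSet<>();
                tagsByDoc.put(hit.docId, tags);
            }
            tags.add(hit.tagName);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : tagsByDoc.entrySet()) {
            if (entry.getValue().containsAll(requiredTags)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

For X in 'country' and Y in 'date', requiredTags would hold "country" and
"date"; a document with a hit in only one of the two elements is dropped.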


You may also be able to leverage the search 'Filtering' mechanism, but I'm
not experienced with that...

<<<From FAQ>>>
16. What is filtering and how is it performed ?
Filtering means imposing an additional restriction on the hit list to
eliminate hits that otherwise would be included in the search results. There
are two ways to filter hits:

  a.. Search Query - in this approach, provide your custom filter object
when you call the search() method. This filter will be called exactly
once to evaluate every document that resulted in a non-zero score.
  b.. Selective Collection - in this approach you perform the regular search
and, when you get back the hit list, collect only those hits that match your
filtering criteria. In this approach, your filter is called only for hits
that are returned by the search method, which may be only a subset of the
non-zero matches (useful when evaluating your search filter is expensive).
<<< ... >>>
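
As a rough illustration of the "Selective Collection" approach, here is a
plain-Java sketch that post-filters a hit list using the date range and
subscription restriction from the original question. It assumes dates are
stored as yyyyMMdd strings (so string comparison orders them correctly) and
that each hit carries a news id; NewsHit and collect() are hypothetical names,
not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of Lucene.
final class SelectiveCollection {

    // Stand-in for one search hit: a news id and its date as a yyyyMMdd string.
    static final class NewsHit {
        final String newsId;
        final String date;

        NewsHit(String newsId, String date) {
            this.newsId = newsId;
            this.date = date;
        }
    }

    // Returns the ids of hits dated within [fromDate, toDate] that are subscribed.
    static List<String> collect(List<NewsHit> hits, String fromDate, String toDate,
                                Set<String> subscribedIds) {
        List<String> kept = new ArrayList<>();
        for (NewsHit hit : hits) {
            boolean inRange = hit.date.compareTo(fromDate) >= 0
                    && hit.date.compareTo(toDate) <= 0;
            if (inRange && subscribedIds.contains(hit.newsId)) {
                kept.add(hit.newsId);
            }
        }
        return kept;
    }
}
```

Because the check runs only on hits the search already returned, an expensive
test such as a subscription lookup is evaluated for fewer documents.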

> 1. What is the query string supposed to be if I want to get records which
> contain (Australia and 20020415) or (HongKong and 20020315)?

((Australia +tagname:country) AND (+tagname:date +20020415)) OR
((HongKong +tagname:country) AND (+tagname:date +20020315))

> 2. What is the query string supposed to be if I want to get records which
> contain (Australia and 20020415) and (not (HongKong and 20020315))?

((Australia +tagname:country) AND (+tagname:date +20020415)) AND NOT
((HongKong +tagname:country) AND (+tagname:date +20020315))

Both of these queries require the additional functionality outlined in
option 1 or 2 above, because each one ANDs the content of two separate
elements.
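
To make the document-level combination concrete, here is a small plain-Java
sketch that evaluates both questions over sets of docids. It assumes one
search per element has already been run and reduced to a set of 'docid'
values; the class and method names are hypothetical and the sample docids are
made up.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of Lucene or the ISOGEN sample.
final class DocIdSetAlgebra {

    // Docids present in both sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Docids present in either set.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Docids present in a but not in b.
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up docids returned by four single-element searches.
        Set<String> australia = new TreeSet<>(Arrays.asList("d1", "d2"));
        Set<String> date0415 = new TreeSet<>(Arrays.asList("d1", "d2", "d3"));
        Set<String> hongKong = new TreeSet<>(Arrays.asList("d2", "d4"));
        Set<String> date0315 = new TreeSet<>(Arrays.asList("d2", "d4"));

        Set<String> first = and(australia, date0415);
        Set<String> second = and(hongKong, date0315);
        System.out.println(or(first, second));     // question 1: prints [d1, d2, d4]
        System.out.println(andNot(first, second)); // question 2: prints [d1]
    }
}
```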


Regards,

-Brandon

Brandon Jockman
ISOGEN International, LLC.
brandonj@isogen.com



----- Original Message -----
From: "Fanny Yeung" <toffeem@hotmail.com>
To: <lucene-user@jakarta.apache.org>
Sent: Monday, May 13, 2002 7:48 AM
Subject: Search on XML files


> Hi,
>
> Does anyone know how to build a query for a multiple-field search on XML
> files in the sample provided by isogen? Is that supported?
>
> I would like to get all the results which contain the value 'Australia'
> in the tag 'country' AND the date '20020415' in the tag 'date'. I always
> get a hit count of 0. Is there a problem with my query string?
>
> +(Australia AND tagname:country) AND +(20020415 AND tagname:date)
>
> 1. What is the query string supposed to be if I want to get records which
> contain (Australia and 20020415) or (HongKong and 20020315)?
> 2. What is the query string supposed to be if I want to get records which
> contain (Australia and 20020415) and (not (HongKong and 20020315))?
>
> Since I am a newbie to Lucene, I am wondering whether I can use a filter to
> restrict the search results. In my case, I need to retrieve all the news
> between a date range (for example, 20020102 to 20020330). In addition, the
> result should contain only the news that has been subscribed to. Should I
> use a filter to filter out the unsubscribed news, or should I build a query
> string that includes only the subscribed news? Which approach is better in
> terms of performance?
>
> Thanks in advance.
>
>
> Fanny
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

