lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <>
Subject Re: Passing XML objects to the analyzer ?
Date Wed, 20 Apr 2005 19:50:46 GMT
I don't agree with this if the query is expected to contain the same 
"text-encoding" as the content being analyzed.

So one example would be matching
     f is continuous since it is the product of g and x |-> x^2
(in "email notation", we work with a semantic encoding)
This combination of text and formula (which we represent as OpenMath) 
would be both something that could be queried as a phrase and something 
that would be a content.
Of course, you're left with the query input which is a separate problem 
closer to the controls for content authoring anyways.

One natural effect of XML is the mark of a region as being a reference 
to a named entity such as could be done in HTML with
  <a href="def_continuous">continuous function</a>
Here one would like to match both the "symbol" continuous (that is the 
reference def_continuous as well as the words continuous function.
What I could read thus far about position-increment to offer 
alternatives seems to be limited to one word.


Le 20 avr. 05, à 15:32, Vanlerberghe, Luc a écrit :

> The problem with this approach is that the Analyser you will use for 
> indexing will be *very* different from the one used for searching.
> The way I see it, the Document objects pqssed to Lucene should contain 
> fields that are as much text based as possible, comparable to what a 
> user would type while searching.   It's the task of the Analyzer then 
> to break the text up in terms, remove capitals, etc, etc... This 
> should be kept as similar as possible for indexing and searching.
> IMHO, only fields that are not Tokenized (like dates or keywords) or 
> fields that are UnIndexed should contain 'raw' data.
> Luc
> -----Original Message-----
> From: Paul Libbrecht []
> Sent: Tuesday, April 19, 2005 11:44 PM
> To:
> Subject: Re: Passing XML objects to the analyzer ?
> Le 19 avr. 05, à 22:50, Erik Hatcher a écrit :
>> The only catch that I know if is that an Analyzer is invoked on a
>> per-field basis.  I can't tell exactly what you have in mind, but a
>> Lucene Analyzer cannot split data into separate fields itself - it has
>> to have been split prior.
> That's an easy one... ok, yes, I was clearly aware of this.
>> I'm indexing a lot of XML myself, with JDOM in the middle, and using
>> XPath to extract data per field before building the Document.
> So wouldn't Field.Unstored(Object) actually make sense ?
> That object, instead of being a reader, would be passed around till 
> the analyzer call which would then decide to accept, say, JDOM 
> objects...
> paul
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message