lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Libbrecht <p...@activemath.org>
Subject Re: Passing XML objects to the analyzer ?
Date Wed, 20 Apr 2005 19:50:46 GMT
I don't agree with this if the query is expected to contain the same 
"text-encoding" as the content being analyzed.

So one example would be matching
     f is continuous since it is the product of g and x |-> x^2
(in "email notation", we work with a semantic encoding)
This combination of text and formula (which we represent as OpenMath) 
would be both something that could be queried as a phrase and something 
that would be a content.
Of course, you're left with the query input which is a separate problem 
closer to the controls for content authoring anyways.

One natural effect of XML is the mark of a region as being a reference 
to a named entity such as could be done in HTML with
  <a href="def_continuous">continuous function</a>
Here one would like to match both the "symbol" continuous (that is the 
reference def_continuous as well as the words continuous function.
What I could read thus far about position-increment to offer 
alternatives seems to be limited to one word.

paul


Le 20 avr. 05, à 15:32, Vanlerberghe, Luc a écrit :

> The problem with this approach is that the Analyser you will use for 
> indexing will be *very* different from the one used for searching.
>
> The way I see it, the Document objects pqssed to Lucene should contain 
> fields that are as much text based as possible, comparable to what a 
> user would type while searching.   It's the task of the Analyzer then 
> to break the text up in terms, remove capitals, etc, etc... This 
> should be kept as similar as possible for indexing and searching.
>
> IMHO, only fields that are not Tokenized (like dates or keywords) or 
> fields that are UnIndexed should contain 'raw' data.
>
> Luc
>
>
> -----Original Message-----
> From: Paul Libbrecht [mailto:paul@activemath.org]
> Sent: Tuesday, April 19, 2005 11:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Passing XML objects to the analyzer ?
>
>
> Le 19 avr. 05, à 22:50, Erik Hatcher a écrit :
>> The only catch that I know if is that an Analyzer is invoked on a
>> per-field basis.  I can't tell exactly what you have in mind, but a
>> Lucene Analyzer cannot split data into separate fields itself - it has
>> to have been split prior.
>
> That's an easy one... ok, yes, I was clearly aware of this.
>
>> I'm indexing a lot of XML myself, with JDOM in the middle, and using
>> XPath to extract data per field before building the Document.
>
> So wouldn't Field.Unstored(Object) actually make sense ?
> That object, instead of being a reader, would be passed around till 
> the analyzer call which would then decide to accept, say, JDOM 
> objects...
>
> paul
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message