lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Document aware analyzers was Re: deprecating Versions
Date Wed, 01 Dec 2010 19:40:34 GMT
On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gsingers@apache.org> wrote:
>
> Nah, I just meant analysis would often benefit from having knowledge of the document
> as a whole instead of just the individual field.
>

And analysis would suffer from this too, because right now these
things are independent and we have a fast, simple, reusable model.
I'd prefer to keep the TokenStream analysis API... but as we have
discussed on the list, it would be nice to minimize the interface
between analysis components and indexer/queryparser so you can use an
*alternative* API... we are working in this direction already.

>>
>> Maybe if you give a concrete example then I would have a better
>> understanding of the problem you think this might solve.
>
> Let me see if I can put some flesh on the bones.  I'm assuming the raw document has
> already been parsed and that we are still basically dealing with strings and that we have
> a document which contains one or more fields.
>
> If we step back and look at our analysis process, there are some things that are easy
> and some things that are hard that maybe shouldn't be because even though we talk like we
> are indexing and searching documents, we are really indexing and searching fields and everything
> is Field centric.  That works fine for the easy analysis things like tokenization, stemming,
> lowercasing, etc. when all the content is in one language.  It doesn't work well when you
> have multiple languages in a single document or if you want to do things like Tee/Sink or
> even something as simple as Solr's copy field semantics.

Well, I have trouble with a few of your examples: "want to use
Tee/Sink" doesn't work for me... it's a description of an XY problem to
me... I've never needed to use it, and it's rarely discussed on the
user list...

As far as working with a lot of languages, I understand this issue
much more... but I've never had much desire for this, especially
given the fact that "Query is a document too"... I'm personally not a
fan of language detection, and I don't think it belongs in our
analysis API: like encoding detection and other similar heuristics,
it's part of document parsing to me!
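
A minimal sketch of what doing language detection at document-parsing
time, before any TokenStream is involved, could look like. Everything
here is hypothetical and chosen only for illustration: the class name,
the toy stopword heuristic standing in for a real detector, and the
`_de`/`_en` field-name suffixes. It uses no Lucene classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class LanguageRouting {
    // Toy stopword lists: an assumed stand-in for a real language detector.
    private static final Set<String> GERMAN =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist"));
    private static final Set<String> ENGLISH =
            new HashSet<>(Arrays.asList("the", "and", "is", "of", "a"));

    // Guess a language code by counting stopword hits; defaults to "en" on ties.
    static String detectLanguage(String text) {
        int de = 0;
        int en = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (GERMAN.contains(word)) de++;
            if (ENGLISH.contains(word)) en++;
        }
        return de > en ? "de" : "en";
    }

    // Parsing step: rename each raw field to "<name>_<lang>" so that a plain
    // per-field analyzer can be selected later, with no document-aware analysis.
    static Map<String, String> routeFields(Map<String, String> rawFields) {
        Map<String, String> routed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : rawFields.entrySet()) {
            routed.put(e.getKey() + "_" + detectLanguage(e.getValue()), e.getValue());
        }
        return routed;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("title", "the cat and the dog");
        raw.put("body", "der Hund und die Katze");
        System.out.println(routeFields(raw).keySet());
    }
}
```

With fields routed this way, each suffixed field could be mapped to its
own analyzer (for example through Lucene's PerFieldAnalyzerWrapper), so
the analysis components themselves stay field-independent and reusable.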

As I said before, I think our TokenStream analysis API is already
quite complicated and I don't think we should make it more complicated
for these reasons (especially since these examples are quite vague and
I'm still not sure you cannot solve them more easily in another way).

If you want to use a more complicated analysis API that doesn't work
like TokenStreams but instead incorporates things that are document
parsing or whatever, I guess you should be able to do that. I'm not
sure Lucene should provide such an API, but we shouldn't force you to
use the TokenStreams API either.
