lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From: Grant Ingersoll <gsing...@apache.org>
Subject: Re: Document aware analyzers was Re: deprecating Versions
Date: Wed, 01 Dec 2010 19:25:55 GMT

On Dec 1, 2010, at 8:07 AM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 8:01 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> 
>> While we are at it, how about we make the Analysis process document aware instead
>> of Field aware?  The PerFieldAnalyzerWrapper, while doing exactly what it says it
>> does, is just silly.  If you had an analysis process that was aware, if it chooses
>> to be, of the document as a whole, then you open up a whole lot more opportunity
>> for doing interesting analysis while losing nothing towards the individual
>> treatment of fields.  The TeeSink stuff is an attempt at this, but it is not
>> sufficient.
>> 
> 
> I'm not sure I like this: traditionally we let the user application
> deal with "document parsing" (how do you take your content and define
> it as documents/fields).

Nah, I just meant analysis would often benefit from having knowledge of the document as a
whole instead of just the individual field.  

> 
> If we want to change lucene to start dealing with this "document
> parsing" aspect, thats pretty scary in itself, but in my opinion the
> very last choice of where we would want to add something like that is
> analysis! So personally I really like analysis being separate from
> document parsing: our analysis API is already way too complicated.

Yes, I agree.


> 
> Maybe if you give a concrete example then I would have a better
> understanding of the problem you think this might solve.

Let me see if I can put some flesh on the bones.  I'm assuming the raw document has already
been parsed, that we are still basically dealing with strings, and that we have a document
which contains one or more fields.

If we step back and look at our analysis process, there are some things that are easy and
some things that are hard that maybe shouldn't be, because even though we talk as if we are
indexing and searching documents, we are really indexing and searching fields: everything
is Field centric.  That works fine for the easy analysis things like tokenization, stemming,
lowercasing, etc., when all the content is in one language.  It doesn't work well when you
have multiple languages in a single document, or when you want to do things like Tee/Sink, or
even something as simple as Solr's copyField semantics.  The fact that we have PerFieldAnalyzerWrapper
is a symptom of this.  The clunkiness of TeeSinkTokenFilter is another.  Handling
automatic language identification is another.

The end result of all of these things is that you often have to do analysis work twice (or
more) on the same piece of content.  I believe an analysis process that knew a document had
multiple fields (which seems like a given) could be more efficient, because repeated analysis
work could be shared, and because work that inherently crosses multiple fields on the same
document, or selects a particular field out of a choice of several, could be handled more
cleanly.
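
To make the symptom concrete, here is roughly what the field-centric workaround looks like
with the current 3.x API (the field names here are invented for illustration):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class FieldCentricExample {
  public static void main(String[] args) {
    // One Analyzer per field: does exactly what it says, but each field's
    // text is analyzed in complete isolation, so nothing can be shared
    // across the fields of one document.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
    wrapper.addAnalyzer("id", new KeywordAnalyzer());
    wrapper.addAnalyzer("title", new StandardAnalyzer(Version.LUCENE_30));
    // A Solr copyField from "title" to, say, "title_exact" means the same
    // string is pushed through two whole analysis chains, with no reuse.
  }
}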

So you, as the developer, would still need to define what your fields are and what analysis
you want done for each of those fields, but we, as Lucene developers, might be able to make
things more efficient if we can recognize commonalities, etc., as well as offer users tools
that make it easy to work across fields.
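
Just to hand-wave at what "document aware" might mean, here is a strawman (to be clear,
neither this class nor seeDocument() exists in Lucene, and this is not a proposal):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;

// Purely hypothetical sketch.  The idea: hand the analyzer the whole
// Document up front, so it can make cross-field decisions (detect the
// language once, share one tokenization pass, route tokens between
// fields) before any per-field TokenStream is requested.
public abstract class DocumentAwareAnalyzer extends Analyzer {

  /** Called once per Document, before tokenStream() is asked for any field. */
  public abstract void seeDocument(Document doc);

  // Implementations still define per-field analysis via the inherited
  // tokenStream(String fieldName, Reader reader), but could now reuse
  // whatever was computed in seeDocument().
}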

At any rate, this is all just food for thought.  I don't have any proposed API changes at
this point.
