lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2309) Fully decouple IndexWriter from analyzers
Date Fri, 12 Mar 2010 13:26:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489 ]

Uwe Schindler edited comment on LUCENE-2309 at 3/12/10 1:25 PM:
----------------------------------------------------------------

bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at the current DocInverter code, it does not use a consumer-like API. The code just
has an add/accept method that accepts tokens. The idea is, as Simon proposed, to let the DocInverter
implement something like AttributeAcceptor. But we must still have the attribute API, and the
acceptor (DocInverter) must always "see" the same attribute instances. Otherwise, if the accept
method took an AttributeSource, much time would be spent calling getAttribute(...) for each
token.

The current TokenStream API could get a method that takes an AttributeAcceptor and simply runs a
while (incrementToken()) loop, calling accept() on the DocInverter (the AttributeAcceptor). Another
approach for users would be to not use the TokenStream API at all and simply call the accept()
method on the acceptor for each token.
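
To make the two approaches concrete, here is a minimal, self-contained sketch. It does not use
Lucene classes: TermAttr, ToyTokenStream, CollectingAcceptor, and the bind()/consume() methods
are hypothetical names invented for illustration, and the shape of AttributeAcceptor is only a
guess at what such an interface might look like. The point is that the acceptor receives the
attribute instance once and then reads it on every accept() call.

```java
import java.util.ArrayList;
import java.util.List;

// Entry point; shows both ways of feeding the acceptor.
class AcceptorSketch {
    // Approach 1: the token stream drives the acceptor.
    static List<String> viaTokenStream(String... words) {
        CollectingAcceptor acceptor = new CollectingAcceptor();
        new ToyTokenStream(words).consume(acceptor);
        return acceptor.terms;
    }

    // Approach 2: no token stream at all; the caller pushes tokens directly.
    static List<String> direct(String... words) {
        CollectingAcceptor acceptor = new CollectingAcceptor();
        TermAttr attr = new TermAttr();
        acceptor.bind(attr);          // attribute instance is shared once
        for (String w : words) {
            attr.term = w;            // overwrite the shared attribute
            acceptor.accept();        // acceptor reads it immediately
        }
        return acceptor.terms;
    }

    public static void main(String[] args) {
        System.out.println(viaTokenStream("quick", "brown", "fox"));
        System.out.println(direct("quick", "brown", "fox"));
    }
}

// Hypothetical stand-in for a term attribute (assumption, not Lucene's class).
final class TermAttr {
    String term;
}

// Hypothetical shape of the proposed acceptor interface (assumption).
interface AttributeAcceptor {
    void bind(TermAttr attr); // hand over the shared attribute instance once
    void accept();            // called once per token
}

// Toy stream that mimics the incrementToken() contract.
final class ToyTokenStream {
    private final String[] words;
    private int pos = 0;
    private final TermAttr termAttr = new TermAttr();

    ToyTokenStream(String... words) {
        this.words = words;
    }

    boolean incrementToken() {
        if (pos >= words.length) {
            return false;
        }
        termAttr.term = words[pos++];
        return true;
    }

    // The convenience method described above: a plain while loop.
    void consume(AttributeAcceptor acceptor) {
        acceptor.bind(termAttr);
        while (incrementToken()) {
            acceptor.accept();
        }
    }
}

// Stand-in for DocInverter: it just collects the terms it is given.
final class CollectingAcceptor implements AttributeAcceptor {
    final List<String> terms = new ArrayList<>();
    private TermAttr attr;

    @Override
    public void bind(TermAttr attr) {
        this.attr = attr;
    }

    @Override
    public void accept() {
        terms.add(attr.term);
    }
}
```

In both paths the acceptor never performs a per-token attribute lookup; it only reads the
instance it was bound to.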

But both approaches still have the problem with the shared attributes. If you want to "record"
tokens, you have to implement something like my Proxy attributes. Otherwise (as mentioned above),
most of the time would be spent in getAttribute() calls.
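
The recording problem can be illustrated with a small sketch. It is not the Proxy attribute
approach itself, only a demonstration of why shared, mutable attribute instances make naive
recording fail. SharedTerm and both method names are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SharedAttrSketch {
    // Hypothetical mutable attribute that is reused for every token.
    static final class SharedTerm {
        String value;
    }

    // Naive recording: keeps a reference to the shared instance per token.
    // Every recorded entry ends up showing the last token's value.
    static List<String> recordReferences(String... words) {
        SharedTerm shared = new SharedTerm();
        List<SharedTerm> recorded = new ArrayList<>();
        for (String w : words) {
            shared.value = w;
            recorded.add(shared);
        }
        List<String> out = new ArrayList<>();
        for (SharedTerm t : recorded) {
            out.add(t.value);
        }
        return out;
    }

    // Recording by copying the value out of the shared instance per token.
    static List<String> recordCopies(String... words) {
        SharedTerm shared = new SharedTerm();
        List<String> recorded = new ArrayList<>();
        for (String w : words) {
            shared.value = w;
            recorded.add(shared.value); // snapshot of the current state
        }
        return recorded;
    }

    public static void main(String[] args) {
        System.out.println(recordReferences("a", "b", "c")); // prints [c, c, c]
        System.out.println(recordCopies("a", "b", "c"));     // prints [a, b, c]
    }
}
```

Copying per token is the simplest workaround, but it adds work for every token, which is the
cost the shared-instance design is meant to avoid.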

  
> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (the field is not analyzed; it is analyzed but it's
> a Reader or a String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate with other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).

