lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
Date Fri, 12 Mar 2010 08:29:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420
] 

Simon Willnauer commented on LUCENE-2309:
-----------------------------------------

IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither of them
needs to know about Analyzer or TokenStream at all, and neither needs the analyzer to iterate
over tokens. IndexWriter should instead implement an interface, or use a class, that is
called after each successful "incrementToken()", no matter how that increment is implemented.

I could imagine a really simple interface like
{code}

interface AttributeConsumer {
  
  void setAttributeSource(AttributeSource src);

  void next();

  void end();

}
{code}

IW would then pass itself, or an instance it uses (DocInverterPerField), to an API expecting
such a consumer, like:

{code}
field.consume(this);
{code}

or something similar. That way we have no dependency on whatever attribute producer is used.
The default implementation would certainly be based on an Analyzer / TokenStream, and alternatives
can be exposed via an expert API. Even backwards compatibility could be solved easily that way.
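
To make the idea concrete, here is a minimal, self-contained sketch of how such a consumer could be driven without any Analyzer or TokenStream. It is only an illustration under stated assumptions: SimpleAttributeSource is a hypothetical stand-in for Lucene's AttributeSource that carries just the term text, and WhitespaceField and CollectingConsumer are made-up names, not existing Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's AttributeSource: it holds only the
// current term text. The real class manages typed attributes.
class SimpleAttributeSource {
    private String term = "";
    String getTerm() { return term; }
    void setTerm(String term) { this.term = term; }
}

// The interface proposed above, adapted to the stand-in source type.
interface AttributeConsumer {
    void setAttributeSource(SimpleAttributeSource src);
    void next(); // called once after each successful increment
    void end();  // called once after the last token
}

// Hypothetical attribute producer: splits its value on whitespace.
// No Analyzer or TokenStream is involved.
class WhitespaceField {
    private final String value;

    WhitespaceField(String value) { this.value = value; }

    void consume(AttributeConsumer consumer) {
        SimpleAttributeSource src = new SimpleAttributeSource();
        consumer.setAttributeSource(src);
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty or blank value produces no tokens
            }
            src.setTerm(token);
            consumer.next();
        }
        consumer.end();
    }
}

// Plays the role of IndexWriter / DocInverterPerField: it only reads
// attributes and never sees how they were produced.
class CollectingConsumer implements AttributeConsumer {
    private SimpleAttributeSource src;
    private final List<String> terms = new ArrayList<>();
    private boolean ended = false;

    @Override public void setAttributeSource(SimpleAttributeSource src) { this.src = src; }
    @Override public void next() { terms.add(src.getTerm()); }
    @Override public void end() { ended = true; }

    List<String> terms() { return terms; }
    boolean isEnded() { return ended; }
}
```

Calling new WhitespaceField("hello lucene world").consume(consumer) on a CollectingConsumer would leave it holding the three terms in order, with end() having been called once. Swapping in a different producer would not require touching the consumer, which is the point of the decoupling.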

bq. Only tests would rely on the analyzers module. I think that's OK? core itself would have
no dependence.
+1, test dependencies should not block modularization; it's just a matter of configuring the
classpath, though!



> Fully decouple IndexWriter from analyzers
> -----------------------------------------
>
>                 Key: LUCENE-2309
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2309
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>
> IndexWriter only needs an AttributeSource to do indexing.
> Yet, today, it interacts with Field instances, holds a private
> analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
> wide variety of cases (the field is not analyzed; it is analyzed but it's
> a Reader or a String; it's pre-analyzed).
> I'd like to have IW only interact with attr sources that already
> arrived with the fields.  This would be a powerful decoupling -- it
> means others are free to make their own attr sources.
> They need not even use any of Lucene's analysis impls; eg they can
> integrate to other things like [OpenPipeline|http://www.openpipeline.org].
> Or make something completely custom.
> LUCENE-2302 is already a big step towards this: it makes IW agnostic
> about which attr is "the term", and only requires that it provide a
> BytesRef (for flex).
> Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
> FieldType knows the analyzer to use, then we could simply create a
> getAttrSource() method (say) on it and move all the logic IW has today
> onto there.  (We'd still need existing IW code for back-compat).


