[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842509#action_12842509 ]
Michael McCandless commented on LUCENE-2302:
--------------------------------------------
I like that this change would mean the indexer has a single getBytes interface for getting the
term data.
It'd mean the UTF16->UTF8 encoding the indexer now does would move into CharTermAttribute, hidden
from the indexer.
So the indexer would only ever work with opaque byte[] data for terms.
And it'd mean others can make their own term attrs -- maybe my terms are shorts and I send
2 bytes to the indexer per term.
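
A rough sketch of that idea, just to make it concrete. All names here (TermBytes, CharTerm, ShortTerm, ToyIndexer) are made up for illustration and are not the Lucene API; a plain byte[] stands in for whatever the real getter would return:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical: the only view of a term the indexer would ever see.
interface TermBytes {
    byte[] getBytes();
}

// Char-based terms: the UTF-16 -> UTF-8 step lives inside the attribute,
// not in the indexer.
final class CharTerm implements TermBytes {
    private final StringBuilder buffer = new StringBuilder();

    CharTerm append(CharSequence text) {
        buffer.append(text);
        return this;
    }

    @Override
    public byte[] getBytes() {
        return buffer.toString().getBytes(StandardCharsets.UTF_8);
    }
}

// A custom term type: each term is a short, sent as 2 big-endian bytes.
final class ShortTerm implements TermBytes {
    private short value;

    ShortTerm set(short newValue) {
        value = newValue;
        return this;
    }

    @Override
    public byte[] getBytes() {
        return new byte[] { (byte) (value >>> 8), (byte) value };
    }
}

// The indexer side never needs to know which kind of term it was given.
final class ToyIndexer {
    static int indexedLength(TermBytes term) {
        return term.getBytes().length;
    }
}
```

With this shape, ToyIndexer.indexedLength(new ShortTerm().set((short) 7)) and ToyIndexer.indexedLength(new CharTerm().append("foo")) go through exactly the same code path; only the attribute knows how the bytes were produced.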
> Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
> --------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2302
> URL: https://issues.apache.org/jira/browse/LUCENE-2302
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: Flex Branch
> Reporter: Uwe Schindler
> Fix For: Flex Branch
>
>
> For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute
> only supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work
> directly on the byte[] array.
> Also, TermAttribute lacks some interfaces that would make it simpler for users to
> work with: Appendable and CharSequence.
> I propose to create a new interface "CharTermAttribute" with a clean new API that concentrates
> on CharSequence and Appendable.
> The implementation class will simply support both the old and the new interface, working on
> the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of this. So if somebody adds a
> TermAttribute, they will get an implementation class that can also be used as a CharTermAttribute.
> As both attributes create the same impl instance, both calls to addAttribute are equivalent. So
> a TokenFilter that adds CharTermAttribute to the source will work with the same instance as
> the Tokenizer that requested the (deprecated) TermAttribute.
> To also support byte[]-only terms, as Collation or NumericField need, a separate getter-only
> interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute. The
> default implementation class will also support this interface. For backwards compatibility
> with old self-made TermAttribute implementations, the indexer will check with hasAttribute()
> whether the BytesRef getter interface is there and, if not, will wrap the old-style
> TermAttribute (a deprecated wrapper class will be provided):
> new BytesRefGetterAttributeWrapper(TermAttribute), which is then used by the indexer.
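
The backwards-compatibility check described in the last paragraph could look roughly like the sketch below. It is a toy model under stated assumptions, not the Lucene code: ToyAttributeSource and ToyIndexerSetup are invented here, the interfaces are minimal stand-ins that only borrow the names from the description, and a plain byte[] is used in place of the reusable BytesRef:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy stand-ins that only borrow the names used in the issue description.
interface Attribute {}

interface TermAttribute extends Attribute {
    String term();
}

interface BytesRefGetterAttribute extends Attribute {
    byte[] getBytes(); // assumption: byte[] in place of a reusable BytesRef
}

// Hypothetical, minimal attribute registry keyed by interface.
final class ToyAttributeSource {
    private final Map<Class<? extends Attribute>, Attribute> attrs = new HashMap<>();

    <A extends Attribute> void put(Class<A> type, A impl) {
        attrs.put(type, impl);
    }

    boolean hasAttribute(Class<? extends Attribute> type) {
        return attrs.containsKey(type);
    }

    <A extends Attribute> A getAttribute(Class<A> type) {
        return type.cast(attrs.get(type));
    }
}

// Adapts an old-style, char-based term attribute to the bytes getter.
final class BytesRefGetterAttributeWrapper implements BytesRefGetterAttribute {
    private final TermAttribute delegate;

    BytesRefGetterAttributeWrapper(TermAttribute delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] getBytes() {
        return delegate.term().getBytes(StandardCharsets.UTF_8);
    }
}

final class ToyIndexerSetup {
    // Prefer the bytes getter if present; otherwise wrap the old attribute.
    static BytesRefGetterAttribute termBytesFor(ToyAttributeSource source) {
        if (source.hasAttribute(BytesRefGetterAttribute.class)) {
            return source.getAttribute(BytesRefGetterAttribute.class);
        }
        return new BytesRefGetterAttributeWrapper(source.getAttribute(TermAttribute.class));
    }
}
```

The point of the check is that the indexer decides once, up front, which getter to use; after that it only ever calls getBytes(), whether the stream uses a new-style attribute or an old self-made TermAttribute.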