lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
Date Sun, 07 Mar 2010 23:00:28 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2302:
----------------------------------

    Description: 
For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only
supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly
on the byte[] array.
Also, TermAttribute lacks some interfaces that would make it simpler for users to work
with it: Appendable and CharSequence.

I propose to create a new interface "CharTermAttribute" with a clean new API that concentrates
on CharSequence and Appendable.
The implementation class will simply support both the old and the new interface, working on the same
term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of this. So if somebody adds a TermAttribute,
they will get an implementation class that can also be used as CharTermAttribute. As both attributes
create the same impl instance, both calls to addAttribute are equivalent. So a TokenFilter that
adds CharTermAttribute to the source will work with the same instance as the Tokenizer that
requested the (deprecated) TermAttribute.
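
The shared-instance idea can be sketched in plain Java without any Lucene classes. Everything below is an assumption made for illustration: LegacyTermAttribute, CharTermAttributeSketch, TermImpl and AttributeSourceSketch are hypothetical stand-ins, not the API this issue will actually deliver. The sketch only shows one implementation class backing both interfaces, and a source that hands out the same instance for either request.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedTermBufferSketch {

    /** Hypothetical, simplified stand-in for the old char[]-oriented interface. */
    interface LegacyTermAttribute {
        void setTermBuffer(char[] buffer, int offset, int length);
        String term();
    }

    /** Hypothetical stand-in for the proposed CharSequence/Appendable-based interface. */
    interface CharTermAttributeSketch extends CharSequence, Appendable {
        CharTermAttributeSketch setEmpty();
        // Redeclared without IOException so callers can chain appends freely.
        @Override CharTermAttributeSketch append(CharSequence csq);
        @Override CharTermAttributeSketch append(CharSequence csq, int start, int end);
        @Override CharTermAttributeSketch append(char c);
    }

    /** One class backs both interfaces, so both views share a single term buffer. */
    static final class TermImpl implements LegacyTermAttribute, CharTermAttributeSketch {
        private final StringBuilder buffer = new StringBuilder();

        @Override public void setTermBuffer(char[] chars, int offset, int length) {
            buffer.setLength(0);
            buffer.append(chars, offset, length);
        }
        @Override public String term() { return buffer.toString(); }

        @Override public CharTermAttributeSketch setEmpty() { buffer.setLength(0); return this; }
        @Override public int length() { return buffer.length(); }
        @Override public char charAt(int index) { return buffer.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) {
            return buffer.subSequence(start, end);
        }
        @Override public String toString() { return buffer.toString(); }

        @Override public CharTermAttributeSketch append(CharSequence csq) {
            buffer.append(csq);
            return this;
        }
        @Override public CharTermAttributeSketch append(CharSequence csq, int start, int end) {
            buffer.append(csq, start, end);
            return this;
        }
        @Override public CharTermAttributeSketch append(char c) {
            buffer.append(c);
            return this;
        }
    }

    /** Toy attribute source: both term interfaces resolve to the same TermImpl. */
    static final class AttributeSourceSketch {
        private final Map<Class<?>, Object> attributes = new HashMap<>();

        <T> T addAttribute(Class<T> type) {
            if (type != LegacyTermAttribute.class && type != CharTermAttributeSketch.class) {
                throw new IllegalArgumentException("unsupported attribute: " + type);
            }
            Object attribute = attributes.get(type);
            if (attribute == null) {
                // Plays the role of the default factory: one impl, registered for both interfaces.
                TermImpl impl = new TermImpl();
                attributes.put(LegacyTermAttribute.class, impl);
                attributes.put(CharTermAttributeSketch.class, impl);
                attribute = impl;
            }
            return type.cast(attribute);
        }
    }

    public static void main(String[] args) {
        AttributeSourceSketch source = new AttributeSourceSketch();
        // A "Tokenizer" asks for the old interface, a "TokenFilter" for the new one.
        LegacyTermAttribute legacy = source.addAttribute(LegacyTermAttribute.class);
        CharTermAttributeSketch chars = source.addAttribute(CharTermAttributeSketch.class);

        legacy.setTermBuffer("lucene".toCharArray(), 0, 6);
        chars.append('-').append("flex");

        System.out.println(legacy.term());            // prints "lucene-flex"
        System.out.println((Object) legacy == chars); // prints "true"
    }
}
```

Because both lookups return the same object, text written through the old interface is immediately visible through the new one, which is the property the proposal relies on for mixing old Tokenizers with new TokenFilters.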

To also support byte[]-only terms, as Collation or NumericField need, a separate getter-only
interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute. The
default implementation class will also support this interface. For backwards compatibility
with old self-made TermAttribute implementations, the indexer will check with hasAttribute()
whether the BytesRef getter interface is there, and if not, it will wrap the old-style TermAttribute (a
deprecated wrapper class will be provided): new BytesRefGetterAttributeWrapper(TermAttribute),
which is then used by the indexer.
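
The fallback check can be illustrated with a small self-contained Java sketch. None of the names below are the real ones: BytesSlice, LegacyTermAttribute, BytesGetterAttribute, LegacyWrapper and Attributes are hypothetical stand-ins, and encoding the term as UTF-8 inside the wrapper is an assumption about what such a wrapper would do. Only the hasAttribute() decision follows the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class BytesRefFallbackSketch {

    /** Hypothetical reusable byte slice; a stand-in, not Lucene's BytesRef. */
    static final class BytesSlice {
        byte[] bytes = new byte[0];
        int length;
    }

    /** Hypothetical old-style attribute that only exposes characters. */
    interface LegacyTermAttribute {
        String term();
    }

    /** Hypothetical getter-only interface that the indexer consumes. */
    interface BytesGetterAttribute {
        BytesSlice getBytes();
    }

    /** Adapts an old-style attribute by encoding its term as UTF-8 on demand (assumed behavior). */
    static final class LegacyWrapper implements BytesGetterAttribute {
        private final LegacyTermAttribute delegate;
        private final BytesSlice reusable = new BytesSlice();

        LegacyWrapper(LegacyTermAttribute delegate) {
            this.delegate = Objects.requireNonNull(delegate, "no term attribute to wrap");
        }

        @Override public BytesSlice getBytes() {
            // The slice object is reused across calls; only its contents are replaced.
            byte[] utf8 = delegate.term().getBytes(StandardCharsets.UTF_8);
            reusable.bytes = utf8;
            reusable.length = utf8.length;
            return reusable;
        }
    }

    /** Toy attribute registry offering the hasAttribute() check the text mentions. */
    static final class Attributes {
        private final Map<Class<?>, Object> map = new HashMap<>();

        <T> void put(Class<T> type, T impl) { map.put(type, impl); }
        boolean hasAttribute(Class<?> type) { return map.containsKey(type); }
        <T> T getAttribute(Class<T> type) { return type.cast(map.get(type)); }
    }

    /** Indexer-side decision: use the bytes getter if present, else wrap the old attribute. */
    static BytesGetterAttribute resolveBytesGetter(Attributes source) {
        if (source.hasAttribute(BytesGetterAttribute.class)) {
            return source.getAttribute(BytesGetterAttribute.class);
        }
        return new LegacyWrapper(source.getAttribute(LegacyTermAttribute.class));
    }

    public static void main(String[] args) {
        // A self-made, old-style attribute that knows nothing about bytes.
        LegacyTermAttribute custom = () -> "abc";
        Attributes oldStyle = new Attributes();
        oldStyle.put(LegacyTermAttribute.class, custom);

        BytesGetterAttribute getter = resolveBytesGetter(oldStyle);
        System.out.println(getter instanceof LegacyWrapper); // prints "true"
        System.out.println(getter.getBytes().length);        // prints "3"
    }
}
```

A source that already provides the bytes getter is returned unwrapped, so byte[]-only producers never pay for a char[] round trip in this sketch.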

  was:
For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only
supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly
on the byte[] array.
Also, TermAttribute lacks some interfaces that would make it simpler for users to work
with it: Appendable and CharSequence.

I propose to create a new interface "ExtendedTermAttribute extends TermAttribute". The corresponding
-Impl class is always an implementation that extends ExtendedTermAttribute. So if somebody
adds a TermAttribute to an AttributeSource, they will get an implementation class that can also be
used as TermAttribute2. As both attributes create the same impl instance, both calls to addAttribute
are equivalent. So a TokenFilter that adds ExtendedTermAttribute to the source will work with the
same instance as the Tokenizer that requested the (deprecated) TermAttribute.

To support both byte[] and char[], the internals will be implemented like Token in 2.9: support
for String and char[]. So both buffers are available, but you can only use one of them.
As soon as you call getByteBuffer() while the char[] buffer is in use, it will be transformed.
So the indexer will always call getBytes() and get the UTF-8 bytes. NumericTokenStream will
modify the byte[] directly, and if no filter that uses char[] is plugged on top, the buffer
is never transformed.

This issue will also convert the rest of NRQ to byte[] and deprecate all old methods in NumericUtils.
NRQ will directly request BytesRef from splitRange and so on.


> Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence,
Appendable)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2302
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2302
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: Flex Branch
>            Reporter: Uwe Schindler
>             Fix For: Flex Branch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


