lucene-dev mailing list archives

From "Paul Cowan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1813) Add option to ReverseStringFilter to mark reversed tokens
Date Sun, 16 Aug 2009 23:30:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743926#action_12743926 ]

Paul Cowan commented on LUCENE-1813:
------------------------------------

A very minor point, but would it make sense to choose a more suitable character? U+0001
is an assigned character with its own semantic meaning ("Start of Heading", the same as ASCII
character 0x01), which isn't really relevant to this use. It mightn't be a bad idea to (a) choose
a control character which makes sense in context, if there is one (I can't see one, myself),
(b) use a character from the private-use area (U+E000 to U+F8FF), or (c) my preferred option,
use the Unicode tag characters. The tag characters are designed for just such a purpose:
embedding contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed
by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in Unicode
at present (the language tag), which isn't suitable.

That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag
characters, say U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse'), and use that. This could
lead to a de facto standard for Lucene fields, where different variations of the same token
use different leading tag characters. Rather than everyone just picking a character
at random, this could lead to some sort of structure around similar situations (e.g. I could
envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of
the token, etc.).
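
To make the suggestion concrete, here is a rough sketch of what marking a reversed token
with U+E0052 might look like in plain Java. This is not the code in the attached patch and
does not use the ReverseStringFilter API; the class and method names (TagMarkerSketch,
markReversed, isMarkedReversed) are made up for illustration. One thing worth noting:
U+E0052 lies outside the Basic Multilingual Plane, so in Java it takes two chars (a
surrogate pair), unlike U+0001, which takes one.

```java
public class TagMarkerSketch {
    // U+E0052 TAG LATIN CAPITAL LETTER R, proposed here as the "reversed" marker.
    static final int REVERSE_TAG = 0xE0052;

    // Supplementary code point: Character.toChars yields a two-char surrogate pair.
    static final String REVERSE_MARKER = new String(Character.toChars(REVERSE_TAG));

    // Hypothetical helper: reverse the token and prepend the marker.
    // StringBuilder.reverse() keeps surrogate pairs inside the token intact.
    static String markReversed(String token) {
        return REVERSE_MARKER + new StringBuilder(token).reverse();
    }

    // Hypothetical helper: test by code point, not by char, since the marker
    // is not a single UTF-16 unit.
    static boolean isMarkedReversed(String term) {
        return !term.isEmpty() && term.codePointAt(0) == REVERSE_TAG;
    }

    public static void main(String[] args) {
        String marked = markReversed("lucene");
        // The marker is invisible in most fonts, so print the structure instead.
        System.out.println(marked.length());                          // 8 UTF-16 units
        System.out.println(marked.codePointCount(0, marked.length())); // 7 code points
        System.out.println(isMarkedReversed(marked));                  // true
        System.out.println(isMarkedReversed("lucene"));                // false
    }
}
```

The practical consequence is that any code which assumes the marker is a single char
(for example, stripping it with substring(1)) would need to skip two chars, or work in
code points, if a tag character were used.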

Sorry, I'm really anal about Unicode. Can't help it.

> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
>                 Key: LUCENE-1813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1813
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 2.9
>
>         Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark" reversed tokens
> with a special marker character (Unicode 0001). This is useful when indexing both straight
> and reversed tokens (e.g. to implement efficient leading wildcards search).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

