lucene-dev mailing list archives

From "Paul Cowan (JIRA)" <>
Subject [jira] Commented: (LUCENE-1813) Add option to ReverseStringFilter to mark reversed tokens
Date Sun, 16 Aug 2009 23:30:14 GMT


Paul Cowan commented on LUCENE-1813:

A very minor thing, but would it make sense to choose a more suitable character? U+0001
is an assigned character with its own semantic meaning ("Start of Heading", same as ASCII character
0x01), which isn't really relevant to this use. It mightn't be a bad idea to (a) choose a control
character which makes sense in context, if there is one (I can't see one, myself), (b) use
a character from the private-use area (U+E000 to U+F8FF), or (c) my preferred option, use
the Unicode tag characters. The tag characters are designed for just such a purpose: embedding
contextual metadata in text fields. The general syntax for a tag is <TAG TYPE> followed
by one or more <TAG CHARACTER>s. Unfortunately, only one tag type is defined in Unicode
at present (the language tag), which isn't suitable here.

That said, I think it makes sense (and is probably 'nicer') to pick one of the Unicode tag
characters -- say, U+E0052 TAG LATIN CAPITAL LETTER R (for 'reverse') and use that. This could
lead to a de facto standard for Lucene fields, where different variations of the same token
could use different leading tag characters. Rather than everyone just picking a character
at random, this could give some structure to similar situations (e.g. I could
envisage a filter which uses U+E004E TAG LATIN CAPITAL LETTER N for a normalised version of
the token).
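
To make the suggestion concrete, here is a rough sketch in plain Java (no Lucene classes) of marking a reversed token with U+E0052. The class and method names are made up for illustration, and putting the marker at the front of the reversed token is an assumption, not necessarily what the attached patch does. One thing it shows: U+E0052 lies outside the BMP, so it takes two Java chars (a surrogate pair), unlike U+0001.

```java
// Hypothetical sketch, not part of Lucene or of the attached patch.
final class ReverseMarkSketch {
    // U+E0052 TAG LATIN CAPITAL LETTER R, the marker suggested above.
    static final int REVERSE_MARK = 0xE0052;

    // Reverses the token and prepends the marker code point.
    // Assumption: the marker leads the reversed token.
    static String markReversed(String token) {
        // StringBuilder.reverse() keeps surrogate pairs in the right order.
        String reversed = new StringBuilder(token).reverse().toString();
        return new StringBuilder(reversed.length() + 2)
                .appendCodePoint(REVERSE_MARK) // appends two chars (surrogate pair)
                .append(reversed)
                .toString();
    }

    // True if the term starts with the reverse marker.
    static boolean isMarked(String term) {
        return !term.isEmpty() && term.codePointAt(0) == REVERSE_MARK;
    }

    public static void main(String[] args) {
        String term = markReversed("lucene");
        System.out.println(isMarked(term));     // prints "true"
        System.out.println(isMarked("lucene")); // prints "false"
        System.out.println(term.length());      // prints "8": 2 chars of marker + 6 letters
    }
}
```

Any code that checks for the marker would need to compare code points rather than single chars, which is the main practical cost of option (c) compared with a BMP character.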

Sorry, I'm really anal about Unicode. Can't help it.

> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>                 Key: LUCENE-1813
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 2.9
>         Attachments: reverseMark-2.patch, reverseMark.patch
> This patch implements additional functionality in the filter to "mark" reversed tokens
> with a special marker character (Unicode 0001). This is useful when indexing both straight
> and reversed tokens (e.g. to implement efficient leading wildcards search).

