lucene-dev mailing list archives

From "Paul Cowan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1813) Add option to ReverseStringFilter to mark reversed tokens
Date Mon, 17 Aug 2009 00:28:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743937#action_12743937 ]

Paul Cowan commented on LUCENE-1813:
------------------------------------

OK, cool. I'm taking an interest in this purely because I have some ideas for other token
filters which would do something similar, and really like the idea of tagging them in the
same way just with different 'headers'. It would be really beneficial, I think, to come up
with something that can be reused and, more importantly, combined (so different filters don't
'clash' with their output). What about making it 2 characters, at least? 

U+0001 START OF HEADING
U+xxxx whatever you like to indicate 'reversing' (i.e. an 'R', or just a 0-byte as this is
the first purpose allocated, or whatever)
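
To make the two-character proposal concrete, here is a minimal sketch in plain Java. It assumes a one-character tag ('R' for reversed is only an example) placed right after U+0001. The class `TermHeader` and its method names are hypothetical and are not part of Lucene or of the attached patches.

```java
/** Hypothetical helper, not part of Lucene: a two-character term header. */
final class TermHeader {
    /** U+0001 announces that the next character is a filter tag. */
    static final char SOH = '\u0001';

    private TermHeader() {}

    /** Prepends SOH and a one-character tag (e.g. 'R' for reversed) to a term. */
    static String mark(char tag, String term) {
        return new StringBuilder(term.length() + 2)
                .append(SOH).append(tag).append(term).toString();
    }

    /** True if the term starts with SOH followed by at least a tag character. */
    static boolean isMarked(String term) {
        return term.length() >= 2 && term.charAt(0) == SOH;
    }

    /** Returns the tag of a marked term, or -1 if the term carries no header. */
    static int tagOf(String term) {
        return isMarked(term) ? term.charAt(1) : -1;
    }

    /** Removes the header if present; unmarked terms are returned unchanged. */
    static String strip(String term) {
        return isMarked(term) ? term.substring(2) : term;
    }
}
```

A filter that does something other than reversing would call `mark` with its own tag, so terms from different filters sort into separate ranges of the term dictionary instead of clashing.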

This adds 2 bytes to each term, not 1, but terms generally don't take up that much room in
the scale of a whole index and I think it's worth the flexibility. Hell, if you're willing
to use 3 (that IS starting to seem wasteful, I admit) then maybe

U+0001 START OF HEADING
U+xxxx whatever
U+0002 START OF TEXT
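
A rough sketch of the three-character form, again in plain Java with hypothetical names (`FramedTermHeader` is not a Lucene class). One assumption goes beyond the comment: because U+0002 closes the header explicitly, the sketch lets the header carry more than one tag character, which is one possible way to combine filters.

```java
/** Hypothetical helper, not part of Lucene: a header framed by U+0001 and U+0002. */
final class FramedTermHeader {
    static final char SOH = '\u0001'; // opens the header
    static final char STX = '\u0002'; // closes the header, term text follows

    private FramedTermHeader() {}

    /** Wraps the given tag characters between SOH and STX, then appends the term. */
    static String mark(String tags, String term) {
        return SOH + tags + STX + term;
    }

    /** Returns the tags between SOH and STX, or "" if the term has no header. */
    static String tagsOf(String term) {
        int end = headerEnd(term);
        return end < 0 ? "" : term.substring(1, end);
    }

    /** Returns the term text after STX, or the term itself if it has no header. */
    static String textOf(String term) {
        int end = headerEnd(term);
        return end < 0 ? term : term.substring(end + 1);
    }

    /** Index of the closing STX, or -1 if the term does not start with a framed header. */
    private static int headerEnd(String term) {
        if (term.isEmpty() || term.charAt(0) != SOH) {
            return -1;
        }
        return term.indexOf(STX, 1);
    }
}
```

With a single tag this costs exactly the three characters discussed above; the explicit terminator is what allows a reader to skip a header whose tags it does not recognise.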

That's at least semantically meaningful. Other ideas, just looking at the ASCII control characters:

U+xxxx whatever
U+001F UNIT SEPARATOR

or

U+000E SHIFT OUT
U+xxxx whatever
U+000F SHIFT IN

I don't really mind, but it's always nice to plan ahead.

> Add option to ReverseStringFilter to mark reversed tokens
> ---------------------------------------------------------
>
>                 Key: LUCENE-1813
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1813
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 2.9
>
>         Attachments: reverseMark-2.patch, reverseMark.patch
>
>
> This patch implements additional functionality in the filter to "mark" reversed tokens
> with a special marker character (Unicode 0001). This is useful when indexing both straight
> and reversed tokens (e.g. to implement efficient leading wildcard search).
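
As an illustration of why the marker helps with leading wildcards, the sketch below reverses a token at index time and rewrites a pattern such as "*ing" into a prefix over the marked, reversed terms. It is plain Java with hypothetical names (`LeadingWildcardSketch` is not the patch's code) and it only handles a single leading '*'.

```java
/** Hypothetical illustration, not the patch itself. */
final class LeadingWildcardSketch {
    /** The marker character from the issue description (U+0001). */
    static final char MARKER = '\u0001';

    private LeadingWildcardSketch() {}

    /** Index-time form of a token: the marker followed by the reversed token. */
    static String reversedAndMarked(String token) {
        return MARKER + new StringBuilder(token).reverse().toString();
    }

    /**
     * Rewrites a pattern with exactly one leading '*' (e.g. "*ing") into the
     * prefix to look up among the marked, reversed terms.
     */
    static String prefixForLeadingWildcard(String pattern) {
        if (pattern.length() < 2 || pattern.charAt(0) != '*'
                || pattern.indexOf('*', 1) >= 0) {
            throw new IllegalArgumentException("expected a single leading '*': " + pattern);
        }
        return reversedAndMarked(pattern.substring(1));
    }
}
```

Because every reversed term starts with the marker, a prefix scan for the rewritten pattern cannot accidentally match a straight token that happens to begin with the same letters.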

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

