lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3071) PathHierarchyTokenizer adaptation for urls: splits reversed
Date Thu, 05 May 2011 16:29:03 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029416#comment-13029416 ]

Robert Muir commented on LUCENE-3071:
-------------------------------------

bq. Can you help me with the purpose of finalOffset? Or can I simply skip it in my tests (they
are working if I skip it)?

The finalOffset is supposed to be the end offset of the entire document (everything read
from the Reader), not just the end of the last token. This is needed so that offsets are
correct on multivalued fields.

Example multivalued field "foo" with two values:
"bar " <-- this one ends with a space
"baz"

With a whitespace tokenizer, value 1 will have a single token "bar" with startOffset=0, endOffset=3.
But finalOffset needs to be 4 (essentially however many chars you read in from the Reader).

This way, the offsets will then accumulate correctly for "baz".
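
To make the arithmetic above concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class FinalOffsetSketch, its Token type, and the useFinalOffset flag are all hypothetical names made up for illustration, and any offset gap an analyzer might add between values is ignored. It tokenizes each value on whitespace and compares accumulating by finalOffset against accumulating by the end offset of the last token.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hand-rolled whitespace tokenizer, not Lucene code.
class FinalOffsetSketch {

    /** A token whose offsets are relative to the start of the whole field. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "[" + startOffset + "," + endOffset + "]";
        }
    }

    /**
     * Tokenizes one value on whitespace, shifting every offset by base.
     * Returns the final offset: base plus the number of chars read,
     * including trailing whitespace that produced no token.
     */
    static int tokenizeValue(String value, int base, List<Token> out) {
        int i = 0;
        int n = value.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(value.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(value.charAt(i))) i++;
            if (i > start) {
                out.add(new Token(value.substring(start, i), base + start, base + i));
            }
        }
        return base + n;
    }

    /**
     * Tokenizes all values of a multivalued field. When useFinalOffset is
     * false, the next value starts at the end offset of the last token seen,
     * which is the mistake the comment warns about.
     */
    static List<Token> tokenizeValues(String[] values, boolean useFinalOffset) {
        List<Token> tokens = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            int finalOffset = tokenizeValue(value, base, tokens);
            if (useFinalOffset) {
                base = finalOffset;
            } else if (!tokens.isEmpty()) {
                base = tokens.get(tokens.size() - 1).endOffset;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] values = {"bar ", "baz"};
        System.out.println(tokenizeValues(values, true));   // [bar[0,3], baz[4,7]]
        System.out.println(tokenizeValues(values, false));  // [bar[0,3], baz[3,6]]
    }
}
```

With finalOffset=4 carried over, "baz" gets offsets 4-7. If only the last token's endOffset=3 were carried over, "baz" would land on 3-6 and the trailing space would be lost.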


> PathHierarchyTokenizer adaptation for urls: splits reversed
> -----------------------------------------------------------
>
>                 Key: LUCENE-3071
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3071
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Olivier Favre
>            Priority: Minor
>         Attachments: LUCENE-3071.patch, ant.log.tar.bz2
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> {{PathHierarchyTokenizer}} should be usable to split URLs in a "reversed" way (useful for faceted search against URLs):
> {{www.site.com}} -> {{www.site.com, site.com, com}}
> Moreover, it should be able to skip a given number of first (or last, if reversed) tokens:
> {{/usr/share/doc/somesoftware/INTERESTING/PART}}
> With 4 tokens skipped, this should give:
> {{INTERESTING}}
> {{INTERESTING/PART}}
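
The behaviour requested in the issue description can be sketched in plain Java as follows. This is not the attached patch and not the PathHierarchyTokenizer API: the class PathSplitSketch and its methods are hypothetical, and dropping empty parts (so a leading delimiter yields no token of its own) is a simplifying assumption chosen to match the example output in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics the requested output, not the tokenizer itself.
class PathSplitSketch {

    /** Splits input on delimiter, dropping empty parts (an assumption). */
    static List<String> parts(String input, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) == delimiter) {
                if (i > start) {
                    parts.add(input.substring(start, i));
                }
                start = i + 1;
            }
        }
        return parts;
    }

    /** Forward split: skip the first `skip` parts, then emit growing prefixes. */
    static List<String> forward(String input, char delimiter, int skip) {
        List<String> p = parts(input, delimiter);
        List<String> out = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (int i = skip; i < p.size(); i++) {
            if (sb.length() > 0) {
                sb.append(delimiter);
            }
            sb.append(p.get(i));
            out.add(sb.toString());
        }
        return out;
    }

    /** Reversed split: skip the last `skip` parts, then emit shrinking suffixes. */
    static List<String> reversed(String input, char delimiter, int skip) {
        List<String> p = parts(input, delimiter);
        int end = p.size() - skip;
        List<String> out = new ArrayList<>();
        for (int from = 0; from < end; from++) {
            out.add(String.join(String.valueOf(delimiter), p.subList(from, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        // [www.site.com, site.com, com]
        System.out.println(reversed("www.site.com", '.', 0));
        // [INTERESTING, INTERESTING/PART]
        System.out.println(forward("/usr/share/doc/somesoftware/INTERESTING/PART", '/', 4));
    }
}
```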


