lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Collator-based facet sorting in Solr
Date Tue, 11 Sep 2012 15:23:10 GMT
Just a concern where things could act a little funky:

Today, for example, if I set strength=primary, then it's going to fold
Test and test to the same unique term,
but under this scheme you would have <bytes>Test and <bytes>test as two terms.
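
To make the difference concrete, here is a minimal sketch. It uses the JDK's
java.text.Collator as a stand-in for the ICU collator (an assumption: the
actual key bytes differ between the two), and compositeTerm() is a
hypothetical helper that only models the <key><00><original value> part of
the proposed layout, leaving out the trailing offset.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class PrimaryStrengthFolding {

    // Sort key bytes at PRIMARY strength, where case differences are ignored.
    // JDK collator used as a stand-in for ICU (assumption for illustration).
    static byte[] sortKey(String value) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.getCollationKey(value).toByteArray();
    }

    // Hypothetical composite term: <key><00><original value>.
    // The trailing offset from the proposal is omitted here.
    static byte[] compositeTerm(String value) {
        byte[] key = sortKey(value);
        byte[] original = value.getBytes(StandardCharsets.UTF_8);
        byte[] term = new byte[key.length + 1 + original.length];
        System.arraycopy(key, 0, term, 0, key.length);
        // term[key.length] is already 0 and acts as the separator.
        System.arraycopy(original, 0, term, key.length + 1, original.length);
        return term;
    }

    public static void main(String[] args) {
        // Plain collation keys: both spellings collapse to one term.
        System.out.println(Arrays.equals(sortKey("Test"), sortKey("test")));
        // Composite terms: the appended original value keeps them apart.
        System.out.println(Arrays.equals(compositeTerm("Test"), compositeTerm("test")));
    }
}
```

The first line printed is true (one facet term), the second is false (two
facet terms that sort next to each other but are counted separately).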

This could be undesirable in the typical case where you just want
case-insensitive facets, but we don't provide
any way to preprocess the text to avoid this.

Really, a lot of this is because factory-based analysis chains have no
way to specify the AttributeFactory.
I guess if we really wanted to fix this right, we would need to
pass the AttributeFactory in to TokenizerFactory's create() method.

But for now, from Solr it would be a little hacky: someone is
going to have to fold the case client-side (or similar)
if they don't want these problems.


On Tue, Sep 11, 2012 at 10:43 AM, Toke Eskildsen <te@statsbiblioteket.dk> wrote:
> Claudio Ranieri and I briefly discussed collator based sorting for
> facets in the thread "Problem with accented words sorting" on the
> solr-user mailing list. Here's the idea:
>
> Solr faceting supports sorting by either count or index order. Claudio
> and I both need the order to be collator-based. My understanding of the
> issue is that it is not currently possible.
>
> Collator-based document sorting in Solr uses CollationKeys as field
> values. This does not work with faceting on fields with multiple values
> as there is no mapping from the key to the human readable value.
>
> ICU sort keys are always null (00) terminated and when two keys are
> compared, the comparison stops as soon as null is reached(?)
> http://userguide.icu-project.org/collation/architecture
>
> If we concatenate the keys with the original values:
> <key><00><original value><offset of original value>
> we get an entity where the ordering is still correct upon comparison and
> where the original value can be extracted by using the offset from the
> last int (or maybe short, to spare 2 bytes) in the BytesRef.
>
> If the idea is sound, I'll open a JIRA issue. Unfortunately I do not
> have time right now for hacking on it.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
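
For reference, the layout proposed in the quoted message,
<key><00><original value><offset of original value>, could be sketched
roughly as below. This is only an illustration under stated assumptions:
encode() and decode() are hypothetical names, the offset is stored as a
4-byte big-endian int, a plain byte[] stands in for a BytesRef, and the JDK
collator stands in for ICU. Whether byte-wise comparison of such terms
preserves collation order depends on the sort keys themselves being free of
zero bytes, which this sketch does not check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

public class CompositeFacetTerm {

    // Build <key><00><original value><offset>, where offset is the index
    // of the first byte of the original value inside the term.
    static byte[] encode(byte[] sortKey, String original) {
        byte[] value = original.getBytes(StandardCharsets.UTF_8);
        int offset = sortKey.length + 1; // key bytes plus the 00 separator
        ByteBuffer buf = ByteBuffer.allocate(offset + value.length + Integer.BYTES);
        buf.put(sortKey);
        buf.put((byte) 0);
        buf.put(value);
        buf.putInt(offset); // last 4 bytes, big-endian
        return buf.array();
    }

    // Recover the human-readable value using the offset in the last int.
    static String decode(byte[] term) {
        int offsetPos = term.length - Integer.BYTES;
        int offset = ByteBuffer.wrap(term, offsetPos, Integer.BYTES).getInt();
        return new String(term, offset, offsetPos - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // JDK collator as a stand-in for ICU (assumption for illustration).
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] key = collator.getCollationKey("Test").toByteArray();
        byte[] term = encode(key, "Test");
        System.out.println(decode(term)); // prints "Test"
    }
}
```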



-- 
lucidworks.com


