lucene-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Fri, 09 Mar 2007 16:57:09 GMT


Hoss Man commented on LUCENE-252:

I'm afraid I'm still not understanding the issue in Nutch. It seems like the root of the problem is this:

> ... We use the tokenized url field in the FieldCache ...

...if you know this field is tokenized, don't use it this way. If you do want to use it this
way, index it a second time untokenized.

At a more practical level:

1) The change you propose to getStrings and getStringIndex is not practical because, as we've
discussed before, a field being tokenized isn't a guarantee that FieldCache won't work. isTokenized
just indicates that an Analyzer was used; it doesn't indicate that any real tokenization
took place. The Analyzer might have just been used to lowercase the field value before indexing,
or to strip off leading/trailing whitespace, and that doesn't mean the normal FieldCache can't be
used for sorting. The converse is also true: !isTokenized doesn't tell you that it's safe
to build the FieldCache. Even if no Analyzer is ever used, multiple Field values can be
added for the same field, and that is the root cause of the problem: not tokenization, but
multiple terms for a given field.

2) The behavior you are requesting in a StoredFieldCacheImpl could be achieved without
making any changes whatsoever to FieldCacheImpl. Since Nutch knows exactly which fields
it's indexing multiple tokens for, it can choose between using a StoredFieldCacheImpl
and using a FieldCacheImpl. (But as I've said, I really don't think that's the right solution.)
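
To make point 1 concrete, here is a rough sketch of why multiple terms per document are the real problem. This is not Lucene's code: FieldCacheSketch, addTerm and buildCache are hypothetical names, and the TreeMap is only a stand-in for a sorted term dictionary. It assumes a term-based cache that keeps one slot per document and fills it by walking the terms in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of a term-based sort cache (not Lucene's implementation).
class FieldCacheSketch {

    // Record that `term` occurs in document `doc` for the field being sorted on.
    static void addTerm(TreeMap<String, List<Integer>> index, int doc, String term) {
        index.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
    }

    // One slot per document, filled by walking the terms in sorted order.
    // If a document has several terms, each later term overwrites the earlier one.
    static String[] buildCache(int maxDoc, TreeMap<String, List<Integer>> index) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey();
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> index = new TreeMap<>();

        // doc 0: an analyzer was used, but it only lowercased the value,
        // so there is still exactly one term and the cache is usable.
        addTerm(index, 0, "apache lucene");

        // doc 1: no analyzer at all, but two values were added for the same field.
        addTerm(index, 1, "zebra");
        addTerm(index, 1, "aardvark");

        String[] cache = buildCache(2, index);
        System.out.println(Arrays.toString(cache)); // prints [apache lucene, zebra]
    }
}
```

In this model doc 0 sorts correctly even though it went through an analyzer, while doc 1 silently loses "aardvark" even though no analyzer was involved. A tokenized flag would predict the opposite in both cases.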

> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>                 Key: LUCENE-252
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is Tokenized and
> stored. This way you'll get the CORRECT sort order.
