lucene-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Thu, 08 Mar 2007 16:50:24 GMT


Hoss Man commented on LUCENE-252:

definitely in agreement with yonik here, erroring out if "docField.isTokenized()" would prevent
some perfectly valid use cases ... my point was that the current test of "if (t >= mterms.length)"
only triggers an error if there are more total terms in the field than there are documents
in the index ... but there can be plenty of situations where a doc has more than one indexed
term, but the total number of indexed terms is less than the number of documents. a better
test would be to check and see if we have already recorded a term for this doc.
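
To make the difference between the two tests concrete, here is a minimal sketch that does not use Lucene at all. It models the index as a sorted map from term to the doc ids containing it. The class name, method names, and the sample data are all hypothetical, and the "slot 0 means no term" convention is an assumption about how the ordinal array is laid out, not a quote of FieldCacheImpl.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class MultiTermCheck {

    /** Models the "if (t >= mterms.length)" style test: trips only when
     *  the field has more distinct terms than the index has documents. */
    static boolean totalTermCheckTrips(SortedMap<String, int[]> postings, int maxDoc) {
        int slots = maxDoc + 1;   // assumption: slot 0 is reserved for "no term"
        int t = 1;
        for (String term : postings.keySet()) {
            if (t >= slots) {
                return true;      // ran out of term slots
            }
            t++;
        }
        return false;
    }

    /** Models the suggested test: trips as soon as any document is about
     *  to receive a second term ordinal. */
    static boolean perDocCheckTrips(SortedMap<String, int[]> postings, int maxDoc) {
        int[] order = new int[maxDoc];   // 0 = no term recorded for this doc yet
        int t = 1;
        for (int[] docs : postings.values()) {
            for (int doc : docs) {
                if (order[doc] != 0) {
                    return true;         // this doc already has a term
                }
                order[doc] = t;
            }
            t++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical 4-doc index: doc0 = "red apple", doc1 = "red pear",
        // doc2 = "apple", doc3 = "pear". Three distinct terms, four docs.
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2});
        postings.put("pear", new int[] {1, 3});
        postings.put("red", new int[] {0, 1});

        System.out.println("total-term check trips: " + totalTermCheckTrips(postings, 4));
        System.out.println("per-doc check trips:    " + perDocCheckTrips(postings, 4));
    }
}
```

With this sample data the total-term check stays silent (three terms, four documents) while the per-doc check fires on doc0 when it reaches "red", which is the situation described above.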

I have to say: I'm really not understanding how the current behavior is hindering nutch ...
my understanding of the nutch model is that the set of fields is very well known -- why do
you need to rely on FieldCache being smart enough to stop you from trying to sort on a tokenized
field? (and what does that have to do with deleting duplicates?)

if nothing else: if nutch needs to prevent using FieldCache based sorting on tokenized fields,
why can't the "if (docField.isTokenized())" logic be done outside of the FieldCacheImpl ...
possibly as a way to decide if you want to use the basic sorting or use something like LUCENE-769?
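
A rough sketch of what "doing the check outside" could look like on the application side, in plain Java with no Lucene dependency. Everything here is hypothetical: the class, the enum, and its constant names are made up for illustration, and the caller is assumed to have obtained the `tokenized` flag itself (for example from a "docField.isTokenized()" style call).

```java
class SortStrategyChooser {

    /** Hypothetical labels for the two options mentioned above. */
    enum Strategy {
        FIELD_CACHE,   // the basic FieldCache-based sorting
        ALTERNATIVE    // some other approach, e.g. something like LUCENE-769
    }

    /** Application-level decision made before any sort is built, so the
     *  cache itself never has to reject tokenized fields. */
    static Strategy choose(boolean tokenized) {
        return tokenized ? Strategy.ALTERNATIVE : Strategy.FIELD_CACHE;
    }

    public static void main(String[] args) {
        System.out.println("tokenized field   -> " + choose(true));
        System.out.println("untokenized field -> " + choose(false));
    }
}
```

The point of the sketch is only where the decision lives: callers that know a field is tokenized route around the cache, and callers with valid tokenized use cases are not blocked.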

...perhaps this is something that should be discussed more on java-dev?

> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>                 Key: LUCENE-252
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
> When you set a SortField to a Text field which gets tokenized
> FieldCacheImpl uses the term to do the sort, but then sorting is off 
> especially with more than one word in the field. I think it is much 
> more logical to sort by field's string value if the sort field is Tokenized and
> stored. This way you'll get the CORRECT sort order

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
