lucene-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Tue, 06 Mar 2007 12:20:24 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated LUCENE-252:
---------------------------------

    Attachment: FieldCacheImpl_Tokenized_fields_lucene_2.0.patch

In the Nutch project, we encountered a subtle bug, which I tracked down to the unintuitive
way FieldCacheImpl caches tokenized fields.

Nutch uses several index servers, and the search results from these servers are merged using
a dedup field to delete duplicates. The values of this field are cached by FieldCacheImpl.
The default is the site field, which is indexed and tokenized. However, for a tokenized field
(for example "url" in Nutch), FieldCacheImpl returns an array of terms rather than an array of
field values, so deduplication becomes faulty.
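
To illustrate the failure mode, here is a minimal sketch in plain Java. It is a simplified
model, not the actual FieldCacheImpl code: it assumes the cache holds one value per document
and is filled by walking the field's terms in sorted order, so a later term overwrites an
earlier one. The class name TermCacheModel, the method buildCache, and the token lists are
hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermCacheModel {
    // Simplified model (assumption): one cache slot per document, filled by
    // iterating over the field's terms in sorted order. Each term overwrites
    // whatever an earlier term stored for the same document.
    static String[] buildCache(List<List<String>> docTokens) {
        // Inverted index: term -> ids of the documents that contain it.
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docTokens.size(); doc++) {
            for (String token : docTokens.get(doc)) {
                postings.computeIfAbsent(token, t -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[docTokens.size()];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey();
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Hypothetical tokenization of three distinct URLs.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("http", "example", "org", "a"),
            Arrays.asList("http", "example", "org", "b"),
            Arrays.asList("http", "example", "org"));
        // All three documents end up with the same cached value, so a dedup
        // step keyed on the cache would treat them as duplicates.
        System.out.println(Arrays.toString(buildCache(docs))); // prints [org, org, org]
    }
}
```

Under this model, three different URLs collapse to the single cached value "org", which is
the kind of faulty dedup key described above.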

The current FieldCache implementation does not respect tokenized fields and, as described above,
caches only terms. I have ported the previous patch and improved it for the 2.0 branch, and
I will write a patch for the trunk.

I am voting for this patch to be committed. 

> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is tokenized and
> stored. This way you'll get the CORRECT sort order.
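
The difference between the two sort orders can be sketched in plain Java. This is only a
model under stated assumptions, not Lucene code and not the attached patch: it assumes the
term-based sort key for a document is the lexicographically last token of the field, and it
treats whitespace splitting plus lowercasing as a stand-in for the analyzer. The names
TokenizedSortModel, termKey, sortByTermKey and sortByStoredValue are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TokenizedSortModel {
    // Sort key as a term-based cache would see it (assumption: the
    // lexicographically last token of the field value wins).
    static String termKey(String value) {
        String[] tokens = value.toLowerCase().split("\\s+");
        Arrays.sort(tokens);
        return tokens[tokens.length - 1];
    }

    // Orders the values by the single term kept for each of them.
    static List<String> sortByTermKey(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing(TokenizedSortModel::termKey));
        return sorted;
    }

    // Orders the values by their full stored string.
    static List<String> sortByStoredValue(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("apple zebra", "banana cherry", "mango");
        // Term-based order: keys are "zebra", "cherry", "mango".
        System.out.println(sortByTermKey(titles));     // prints [banana cherry, mango, apple zebra]
        // Order by the full field value.
        System.out.println(sortByStoredValue(titles)); // prints [apple zebra, banana cherry, mango]
    }
}
```

In this model "apple zebra" sorts last by term but first by stored value, which matches the
observation that sorting is off mainly for fields with more than one word.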

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

