lucene-dev mailing list archives

From "Enis Soztutar (JIRA)" <>
Subject [jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Thu, 08 Mar 2007 08:25:24 GMT


Enis Soztutar commented on LUCENE-252:

Well, I spent half a day tracking down this issue with tokenized field caching, so I absolutely
agree with throwing an exception in the getStrings() and getStringIndex() methods of FieldCacheImpl.
A snippet would look like this:

Field docField = getField(reader, field);
if (docField != null && docField.isStored() && docField.isTokenized())
    throw new RuntimeException("Caching in Tokenized Fields is not allowed");

Looking at the timing of cache building, tokenized fields are really slow, as Doug mentioned.
For a real 1.5M index (from web documents), building the cache on a tokenized field takes 1600
ms on average, but for an untokenized field it takes 30000 ms on average.

In Nutch we have three options: the first is to disallow deleting duplicates on tokenized fields
(due to FieldCache); the second is to index the field twice (once tokenized and once untokenized);
the third is to use the above patch and warm the cache initially on the index servers.
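
To illustrate why the second option helps, here is a minimal sketch in plain Java. It does not use Lucene; the class and method names (TokenizedSortSketch, index, buildCache, sortedDocs) are hypothetical. It assumes the cache is filled by walking the field's terms in sorted order and assigning each term to the documents that contain it, so that on a tokenized field a later term overwrites an earlier one for the same document.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class TokenizedSortSketch {

    /** Toy inverted index for one field: term -> ids of the documents containing it. */
    public static TreeMap<String, SortedSet<Integer>> index(String[] values, boolean tokenized) {
        TreeMap<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < values.length; doc++) {
            // Tokenized: one term per word. Untokenized: the whole value is a single term.
            String[] terms = tokenized ? values[doc].split("\\s+") : new String[] { values[doc] };
            for (String term : terms) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(doc);
            }
        }
        return postings;
    }

    /** Assumed cache-building loop: walk terms in sorted order, one cache slot per document. */
    public static String[] buildCache(TreeMap<String, SortedSet<Integer>> postings, int maxDoc) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, SortedSet<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey(); // a later term overwrites an earlier one
            }
        }
        return cache;
    }

    /** Document ids ordered by their cached value. */
    public static Integer[] sortedDocs(String[] cache) {
        Integer[] docs = new Integer[cache.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs, (a, b) -> cache[a].compareTo(cache[b]));
        return docs;
    }

    public static void main(String[] args) {
        String[] titles = { "zebra crossing", "apple zest", "mango pie" };

        // Tokenized field: each document ends up with only its last term.
        String[] tokenizedCache = buildCache(index(titles, true), titles.length);
        System.out.println(Arrays.toString(tokenizedCache));             // [zebra, zest, pie]
        System.out.println(Arrays.toString(sortedDocs(tokenizedCache))); // [2, 0, 1]

        // Untokenized copy of the same field: the cache holds the full value.
        String[] untokenizedCache = buildCache(index(titles, false), titles.length);
        System.out.println(Arrays.toString(sortedDocs(untokenizedCache))); // [1, 2, 0]
    }
}
```

In this toy model the tokenized field sorts "mango pie" first because only "pie" survives in the cache, while the untokenized copy sorts by the whole title. The cost of the second option is the extra index space for the duplicate field.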

I am in favor of the third option and believe that this patch is necessary and can be included
with an explanatory javadoc.
Another option would be to extend the default FieldCacheImpl to allow tokenized field
caching, naming the class along the lines of LUCENE-769's, such as StoredFieldCacheImpl. If that
is OK, I can prepare a patch and send it here.

> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>                 Key: LUCENE-252
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is tokenized and
> stored. This way you'll get the CORRECT sort order.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
