lucene-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Tue, 13 Mar 2007 16:57:09 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480476 ]

Enis Soztutar commented on LUCENE-252:
--------------------------------------

I should admit that, considering the case Yonik has mentioned, throwing an exception based on
a Field.isTokenized() check is not suitable. However, the check if (t >= mterms.length) exists only
in getStringIndex() and not in getStrings(). I think that a more robust check than that one
should be included in both the getStrings() and getStringIndex() functions. One possibility would be
to allocate a boolean array (or BitSet) of the same size as retArray, and then use that
array to avoid multiple terms per document.
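
The BitSet idea above can be sketched roughly as follows. This is only an illustration, not the code in FieldCacheImpl or in any attached patch: the class name SingleTermGuard, the postings map (sorted term to document ids) and the choice to throw are assumptions made so the example runs without Lucene. Only retArray and the BitSet come from the comment.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SingleTermGuard {
    /**
     * Fills one string per document from a sorted term -> docIds map
     * (a hypothetical stand-in for walking the term enumeration of a field).
     * Throws as soon as a document is reached by a second term.
     */
    static String[] getStrings(TreeMap<String, List<Integer>> postings, int maxDoc) {
        String[] retArray = new String[maxDoc];
        BitSet seen = new BitSet(maxDoc); // one bit per document, same size as retArray
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (seen.get(doc)) {
                    throw new IllegalStateException(
                        "document " + doc + " has more than one term; the field looks tokenized");
                }
                seen.set(doc);
                retArray[doc] = entry.getKey();
            }
        }
        return retArray;
    }
}
```

Unlike a comparison of the term count against the document count, this catches a multi-term document even when the field has few distinct terms overall. Whether to throw, or to silently keep the first term, is a design choice the comment leaves open.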

> 2) the desired behavior you are requesting in a StoredFieldCacheImpl could be done without
> making any changes whatsoever to FieldCacheImpl -- since nutch knows exactly which fields
> it's indexing multiple tokens for, it can make the choice between using a StoredFieldCacheImpl
> or using a FieldCacheImpl.

From my previous post: In Nutch we have 3 options. The 1st is to disallow deleting duplicates
on tokenized fields (due to FieldCache), the 2nd is to index the tokenized field twice (once tokenized
and once untokenized), and the 3rd is to use the above patch and warm the cache initially in the index servers.

Yes, indexing a field a second time is an option, but considering my use cases with Nutch,
why would I want to grow my index by indexing the field twice instead of tolerating 30 seconds
of cache building in a web server that will serve the indexes for days or even weeks?

With a class like StoredFieldCacheImpl we can get the desired behaviour without modifying
FieldCacheImpl, and the suggestion in my previous post, without its 1st part, does just this.
I could have sent this to Nutch, but I think it is a Lucene issue.
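
As a rough picture of what a cache built from stored values would do, here is a minimal sketch. It is not the StoredFieldCacheImpl from the patch: the class name StoredValueCacheSketch and the storedValue lookup (document id to stored string) are hypothetical stand-ins for reading a stored field from the index.

```java
import java.util.function.IntFunction;

public class StoredValueCacheSketch {
    /**
     * Builds one sort value per document from the stored field value
     * instead of from the indexed terms. The storedValue function is a
     * hypothetical stand-in for loading a document's stored field.
     */
    static String[] buildFromStored(IntFunction<String> storedValue, int maxDoc) {
        String[] values = new String[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            // May be null when the document has no stored value for the field.
            values[doc] = storedValue.apply(doc);
        }
        return values;
    }
}
```

Every document is visited once, which is where the one-time cache-building cost comes from. In exchange, a multi-word value such as "big apple" is kept whole for sorting rather than being reduced to one of its tokens.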

Any more suggestions?





> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>
>                 Key: LUCENE-252
>                 URL: https://issues.apache.org/jira/browse/LUCENE-252
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch, FieldCacheImpl_Tokenized_fields_lucene_2.2-dev.patch
>
>
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then the sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is tokenized and
> stored. This way you'll get the CORRECT sort order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



