lucene-dev mailing list archives

From "Enis Soztutar (JIRA)" <>
Subject [jira] Commented: (LUCENE-252) [PATCH] Problem with Sort logic on tokenized fields
Date Fri, 09 Mar 2007 10:28:24 GMT


Enis Soztutar commented on LUCENE-252:

I also agree with tokenized field caching, which is a use case for Nutch. Let me elaborate
on the use case. In a Nutch deployment, we generate indexes from web documents, and the
set of fields is known a priori. The indexes are then distributed to several index servers,
which are reached through Hadoop's RPC calls. A query is sent to all of the index servers, and
the results are collected and merged on the fly. Since the indexes need not be disjoint (crawling
is an adaptive process), the results must be merged without returning a document more than
once, so we need a unique key to represent the document. The default Nutch codebase uses the
untokenized site field (the URL's hostname) for this task and allows only 1 - 2 documents
from a site in the search results. For obvious performance reasons, the site field is cached
in the index servers with FieldCache.getStrings(). The problem arises when we want to show
more than one result from a specific site (for example in a query ), and
we have the same url indexed in more than one index server. If we use the tokenized url field
in the FieldCache, then deleting duplicates becomes error prone. Since we use FieldCache.getStrings()
rather than FieldCache.getStringIndex(), the problem here is not tokenized field sorting
but tokenized fields not being cached correctly: getStrings() returns an array like [com, edu,
www, youtube, ] (for each doc, only a single token is returned rather than the whole url).
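
To illustrate why only one token per document ends up in the cache, here is a minimal, Lucene-free sketch. It assumes the cache is filled by walking the field's terms in sorted order and writing each term into the slot of every document that contains it, so later terms overwrite earlier ones. The class name, the toy inverted index, and the sample tokens are all hypothetical and are not taken from the Lucene or Nutch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermDrivenCacheDemo {

    // Builds a toy inverted index (term -> doc ids), a hypothetical stand-in
    // for the term dictionary of a tokenized field.
    static TreeMap<String, List<Integer>> invert(String[][] tokensPerDoc) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < tokensPerDoc.length; doc++) {
            for (String token : tokensPerDoc[doc]) {
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(doc);
            }
        }
        return postings;
    }

    // Fills one cache slot per document by iterating terms in sorted order.
    // A document with several tokens keeps only the last term written to it.
    static String[] fillCache(TreeMap<String, List<Integer>> postings, int maxDoc) {
        String[] cache = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                cache[doc] = entry.getKey(); // overwrites any earlier token
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Illustrative tokenizations of two URLs (assumed, not real data).
        String[][] docs = {
            {"http", "www", "youtube", "com"},
            {"http", "example", "edu"}
        };
        String[] cache = fillCache(invert(docs), docs.length);
        System.out.println(Arrays.toString(cache)); // prints [youtube, http]
    }
}
```

In this model neither slot holds a whole url, and two different urls that share their last token would look identical, which is why such values are unreliable as a duplicate-detection key.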

Well, if you are still with me, here is my proposal:

1. In both the getStrings and getStringIndex functions, add:

      Field docField = getField(reader, field);
      if (docField != null && docField.isStored() && docField.isTokenized())
          throw new RuntimeException("Caching in Tokenized Fields is not allowed");

2. Subclass FieldCacheImpl as StoredFieldCacheImpl and implement stored field caching there,
delegating untokenized fields to the superclass.
3. Expose the new implementation next to the default cache:

 public static FieldCache DEFAULT = new FieldCacheImpl();
 public static FieldCache STORED_CACHE = new StoredFieldCacheImpl();

This way Lucene internals are not affected, and stored field caching becomes possible.
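
The three steps can be sketched without any Lucene dependency, roughly as follows. This is only a toy model under stated assumptions: ToyField, TermCache, StoredCache, and sampleIndex are hypothetical names standing in for the index, FieldCacheImpl, and the proposed subclass, and the sample values are invented. Only the exception message is taken from the proposal itself.

```java
import java.util.HashMap;
import java.util.Map;

public class StoredCacheSketch {

    /** Toy description of one field across all docs (hypothetical, not a Lucene type). */
    static final class ToyField {
        final boolean stored;
        final boolean tokenized;
        final String[] values; // whole field value per doc
        ToyField(boolean stored, boolean tokenized, String[] values) {
            this.stored = stored;
            this.tokenized = tokenized;
            this.values = values;
        }
    }

    /** Plays the role of the default cache: refuses tokenized stored fields (step 1). */
    static class TermCache {
        protected final Map<String, ToyField> index;
        TermCache(Map<String, ToyField> index) { this.index = index; }

        String[] getStrings(String field) {
            ToyField f = index.get(field);
            if (f != null && f.stored && f.tokenized) {
                throw new RuntimeException("Caching in Tokenized Fields is not allowed");
            }
            // For an untokenized field the single term equals the whole value,
            // so the toy model simply copies it.
            return f == null ? new String[0] : f.values.clone();
        }
    }

    /** Plays the role of the proposed subclass (step 2). */
    static class StoredCache extends TermCache {
        StoredCache(Map<String, ToyField> index) { super(index); }

        @Override
        String[] getStrings(String field) {
            ToyField f = index.get(field);
            if (f != null && f.stored && f.tokenized) {
                return f.values.clone(); // cache the stored value of each doc
            }
            return super.getStrings(field); // untokenized fields: delegate
        }
    }

    /** Invented sample data: an untokenized site field and a tokenized, stored url field. */
    static Map<String, ToyField> sampleIndex() {
        Map<String, ToyField> index = new HashMap<>();
        index.put("site", new ToyField(false, false,
                new String[] {"example.com", "example.edu"}));
        index.put("url", new ToyField(true, true,
                new String[] {"http://example.com/a", "http://example.edu/b"}));
        return index;
    }

    public static void main(String[] args) {
        // Step 3: two caches side by side, mirroring DEFAULT and STORED_CACHE.
        TermCache defaultCache = new TermCache(sampleIndex());
        TermCache storedCache = new StoredCache(sampleIndex());
        System.out.println(String.join(", ", storedCache.getStrings("url")));
        try {
            defaultCache.getStrings("url");
        } catch (RuntimeException e) {
            System.out.println("default cache: " + e.getMessage());
        }
    }
}
```

The point of the split is that existing callers of the default cache fail fast on a tokenized field instead of silently receiving single tokens, while callers that need whole values opt in through the second cache.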

> [PATCH] Problem with Sort logic on tokenized fields
> ---------------------------------------------------
>                 Key: LUCENE-252
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.4
>         Environment: Operating System: other
> Platform: All
>            Reporter: Aviran Mordo
>         Assigned To: Lucene Developers
>         Attachments: dif.txt, FieldCacheImpl_Tokenized_fields_lucene_2.0.patch, FieldCacheImpl_Tokenized_fields_lucene_2.0_v1.1.patch,
> When you set a SortField to a Text field which gets tokenized,
> FieldCacheImpl uses the term to do the sort, but then sorting is off,
> especially with more than one word in the field. I think it is much
> more logical to sort by the field's string value if the sort field is tokenized and
> stored. This way you'll get the CORRECT sort order.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
