lucene-dev mailing list archives

From "Chuck Williams (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
Date Wed, 21 Nov 2007 23:55:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544644
] 

Chuck Williams commented on LUCENE-1052:
----------------------------------------

> It almost feels like we should have "hooks" that are invoked at
> certain times, like when we are about to load the term infos index,
> that give the application a chance to change something...

I agree with the need for some kind of hook.  This is what TermInfosConfigurer is.  It calls
a method whenever a SegmentReader reads an index, to obtain the parameters (termIndexDivisor)
that should be used to configure the TermInfosReader.

Why not make the setters/getters on SegmentIndexProperties regular non-static methods, and
allow hook methods as well?  E.g., setTermIndexDivisor(), getTermIndexDivisor(), getMaxTermsCached(String
segmentName, int segmentNumDocs, long segmentNumTerms).  Non-static methods make the defaulting
straightforward and allow subclassing to override the hook methods.
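To make the non-static idea concrete, here is a minimal sketch.  TermInfosConfigurer is from the attached patch; the class name, method signatures, and default below are illustrative assumptions, not the patch's actual API:

```java
// Illustrative sketch only: the real TermInfosConfigurer lives in the
// attached patch; this class shape is an assumption for discussion.
public class ConfigurerSketch {

    private int termIndexDivisor = 1;  // instance-level default

    // Regular (non-static) setter/getter, so the defaulting is
    // straightforward and per-instance.
    public void setTermIndexDivisor(int divisor) {
        this.termIndexDivisor = divisor;
    }

    public int getTermIndexDivisor() {
        return termIndexDivisor;
    }

    // Per-segment hook: a subclass can override this to choose a divisor
    // from the segment's statistics; the default just returns the
    // instance-wide setting.
    public int getTermIndexDivisor(String segmentName, int segmentNumDocs,
                                   long segmentNumTerms) {
        return termIndexDivisor;
    }
}
```

A SegmentReader would call the three-argument hook as it opens each segment's term index, so subclassing overrides the policy without touching any static state.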

> It sounds like a detector for this would be very useful. It would, e.g., substantially
> speed updates of such indexes, and not slow searches of them like a divisor does.
> At Excite we evolved effective heuristics for wordness to keep our dictionaries from
> exploding.

Yes, we are pursuing that approach as well, but we have some stringent requirements in our
market.  E.g., we cannot filter *any* valid content, because searches must be guaranteed to
find all matching results.  As a result, we cannot impose any maximum length for
documents.

Any type of binary content recognizer would either need to be 100% accurate, which is impossible,
or require human intervention to validate the filtering.  For a human-intervention approach to
be viable, the false positive rate must be tiny.  To be effective, the false negative rate must
be tiny.  Although invalid content is pretty easy for people to recognize, I've found so
far that high-accuracy recognition rules are surprisingly subtle.

Do you by chance know of any quality work in this area?

> > int bound = (int) (1+TERM_BOUNDING_MULTIPLIER*Math.sqrt(1+segmentNumDocs)/TERM_INDEX_INTERVAL);

> This sounds like a fine approach.

It seems to be working ok, but there is one issue.  Heap's Law is based on the total number
of tokens in the content, not the total number of documents.  I.e., longer documents will
generate more distinct terms than shorter documents.  For large segments the use of numDocs
works ok due to statistical averaging, but for smaller segments there are errors.  I may loosen
the bound somewhat on smaller segments in order to allow for their larger standard deviation.

If Lucene indexes tracked totalTokens (with duplicates, i.e. not numDistinctTokens) that would
be perfect, but they don't.  I don't know whether or not there would be other good uses for
totalTokens but mention its relevance here in case there are.
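For reference, the sqrt bound from the quoted line can be sketched as below.  The multiplier value is a placeholder assumption (not a recommended setting); 128 is Lucene's default term index interval:

```java
public class TermBound {
    // Hypothetical tuning constant; the real value depends on the corpus.
    static final double TERM_BOUNDING_MULTIPLIER = 1000.0;
    // Lucene's default termIndexInterval.
    static final int TERM_INDEX_INTERVAL = 128;

    // Heap's-Law-style cap on the number of indexed terms kept in RAM for
    // a segment, using doc count as a proxy for total tokens.  As noted
    // above, the proxy averages out for large segments but is noisier for
    // small ones.
    static int bound(int segmentNumDocs) {
        return (int) (1 + TERM_BOUNDING_MULTIPLIER
                        * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
    }
}
```

With these placeholder constants, a segment of 999,999 docs gets a bound of (int)(1 + 1000 * 1000 / 128) = 7813 indexed terms held in RAM.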


> Add an "termInfosIndexDivisor" to IndexReader
> ---------------------------------------------
>
>                 Key: LUCENE-1052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1052
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1052.patch, LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM a reader uses to load the indexed terms vs. the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you can further sub-sample the termIndexInterval to use
> less RAM.  E.g., a setting of 2 means every 2 * termIndexInterval'th term is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
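The divisor semantics described in the issue summary above amount to simple arithmetic; a sketch (the helper name is illustrative, not from Lucene):

```java
public class DivisorMath {
    // With termIndexInterval i and indexDivisor d, roughly one indexed
    // term per i * d terms is held in RAM after sub-sampling.
    static long indexedTermsInRam(long numTerms, int termIndexInterval,
                                  int indexDivisor) {
        return numTerms / ((long) termIndexInterval * indexDivisor);
    }
}
```

So a segment with 128 million terms at the default interval of 128 keeps about 1 million index terms in RAM; a divisor of 2 halves that to about 500,000, at the cost of a longer scan per term lookup.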

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


