lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2257) relax the per-segment max unique term limit
Date Thu, 11 Feb 2010 23:26:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832729#action_12832729 ]

Michael McCandless commented on LUCENE-2257:
--------------------------------------------

bq. With the patch, we don't see any ArrayIndexOutOfBounds exceptions.

Great!  And the results look correct?

bq. Other than walking through the code in the debugger, is there some systematic way of looking
for any other places where an int is used that might also have problems when we have over
2.1x billion terms?

Not that I know of!  The code that handles the term dict lookup is
fairly contained, in TermInfosReader and SegmentTermEnum.  I think
scrutinizing the code and testing (as you're doing) is the only way.

I just looked again -- there are a few places where int is still being used.

First are two methods (get(int position) and scanEnum), in
TermInfosReader, that are actually dead code (package private &
unused).  Second is int SegmentTermEnum.scanTo, but this is fine
because it's never asked to scan more than termIndexInterval terms.

I'll attach a patch that additionally just removes that dead code.
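
As a rough illustration of why an int scan count is safe, here is a
minimal sketch of a sparse term index.  It is a hypothetical,
simplified model (the class SparseTermIndexSketch and all of its
members are made up for this example), not the actual TermInfosReader
or SegmentTermEnum code.  Every interval-th term is kept in memory; a
lookup binary-searches those entries and then scans forward, so the
scan count stays bounded by the interval while the absolute position
is computed as a long.

```java
import java.util.Arrays;

// Hypothetical, simplified model of a sparse term index; not Lucene's code.
class SparseTermIndexSketch {
    final String[] terms;       // all terms, sorted (stands in for the term dict)
    final String[] indexTerms;  // every interval-th term, held in memory
    final int interval;         // plays the role of termIndexInterval
    int lastScanCount;          // terms stepped over by the most recent lookup

    SparseTermIndexSketch(String[] sortedTerms, int interval) {
        this.terms = sortedTerms;
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * interval];
        }
    }

    // Returns the absolute position of the term as a long, or -1 if absent.
    long positionOf(String term) {
        int idx = Arrays.binarySearch(indexTerms, term);
        if (idx < 0) {
            idx = -idx - 2;     // last index entry that sorts <= term
        }
        if (idx < 0) {          // term sorts before the first term
            lastScanCount = 0;
            return -1;
        }
        long pos = (long) idx * interval;  // widen to long before multiplying
        int scanned = 0;                   // an int is enough: never exceeds interval
        while (pos < terms.length && scanned < interval) {
            // The cast is fine here only because this toy model uses an array.
            int cmp = terms[(int) pos].compareTo(term);
            if (cmp == 0) {
                lastScanCount = scanned;
                return pos;
            }
            if (cmp > 0) {
                break;          // passed the spot where the term would be
            }
            pos++;
            scanned++;
        }
        lastScanCount = scanned;
        return -1;
    }
}
```

In this model only the multiplication idx * interval can exceed the
int range; the forward scan restarts from zero at each index entry.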


> relax the per-segment max unique term limit
> -------------------------------------------
>
>                 Key: LUCENE-2257
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2257
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9.2, 3.0.1, 3.1
>
>         Attachments: LUCENE-2257.patch, LUCENE-2257.patch
>
>
> Lucene can't handle more than 2.1B (limit of signed 32 bit int) unique terms in a single segment.
> But I think we can improve this to termIndexInterval (default 128) * 2.1B.  There is one place (internal API only) where Lucene uses an int but should use a long.
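
To make the arithmetic in the description concrete, the following is a
small sketch of the int-versus-long difference.  The class
TermLimitSketch and its method names are hypothetical and exist only
for this illustration; they are not part of Lucene.

```java
// Hypothetical helper illustrating the overflow; not part of Lucene.
class TermLimitSketch {
    // Position computed in 32-bit arithmetic: wraps around silently on overflow.
    static int positionAsInt(int indexEntry, int interval) {
        return indexEntry * interval;
    }

    // Position computed in 64-bit arithmetic: widen one operand first.
    static long positionAsLong(int indexEntry, int interval) {
        return (long) indexEntry * interval;
    }

    // Upper bound on addressable terms if index entries are counted with an
    // int but term positions are tracked with a long.
    static long maxTerms(int interval) {
        return (long) Integer.MAX_VALUE * interval;
    }
}
```

With an interval of 128, index entry 20,000,000 corresponds to term
position 2,560,000,000, which no longer fits in an int, while
maxTerms(128) gives 274,877,906,816.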


