lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3807) Cleanup suggester API
Date Mon, 20 Feb 2012 15:25:34 GMT


Robert Muir commented on LUCENE-3807:

I like the patch, with just one concern (it's fine to commit it as-is though; we can solve this
on another issue, I just couldn't help but notice).

I don't think we should have the BufferedTermFreqIteratorWrapper/etc, and the SortedTermFreqIterator
marker interface needs to be fixed.

Here are the problems:
* Marker interface SortedTermFreqIterator doesn't tell you if it's UTF-8 or UTF-16 order. It's
implemented by two classes: SortedTermFreqIteratorWrapper, which sorts in UTF-16 order, and
HighFrequencyDictionary, which returns terms from the index (so UTF-8 order). The problem is
that classes that rely upon sorted order, like JaSpell/TST, are likely broken already.
Fortunately FST/WFST always do their own sort.
* Buffering in RAM is not ideal. Instead I think all of these classes should be using our
Sort anyway, which can spill to disk.
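
To make the first problem concrete, here is a minimal sketch in plain Java (no Lucene classes) of how UTF-16 order and UTF-8 order can disagree. The class and method names (SortOrderDemo, compareUtf8, sortedUtf16, sortedUtf8) are made up for illustration and are not part of the patch. The two terms are chosen on purpose: one is a BMP character above the surrogate range (U+FF5E), the other a supplementary character (U+1F600).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of Lucene or the patch.
class SortOrderDemo {

    // Compare two strings by their UTF-8 encoded bytes, treating each byte as unsigned.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    // Sort a copy of the terms in UTF-16 code unit order (String.compareTo).
    static List<String> sortedUtf16(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sort a copy of the terms in unsigned UTF-8 byte order.
    static List<String> sortedUtf8(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(SortOrderDemo::compareUtf8);
        return copy;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, UTF-8 bytes EF BD 9E
        String supp = "\uD83D\uDE00";  // U+1F600, UTF-8 bytes F0 9F 98 80
        List<String> terms = Arrays.asList(bmp, supp);

        // UTF-16 order puts the surrogate pair first (0xD83D < 0xFF5E);
        // UTF-8 order puts the BMP character first (0xEF < 0xF0).
        System.out.println("same order: " + sortedUtf16(terms).equals(sortedUtf8(terms)));
    }
}
```

A consumer that assumes one order but is handed the other would see these two terms "out of order", which is why a marker interface that doesn't say which order it promises is a problem.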

For now could we put the BytesRefList in the suggest package, since it's only used there? We
might not need it after we clean up this sorting stuff in some future issue.

Also I don't think we should factor out the BytesRefIterator. I seriously think it's a bad
idea to tie our core index Terms enumeration API to the spellcheck API at this time; it would
make it hard to change in the future if we need to, especially with spellcheck being...
needing work :)

> Cleanup suggester API
> ---------------------
>                 Key: LUCENE-3807
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/other
>    Affects Versions: 3.6, 4.0
>            Reporter: Simon Willnauer
>             Fix For: 4.0
>         Attachments: LUCENE-3807.patch
> Currently the suggester api and especially TermFreqIterator don't play that nice with
> BytesRef and other paradigms we use in lucene, further the java iterator pattern isn't that
> useful when it gets to work with TermsEnum, BytesRef etc. We should try to clean up this api
> step by step moving over to BytesRef including the Lookup class and its interface...
