lucene-dev mailing list archives

From "Dawid Weiss (JIRA)" <>
Subject [jira] [Commented] (SOLR-4781) Language profiles embedded twice (langid).
Date Thu, 02 May 2013 10:36:15 GMT


Dawid Weiss commented on SOLR-4781:

I checked out the source code of langdetect. There is a lot of room for improvement:
- it splits all the text into a list of n-grams instead of iterating over it; upon early
termination many of these n-grams are never even used,
- many data structures are keyed on strings (maps etc.); I don't think these are needed here,
- a full sort() of the probabilities is performed just to get the most likely hit (language),
- the initialization factory takes files, I presume just because it cannot iterate over
classpath entries.
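
To illustrate the first and third points, here is a minimal sketch. It is not langdetect's actual code; the class LangIdSketch, the NgramVisitor callback and both method names are hypothetical. It assumes the probabilities sit in a plain double[] and that n-grams of length 1..maxN are wanted. It shows a single linear scan in place of a full sort, and an in-place walk over the n-grams that can stop early without building a list first.

```java
public class LangIdSketch {

    /** Callback that receives each n-gram as a window [start, end) into the text. */
    public interface NgramVisitor {
        /** Return false to stop the iteration early. */
        boolean visit(CharSequence text, int start, int end);
    }

    /** Returns the index of the largest value in one pass, or -1 for empty input. */
    public static int argMax(double[] probs) {
        int best = -1;
        for (int i = 0; i < probs.length; i++) {
            if (best < 0 || probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Visits every n-gram of length 1..maxN without materializing a list. */
    public static void forEachNgram(CharSequence text, int maxN, NgramVisitor visitor) {
        for (int end = 1; end <= text.length(); end++) {
            for (int n = 1; n <= maxN && n <= end; n++) {
                if (!visitor.visit(text, end - n, end)) {
                    return; // early termination: the remaining n-grams are never created
                }
            }
        }
    }

    public static void main(String[] args) {
        // Most likely language index, found without sorting.
        System.out.println(argMax(new double[] {0.1, 0.7, 0.2}));

        // Print the n-grams of a short string one at a time.
        forEachNgram("abc", 2, (text, start, end) -> {
            System.out.println(text.subSequence(start, end));
            return true;
        });
    }
}
```

Passing offsets rather than substrings also avoids allocating a String per n-gram, which is relevant to the second point about string-keyed structures.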
> Language profiles embedded twice (langid).
> ------------------------------------------
>                 Key: SOLR-4781
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - LangId
>            Reporter: Dawid Weiss
>            Priority: Trivial
> Just something I noticed. Langid imports langdetect from Maven; this includes language
> profiles already so a redundant copy is kept in Solr source code (and in target binaries).
> All the files except two are identical. The two different profiles are for 'ro' and 'vi' (Romanian
> and Vietnamese I presume). I checked the git repo and both have been adjusted by the author
> to support some notion of normalization. I think Solr should use the embedded profiles since
> they most likely come together with changes in the source code.
