lucene-solr-user mailing list archives

From: Jeremy Thomerson <jer...@thomersonfamily.com>
Subject: Multiple Languages in Same Core
Date: Mon, 24 Mar 2014 21:43:12 GMT

I recently deployed Solr to back the site search feature of a site I work
on. The site itself is available in hundreds of languages. With the initial
release of site search we have enabled the feature for ten of those
languages. This is distributed across eight cores, with two Chinese
languages plus Korean combined into one CJK core and each of the other
seven languages in their own individual cores. We split these into separate
cores so that we could keep the same field names across all cores while
configuring analyzers, etc., differently per core.

Now I have some questions on this approach.

1) Scalability: Considering I need to scale this to many dozens more
languages, perhaps hundreds more, is there a better way so that I don't end
up needing dozens or hundreds of cores? My initial plan was that many
languages that didn't have special support within Solr would simply get
lumped into a single "default" core that has some default analyzers that
are applicable to the majority of languages.

1b) Related to this: is there a practical limit to the number of cores that
can be run on one instance of Solr?
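
For reference, the routing plan in question 1 can be sketched roughly as
follows. This is only an illustration, not part of the deployed setup: the
core names, the "zh_hant" and "ko" locale codes, and the core_for_locale
helper are all hypothetical.

```python
# Hypothetical locale-to-core routing table. Core names are made up;
# only the grouping (one CJK core, per-language cores, one shared
# fallback core) follows the plan described in the message.
DEDICATED_CORES = {
    "zh_hans": "cjk",
    "zh_hant": "cjk",  # assumed code for the second Chinese language
    "ko": "cjk",       # assumed code for Korean
    "en": "en",
    "es": "es",
}

# Languages without special analyzer support share one core.
DEFAULT_CORE = "default"


def core_for_locale(locale: str) -> str:
    """Return the name of the core that should hold documents for a locale."""
    return DEDICATED_CORES.get(locale.lower(), DEFAULT_CORE)
```

With a table like this, adding a language that needs no special analysis
requires no new core; only languages with dedicated analyzers get an entry.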

2) Auto Suggest: In phase two I intend to add auto-suggestions as a user
types a query. In reviewing how this is implemented and how the suggestion
dictionary is built I have concerns. If I have more than one language in a
single core (and I keep the same field name for suggestions on all
languages within a core) then it seems that I could get suggestions from
another language returned with a suggest query. Is there a way to build a
separate dictionary for each language, but keep these languages within the
same core?

If it's helpful to know: every core has a "Locale" field whose value is the
locale of that document's language, e.g. "en", "es",
"zh_hans", etc. I'd like to be able to: 1) when building a suggestion
dictionary, divide it into multiple dictionaries, grouping them by locale,
and 2) supply a parameter to the suggest query that allows the suggest
component to only return suggestions from the appropriate dictionary for
that locale.
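
To illustrate the behavior being asked for (not how Solr's suggest
component works), here is a minimal sketch in plain Python. It keeps one
sorted term list per locale and answers prefix queries only from the
requested locale's list. The LocaleSuggester class and its method names are
hypothetical and do not correspond to any Solr or Lucene API.

```python
from bisect import bisect_left
from collections import defaultdict


class LocaleSuggester:
    """Toy suggester that keeps a separate dictionary per locale."""

    def __init__(self):
        self._terms = {}  # locale -> sorted list of lowercased terms

    def build(self, entries):
        """Build the dictionaries from an iterable of (locale, term) pairs."""
        buckets = defaultdict(set)
        for locale, term in entries:
            buckets[locale].add(term.lower())
        self._terms = {loc: sorted(terms) for loc, terms in buckets.items()}

    def suggest(self, locale, prefix, limit=5):
        """Return up to `limit` terms for `locale` that start with `prefix`."""
        terms = self._terms.get(locale, [])
        prefix = prefix.lower()
        # Binary search for the first term that is >= prefix.
        start = bisect_left(terms, prefix)
        matches = []
        for term in terms[start:]:
            if not term.startswith(prefix):
                break  # sorted order: no later term can match
            matches.append(term)
            if len(matches) == limit:
                break
        return matches
```

A query for locale "es" would then never see terms that were indexed under
"en", even though both sets of terms come from the same field.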

If the answer to #1 is "keep splitting groups of languages that have
different analyzers into their own cores" and the answer to #2 is "that's
not supported", then I'd be curious: where would I start to write my own
extension that supported #2? I looked last night at the suggest lookup
classes, dictionary classes, etc., but I didn't see a clear point where it
would be clean to implement what I'm describing above.

Best Regards,
Jeremy Thomerson
