From: henrib
To: java-user@lucene.apache.org
Date: Thu, 1 Apr 2010 04:19:07 -0800 (PST)
Subject: Re: Designing a multilingual index

Hi,

I worked some time ago on a similar system (using Solr) and took the multiple-indices route (the multicore feature in Solr). In our case, the "same" document could exist in several languages: different localized versions of the same information, with the same Solr unique id for each l10n version. This allowed us to keep the same index structure across locales while using different settings for each (synonyms, stemmers, etc.).

Maintenance was easier this way: when refining or updating the settings (say, adding synonyms or stemmers), you may need to reindex, and smaller indices allow faster deployments.
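
To make that layout concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene code: the `LocaleCore` class, the toy synonym maps, and the `doc-42` id are all hypothetical stand-ins, and the "analysis" is just lowercasing plus a synonym lookup. It only illustrates the idea of one index per locale, identical structure, per-locale settings, and a shared unique id for the localized versions of a document.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LocaleCore:
    """Hypothetical stand-in for one per-language index (e.g. one Solr core)."""
    locale: str
    synonyms: dict[str, str] = field(default_factory=dict)    # per-locale setting
    docs: dict[str, list[str]] = field(default_factory=dict)  # doc id -> tokens

    def analyze(self, text: str) -> list[str]:
        # Toy analysis chain: lowercase, split on whitespace, apply synonyms.
        return [self.synonyms.get(tok, tok) for tok in text.lower().split()]

    def index(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = self.analyze(text)


# Same structure for every locale, different settings for each.
cores = {
    "en": LocaleCore("en", synonyms={"automobile": "car"}),
    "fr": LocaleCore("fr", synonyms={"automobile": "voiture"}),
}

# The "same" document: one localized version per core, sharing one unique id.
cores["en"].index("doc-42", "Red automobile")
cores["fr"].index("doc-42", "Automobile rouge")

# Changing the French settings only requires reindexing the French core.
cores["fr"].synonyms["rouge"] = "vermeil"
cores["fr"].index("doc-42", "Automobile rouge")
```

In this sketch the English core is untouched by the French settings change, which is the reason smaller per-locale indices keep reindexing cheap.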

It's also "dead-easy" to add a new language (especially compared to the single-index solution). It also makes replication or partitioning easier. Overall, IMO, this is a more scalable architecture than the single-index one.

Users could set the languages in which they were "fluent" (the default being the browser locale), so queries were only performed against those indices and results were "clustered" per locale (no need to return results that cannot be understood...). Besides, IMO, scoring/ordering documents in different languages against each other is a bit like comparing apples and oranges.

Finally, query expansion can also be used in the multiple-indices case and might even use automated/guided translation.

In my experience, multiple indices had many advantages over the single-index solution, be they functional or operational. YMMV.

Hope this helps,
Henrib
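
As a rough illustration of the per-locale querying described above, the sketch below (plain Python, no Solr or Lucene calls) queries only the indices for a user's fluent locales and keeps the hits grouped per locale instead of merging scores across languages. The `search` function, the toy `indices` data, and the per-locale query terms (standing in for some form of guided translation) are assumptions made for the example.

```python
from __future__ import annotations


def search(indices: dict[str, dict[str, list[str]]],
           terms_by_locale: dict[str, str]) -> dict[str, list[str]]:
    """Query only the given locales; return matching doc ids grouped per locale."""
    results: dict[str, list[str]] = {}
    for locale, term in terms_by_locale.items():
        index = indices.get(locale)
        if index is None:
            continue  # no index exists for this language
        hits = sorted(doc_id for doc_id, tokens in index.items() if term in tokens)
        if hits:
            results[locale] = hits  # kept separate: no cross-language ranking
    return results


# Hypothetical toy data: locale -> {doc id -> analyzed tokens}.
indices = {
    "en": {"doc-1": ["red", "car"], "doc-2": ["blue", "bike"]},
    "fr": {"doc-1": ["voiture", "rouge"]},
    "de": {"doc-1": ["rotes", "auto"]},
}

# A user fluent in English and French; the German index is never queried.
grouped = search(indices, {"en": "car", "fr": "voiture"})
print(grouped)
```

Adding a language in this scheme amounts to adding one more entry to `indices`; existing locales are not affected.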