Date: Sat, 9 Jun 2007 02:53:14 -0700 (PDT)
From: Henrib <hbiestro@gmail.com>
To: solr-user@lucene.apache.org
Subject: Re: Multi-language indexing and searching

Hi Daniel,

Trying to recap: you are indexing documents that can be in different
languages.
On the query side, users will only search in one language at a time and get
results in that language. Setting aside the webapp deployment problem, the
alternatives are thus:
  option1: one schema with all fields of all languages pre-defined
  option2: one schema per language, with the same field names (but a
           different type).

You indicate that your documents do have a field carrying the language. Is
the Solr document format the authoring format of the documents you index, or
do they require some pre-processing to extract those fields? For instance,
are the source documents in HTML, pre-processed using some XPath/magic to
generate the fields? In that case, with option1 the pre-processing
transformation needs to know which fields to generate according to the
language; with option2 you need to know which core to target based on the
language. It goes the same way for querying: option1 needs a query with
different fields for each language, option2 requires targeting the correct
core.

In the other case, i.e. if the Solr document format is the source format,
indexing requires some script (curl or else) to send the documents to Solr;
having the script determine which core to target doesn't seem (from afar) a
hard task (grep/awk to the rescue :-)).

On the maintenance side, if you were to change the schema, needed to reindex
one language, or wanted to add a language, option1 seems to have a 'wider'
impact, its functional grain being coarser. Besides, if your collections are
huge or grow fast, it might be nice to have an easy way to partition the
workload across different machines, which seems easier with option2,
directing indexing & queries to a site based on the language.

On the webapp deployment side, option1 is a breeze; option2 requires
multiple web-apps (forgetting the SOLR-215 patch, which is unlikely to be
reviewed and accepted soon since its functional value is not shared).

Hope this helps in your choice.
Regards,
Henri


Daniel Alheiros wrote:
>
> Hi Henri.
>
> Thanks for your reply.
> I've just looked at the patch you referred to, but doing this I will lose
> the out-of-the-box Solr installation... I'll have to create my own Solr
> application responsible for creating the multiple cores, and I'll have to
> change my indexing process to something able to notify content for a
> specific core.
>
> Can't I have the same index, using one single core, same field names being
> processed by language-specific components based on a field/parameter?
>
> I will try to draw what I'm thinking; please forgive me if I'm not using
> the correct terms, but I'm not an IR expert.
>
> Thinking in a workflow:
>
> Indexing:
>     Multilanguage indexer receives some documents
>     for each document, verify the "language" field
>         if language = "English" then process using the EnglishIndexer
>         else if language = "Chinese" then process using the ChineseIndexer
>         else if ...
>
> Querying:
>     Multilanguage Request Handler receives a request
>     if parameter language = "English" then process using the English
>     Request Handler
>     else if parameter language = "Chinese" then process using the Chinese
>     Request Handler
>     else if ...
>
> I can see that in the schema field definitions, we have some
> language-dependent parameters... It can be a problem, as I would like to
> have the same fields for all requests...
>
> Sorry to bother, but before I split all my data this way I would like to
> be sure that it's the best approach for me.
>
> Regards,
> Daniel
>
>
> On 8/6/07 15:15, "Henrib" wrote:
>
>>
>> Hi Daniel,
>> If it is functionally 'ok' to search in only one lang at a time, you
>> could try having one index per lang. Each per-lang index would have one
>> schema where you would describe field types (the lang part coming
>> through stemming/snowball analyzers, per-lang stopwords & al) and the
>> same field name could be used in each of them.
>> You could either deploy that solution through multiple web-apps (one per
>> lang) or try the patch for issue SOLR-215.
>> Regards,
>> Henri
>>
>>
>> Daniel Alheiros wrote:
>>>
>>> Hi,
>>>
>>> I'm just starting to use Solr and so far, it has been a very
>>> interesting learning process. I wasn't a Lucene user, so I'm learning
>>> a lot about both.
>>>
>>> My problem is:
>>> I have to index and search content in several languages.
>>>
>>> My scenario is a bit different from others that I've already read about
>>> in this forum, as my client is the same for searching any language, and
>>> it could be accomplished using a field to define the language.
>>>
>>> My questions are more focused on how to keep the benefits of all the
>>> protwords, stopwords and synonyms in a multilanguage situation...
>>>
>>> Should I create new Analyzers that can deal with the "language" field
>>> of the document? What do you recommend?
>>>
>>> Regards,
>>> Daniel
>>>
>>>
>>> http://www.bbc.co.uk/
>>> This e-mail (and any attachments) is confidential and may contain
>>> personal views which are not the views of the BBC unless specifically
>>> stated. If you have received it in error, please delete it from your
>>> system. Do not use, copy or disclose the information in any way nor
>>> act in reliance on it and notify the sender immediately.
>>> Please note that the BBC monitors e-mails sent or received.
>>> Further communication will signify your consent to this.
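PS - to make the "script determines which core to target" idea concrete,
here is a minimal sketch of an option2 routing script. It assumes
per-language cores are reachable at http://localhost:8983/solr/<lang>/update
and that each Solr XML document carries a <field name="language"> value;
the core layout and the field name are illustrative choices, not something
prescribed by Solr or by this thread.

```shell
#!/bin/sh
# Sketch: route a Solr XML document to the core matching its language field.
# ASSUMPTIONS (illustrative, adapt to your setup):
#   - one core per language, update handler at $SOLR_BASE/<lang>/update
#   - documents contain: <field name="language">en</field>

SOLR_BASE="http://localhost:8983/solr"

# Extract the language code from a Solr XML document (sed, in the
# grep/awk spirit mentioned above; a real XML parser would be safer).
doc_lang() {
  sed -n 's/.*<field name="language">\([^<]*\)<\/field>.*/\1/p' "$1" | head -n 1
}

# Map a language code to the update URL of its core.
core_url() {
  echo "$SOLR_BASE/$1/update"
}

# Usage: route_doc doc.xml  -- posts the document to the matching core.
route_doc() {
  lang=$(doc_lang "$1")
  curl -s "$(core_url "$lang")" \
       --data-binary @"$1" \
       -H 'Content-Type: text/xml'
}
```

A cron job or post-commit hook can then call route_doc on each new document;
the same core_url mapping can drive query routing on the search side.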
--
View this message in context: http://www.nabble.com/Multi-language-indexing-and-searching-tf3885324.html#a11038890
Sent from the Solr - User mailing list archive at Nabble.com.