Date: Sat, 9 Jun 2007 02:53:14 -0700 (PDT)
From: Henrib <hbiestro@gmail.com>
To: solr-user@lucene.apache.org
Subject: Re: Multi-language indexing and searching

Hi Daniel,

Trying to recap: you are indexing documents that can be in different
languages.
On the query side, users will only search in one language at a time and get
results in that language. Setting aside the webapp deployment problem, the
alternatives are thus:
  option1: one schema with all fields of all languages pre-defined
  option2: one schema per language, with the same field names (but a
           different type).

You indicate that your documents do have a field carrying the language. Is
the Solr document format the authoring format of the documents you index, or
do they require some pre-processing to extract those fields? For instance,
are the source documents in HTML, pre-processed using some XPath/magic to
generate the fields? In that case, with option1 the pre-processing
transformation needs to know which fields to generate according to the
language; with option2 you need to know which core to target based on the
language. It goes the same way for querying: option1 needs a query with
different fields for each language, option2 requires targeting the correct
core.

In the other case, i.e. if the Solr document format is the source format,
indexing requires some script (curl or else) to send the documents to Solr;
having the script determine which core to target doesn't seem (from afar) a
hard task (grep/awk to the rescue :-)).

On the maintenance side, if you were to change the schema, needed to reindex
one language, or wanted to add a language, option1 seems to have a 'wider'
impact, its functional grain being coarser. Besides, if your collections are
huge or grow fast, it might be nice to have an easy way to partition the
workload across different machines, which seems easier with option2,
directing indexing & queries to a site based on the language.

On the webapp deployment side, option1 is a breeze; option2 requires
multiple web-apps (forgetting the SOLR-215 patch, which is unlikely to be
reviewed and accepted soon since its functional value is not shared).

Hope this helps in your choice.
Regards,
Henri


Daniel Alheiros wrote:
>
> Hi Henri.
>
> Thanks for your reply.
> I've just looked at the patch you referred to, but doing this I will lose
> the out-of-the-box Solr installation... I'll have to create my own Solr
> application responsible for creating the multiple cores, and I'll have to
> change my indexing process to something able to notify content for a
> specific core.
>
> Can't I have the same index, using one single core, same field names being
> processed by language-specific components based on a field/parameter?
>
> I will try to draw what I'm thinking; please forgive me if I'm not using
> the correct terms, but I'm not an IR expert.
>
> Thinking in a workflow:
>
> Indexing:
>     Multilanguage indexer receives some documents
>     for each document, verify the "language" field
>         if language = "English" then process using the EnglishIndexer
>         else if language = "Chinese" then process using the ChineseIndexer
>         else if ...
>
> Querying:
>     Multilanguage Request Handler receives a request
>     if parameter language = "English" then process using the English
>     Request Handler
>     else if parameter language = "Chinese" then process using the Chinese
>     Request Handler
>     else if ...
>
> I can see that in the schema field definitions, we have some
> language-dependent parameters... It can be a problem, as I would like to
> have the same fields for all requests...
>
> Sorry to bother, but before I split all my data this way I would like to
> be sure that it's the best approach for me.
>
> Regards,
> Daniel
>
>
> On 8/6/07 15:15, "Henrib" wrote:
>
>>
>> Hi Daniel,
>> If it is functionally 'ok' to search in only one lang at a time, you
>> could try having one index per lang. Each per-lang index would have one
>> schema where you would describe field types (the lang part coming
>> through stemming/snowball analyzers, per-lang stopwords & al) and the
>> same field name could be used in each of them.
>> You could either deploy that solution through multiple web-apps (one per
>> lang) or try the patch for issue SOLR-215.
>> Regards,
>> Henri
>>
>>
>> Daniel Alheiros wrote:
>>>
>>> Hi,
>>>
>>> I'm just starting to use Solr and so far, it has been a very
>>> interesting learning process. I wasn't a Lucene user, so I'm learning
>>> a lot about both.
>>>
>>> My problem is:
>>> I have to index and search content in several languages.
>>>
>>> My scenario is a bit different from others that I've already read about
>>> in this forum, as my client is the same for searching any language, and
>>> it could be accomplished using a field to define the language.
>>>
>>> My questions are more focused on how to keep the benefits of all the
>>> protwords, stopwords and synonyms in a multilanguage situation...
>>>
>>> Should I create new Analyzers that can deal with the "language" field
>>> of the document? What do you recommend?
>>>
>>> Regards,
>>> Daniel
>>>
>>>
>>> http://www.bbc.co.uk/
>>> This e-mail (and any attachments) is confidential and may contain
>>> personal views which are not the views of the BBC unless specifically
>>> stated. If you have received it in error, please delete it from your
>>> system. Do not use, copy or disclose the information in any way nor
>>> act in reliance on it and notify the sender immediately.
>>> Please note that the BBC monitors e-mails sent or received.
>>> Further communication will signify your consent to this.
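PS - to make the "script determines which core to target" idea concrete,
here is a minimal sketch of an option2 routing script. It assumes
per-language cores are reachable at http://localhost:8983/solr/<lang>/update
and that each Solr XML document carries a <field name="language"> value;
the core layout and the field name are illustrative choices, not something
prescribed by Solr or by this thread.

```shell
#!/bin/sh
# Sketch: route a Solr XML document to the core matching its language field.
# ASSUMPTIONS (illustrative, adapt to your setup):
#   - one core per language, update handler at $SOLR_BASE/<lang>/update
#   - documents contain: <field name="language">en</field>

SOLR_BASE="http://localhost:8983/solr"

# Extract the language code from a Solr XML document (sed, in the
# grep/awk spirit mentioned above; a real XML parser would be safer).
doc_lang() {
  sed -n 's/.*<field name="language">\([^<]*\)<\/field>.*/\1/p' "$1" | head -n 1
}

# Map a language code to the update URL of its core.
core_url() {
  echo "$SOLR_BASE/$1/update"
}

# Usage: route_doc doc.xml  -- posts the document to the matching core.
route_doc() {
  lang=$(doc_lang "$1")
  curl -s "$(core_url "$lang")" \
       --data-binary @"$1" \
       -H 'Content-Type: text/xml'
}
```

A cron job or post-commit hook can then call route_doc on each new document;
the same core_url mapping can drive query routing on the search side.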
--
View this message in context: http://www.nabble.com/Multi-language-indexing-and-searching-tf3885324.html#a11038890
Sent from the Solr - User mailing list archive at Nabble.com.