lucene-solr-user mailing list archives

From Tim Mahy <t...@infosupport.com>
Subject RE: multi-language searching with Solr
Date Tue, 06 May 2008 18:40:00 GMT
Hi,

you could also use multiple Solr instances, each with language-specific settings (stopwords etc.)
for the same field, and upload each document to the instance matching its language,

and then merge the indexes into one searchable index.
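A minimal sketch of that routing step (the language codes and instance URLs below are made-up placeholders for illustration; the final merge itself can be done with Lucene's IndexWriter.addIndexes or the contrib IndexMergeTool):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative routing table: each document is sent to the Solr instance
// whose schema carries the analyzer/stopword settings for its language.
// The instance URLs here are made-up placeholders, not real endpoints.
public class InstanceRouter {
    private static final Map<String, String> INSTANCES = new HashMap<String, String>();
    static {
        INSTANCES.put("en", "http://localhost:8983/solr/update"); // English-configured instance
        INSTANCES.put("fr", "http://localhost:8984/solr/update"); // French-configured instance
    }

    // Returns the update URL for the instance handling the given language.
    public static String updateUrlFor(String languageCode) {
        String url = INSTANCES.get(languageCode);
        if (url == null) {
            throw new IllegalArgumentException("No instance for language: " + languageCode);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(updateUrlFor("fr"));
    }
}
```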

greetings,
Tim
________________________________________
From: Eli K [system.out@gmail.com]
Sent: Tuesday, May 6, 2008 6:26 PM
To: solr-user@lucene.apache.org
Subject: Re: multi-language searching with Solr

Peter,

Thanks for your help, I will prototype your solution and see if it
makes sense for me.

Eli

On Mon, May 5, 2008 at 5:38 PM, Binkley, Peter
<Peter.Binkley@ualberta.ca> wrote:
> It won't make much difference to the index size, since you'll only be
>  populating one of the language fields for each document, and empty
>  fields cost nothing. The performance may suffer a bit, but Lucene may
>  surprise you with how well it handles that kind of boolean query.
>
>  I agree that as the number of fields and languages increases, this is
>  going to become a lot to manage. But you're up against some basic
>  problems when you try to model this in Solr: for each token, you care
>  about not just its value (which is all Lucene cares about) but also its
>  language and its stem; and the stem for a given token depends on the
>  language (different stemming rules); and at query time you may not know
>  the language. I don't think you're going to get a solution without some
>  redundancy; but solving problems by adding redundant fields is a common
>  method in Solr.
>
>
>  Peter
>
>
>  -----Original Message-----
>  From: Eli K [mailto:system.out@gmail.com]
>  Sent: Monday, May 05, 2008 2:28 PM
>  To: solr-user@lucene.apache.org
>  Subject: Re: multi-language searching with Solr
>
>  Wouldn't this impact both indexing and search performance and the size
>  of the index?
>  It is also probable that I will have more than one free-text field
>  later on, and with at least 20 languages this approach does not seem very
>  manageable.  Are there other options for making this work with stemming?
>
>  Thanks,
>
>  Eli
>
>
>  On Mon, May 5, 2008 at 3:41 PM, Binkley, Peter
>  <Peter.Binkley@ualberta.ca> wrote:
>  > I think you would have to declare a separate field for each language
>  > (freetext_en, freetext_fr, etc.), each with its own appropriate
>  > stemming. Your ingestion process would have to assign the free text
>  > content for each document to the appropriate field; so, for each
>  > document, only one of the freetext fields would be populated. At search
>  > time, you would either search against the appropriate field if you know
>  > the search language, or search across them with "freetext_fr:query OR
>  > freetext_en:query OR ...". That way your query will be interpreted by
>  > each language field using that language's stemming rules.
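The OR-across-fields search Peter describes can be sketched as a small query builder. The freetext_en/freetext_fr field names follow his example; the helper class itself is illustrative, not part of any Solr API:

```java
// Illustrative helper: builds the cross-language query string from Peter's
// example, e.g. "freetext_en:(stemming) OR freetext_fr:(stemming)".
// The freetext_<lang> naming convention is from his message; the class
// itself is a sketch, not a Solr API.
public class CrossLanguageQuery {
    public static String build(String userQuery, String... languages) {
        StringBuilder sb = new StringBuilder();
        for (String lang : languages) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            // Parentheses keep multi-term queries scoped to this field.
            sb.append("freetext_").append(lang).append(":(").append(userQuery).append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("stemming", "en", "fr"));
    }
}
```

Each clause is analyzed by its own field's stemmer at query time, which is exactly why the per-field redundancy pays off.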
>  >
>  > Other options for combining indexes, such as copyfield or dynamic
>  > fields (see http://wiki.apache.org/solr/SchemaXml), would lead to a
>  > single field type and therefore a single type of stemming. You could
>  > always use copyfield to create an unstemmed common index, if you
>  > don't care about stemming when you search across languages (since
>  > you're likely to get odd results when a query in one language is
>  > stemmed according to the rules of another language).
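A minimal schema.xml sketch of this layout, assuming two languages. The field and type names are illustrative; the tokenizer and filter factories are the standard Solr ones:

```xml
<!-- One stemmed field type per language, plus an unstemmed catch-all. -->
<types>
  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_fr" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_plain" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <field name="freetext_en" type="text_en" indexed="true" stored="true"/>
  <field name="freetext_fr" type="text_fr" indexed="true" stored="true"/>
  <!-- Unstemmed common index for cross-language search. -->
  <field name="freetext_all" type="text_plain" indexed="true" stored="false"/>
</fields>
<copyField source="freetext_en" dest="freetext_all"/>
<copyField source="freetext_fr" dest="freetext_all"/>
```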
>  >
>  >  Peter
>  >
>  >
>  >
>  >  -----Original Message-----
>  >  From: Eli K [mailto:system.out@gmail.com]
>  >  Sent: Monday, May 05, 2008 8:27 AM
>  >  To: solr-user@lucene.apache.org
>  >  Subject: multi-language searching with Solr
>  >
>  >  Hello folks,
>  >
>  >  Let me start by saying that I am new to Lucene and Solr.
>  >
>  > I am in the process of designing a search back-end for a system that
>  > receives 20k documents a day and needs to keep them available for 30
>  > days.  The documents should be searchable on a free text field and on
>  > about 8 other fields.
>  >
>  > One of my requirements is to index and search documents in multiple
>  > languages.  I would like to have the ability to stem and provide the
>  > advanced search features that are based on it.  This will only affect
>  > the free text field because the rest of the fields are in English.
>  >
>  > I can find out the language of the document before indexing and I
>  > might be able to provide the language to search on.  I also need to
>  > have the ability to search across all indexed languages (there will
>  > be 20 in total).
>  >
>  > Given these requirements do you think this is doable with Solr?  A
>  > major limiting factor is that I need to stick to the 1.2 GA version
>  > and I cannot utilize the multi-core features in the 1.3 trunk.
>  >
>  > I considered writing my own analyzer that will call the appropriate
>  > Lucene analyzer for the given language, but I did not see any way for
>  > it to access the field that specifies the language of the document.
>  >
>  >  Thanks,
>  >
>  >  Eli
>  >
>  > p.s. I am looking for an experienced Lucene/Solr consultant to help
>  > with the design of this system.
>  >
>  >
>
>



