lucene-general mailing list archives

From "Stephan Müller" <stephanr.muel...@gmx.de>
Subject Re: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields
Date Wed, 27 Nov 2013 10:02:03 GMT
OK, I consider this topic closed on _this_ list. I have reposted it on the 'user' list.

Regards,
Stephan

> Sent: Tuesday, 26 November 2013 at 23:03
> From: Upayavira <uv@odoko.co.uk>
> To: general@lucene.apache.org
> Subject: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields
>
> Stephan,
> 
> This should really go to the Solr user list rather than the general one
> - you might get more response over there.
> 
> Upayavira
> 
> On Tue, Nov 26, 2013, at 01:52 PM, Stephan Müller wrote:
> > Hi,
> > 
> > We are passing a multivalued field to the
> > LanguageIdentifierUpdateProcessor. This multivalued field
> > contains values of arbitrary types (Integer, String, Date).
> > The method LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > doc, String[] fields), which by the way does not use its fields
> > parameter, is unable to process all values of a multivalued field.
> > The call "Object content = doc.getFieldValue(fieldName);" does not care
> > what type the field is and just delegates to SolrInputDocument,
> > which in turn calls getFirstValue.
> > 
> > So, there are two issues:
> > First, if the first value of the multivalued field is not of type
> > String, the field is ignored completely.
> > 
> > Second, the concat method does not concatenate all values of a
> > multivalued field, even though
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html
> > states:
> > "The feature is designed to detect exactly one language per field.
> > In case of multValued, it will concatenate all values before detection."
> > I don't see how the code could do this.
> > 
> > Is this a bug? Is it a deliberate design decision? Did we miss a
> > configuration option that would allow the language identification
> > to use all values of a multivalued field?
> > We are about to write our own
> > LangDetectLanguageIdentifierUpdateProcessorFactory (why is
> > getInstance hardcoded to return LanguageIdentifierUpdateProcessor?)
> > and override LanguageIdentifierUpdateProcessor to
> > handle all values of a multivalued field, ignoring non-string values.
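The difference between the reported behaviour and the proposed one can be sketched in a few lines of Java. This is a minimal illustration, not the Solr source: the class name MultiValuedConcat and both method names are hypothetical, and a plain Map stands in for SolrInputDocument so the example runs without Solr on the classpath. In a real processor the values would presumably come from SolrInputDocument.getFieldValues(fieldName) instead of getFieldValue(fieldName).

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a Map of field name -> values models a document
// with multivalued fields. It is NOT the Solr API.
class MultiValuedConcat {

    // Behaviour as described in the mail: only the first value of each field
    // is inspected, and the field is skipped if that value is not a String.
    static String concatFirstValueOnly(Map<String, List<Object>> doc, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (String fieldName : fields) {
            List<Object> values = doc.get(fieldName);
            if (values == null || values.isEmpty()) {
                continue;
            }
            Object first = values.get(0);
            if (first instanceof String) {
                sb.append((String) first).append(' ');
            }
        }
        return sb.toString();
    }

    // Proposed behaviour: walk every value of each field, append the Strings
    // and silently ignore values of any other type.
    static String concatAllStringValues(Map<String, List<Object>> doc, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (String fieldName : fields) {
            Collection<Object> values = doc.get(fieldName);
            if (values == null) {
                continue;
            }
            for (Object value : values) {
                if (value instanceof String) {
                    sb.append((String) value).append(' ');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        // Mixed types, with a non-String value in first position.
        doc.put("textbody", Arrays.<Object>asList(42, "Hallo Welt", new Date(0L), "noch ein Satz"));
        doc.put("name_tokenized", Arrays.<Object>asList("Titel"));
        String[] fields = {"textbody", "name_tokenized"};

        System.out.println("first value only: [" + concatFirstValueOnly(doc, fields) + "]");
        System.out.println("all string values: [" + concatAllStringValues(doc, fields) + "]");
    }
}
```

With the sample document in main, the first method contributes nothing for textbody because its first value is an Integer, while the second method picks up both textbody strings in addition to the name_tokenized value.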
> > 
> > Please see configuration below.
> > 
> > I hope I was able to make myself clear.
> > 
> > Regards,
> > Stephan
> > 
> > 
> > A little background:
> > We are using a 3rd-party CMS framework which pulls in some magic Solr
> > configuration (namely the textbody field).
> > 
> > The field we are passing is defined as:
> >     <!--
> >       The default text search field.
> >       This field and the field name_tokenized are used as default search
> >       fields
> >       for the /editor and /cmdismax search request handlers in
> >       solrconfig.xml.
> > 
> >       For the Content Feeder the text of all indexed fields of
> >       the CoreMedia document is stored in this field.
> >       The CAE Feeder by default stores the text of all elements in
> >       this field.
> >     -->
> >     <field name="textbody" type="text_general" stored="false"
> >     multiValued="true"/>
> > 
> > As you can see, it is also used as a search field; therefore we want to
> > keep the actual data types of the values.
> > The field itself is generated by a processor, prior to calling the
> > language identification (see processor chain).
> > 
> > 
> > The processor chain:
> >   <updateRequestProcessorChain>
> >     <!-- Improve error messages -->
> >     <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />
> >     <!-- Blob extraction -->
> >     <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> >     <!-- some comments -->
> >     </processor>
> > 
> >     <!-- Textbody handling -->
> >     <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> >     <!-- Copy content of field name to name_tokenized -->
> >     <processor class="solr.CloneFieldUpdateProcessorFactory">
> >       <str name="source">name</str>
> >       <str name="dest">name_tokenized</str>
> >     </processor>
> >     <!--Language detection -->
> >     <processor
> >     class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> >       <str name="langid.fl">textbody,name_tokenized</str>
> >       <str name="langid.langField">language</str>
> >       <str name="langid.fallback">en</str>
> >     </processor>
> >     <!-- Index into language dependent fields if defined (e.g.
> >     textbody_en instead of textbody) -->
> >     <processor
> >     class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> >       <str name="languageField">language</str>
> >       <str name="textFields">textbody,name_tokenized</str>
> >     </processor>
> > 
> >     <processor class="solr.RunUpdateProcessorFactory" />
> >   </updateRequestProcessorChain>
> > 
> > 
> > -- 
> > This e-mail was sent from the "E-Mail made in Germany"
> > security alliance: http://www.gmx.net/e-mail-made-in-germany
> 
