lucene-solr-user mailing list archives

From Trey Grainger <solrt...@gmail.com>
Subject Re: Single multilingual field analyzed based on other field values
Date Tue, 29 Oct 2013 00:07:52 GMT
Hi David,

Which version of the Solr in Action MEAP are you looking at? (The current
version is 12, version 13 is coming out later this week, and prior versions
had significant bugs in the code you are referencing.)  I added an update
processor in the most recent version that can do language identification
and prepend the language codes for you (even removing them from the stored
version of the field and including them only on the indexed version for
text analysis).

You could easily modify this update processor to read the value from the
language field and use it as the basis of the prepended languages.
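To make the transformation concrete, here is a minimal plain-Java sketch of
the prepend step such an update processor would perform (the class and method
names are mine, not the book's): read the codes from the document's language
field and prefix the content with the "en,fr|"-style marker that the
MultiTextField expects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (names are mine, not the book's): the string
// transformation an update processor would apply before indexing.
// It takes the language codes from the document's language field and
// prefixes the content with comma-separated codes plus a '|' delimiter,
// which the MultiTextField later strips to pick per-language analyzers.
public class LanguagePrepender {

    private static final char DELIMITER = '|';

    public static String prependLanguages(List<String> languages, String content) {
        if (languages == null || languages.isEmpty()) {
            return content; // no prefix; the field falls back to its default analyzer
        }
        return String.join(",", languages) + DELIMITER + content;
    }

    public static void main(String[] args) {
        // A document whose language field holds ["en", "fr"]:
        System.out.println(prependLanguages(Arrays.asList("en", "fr"), "blah, blah"));
        // en,fr|blah, blah
    }
}
```

The stored version of the field would keep the raw content, and only the
indexed version would carry the prefix, as described above.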

Otherwise, if you want to do language detection instead of passing in the
language manually, the MultiTextField in chapter 14 of Solr in Action and the
corresponding MultiTextFieldLanguageIdentifierUpdateProcessor should handle
all of the language detection and prepending automatically for you (and
also append the identified language to a separate field).

If it were easy (or even possible) to access the rest of the fields in the
document from within a field's Analyzer, then I would certainly have opted
for that approach instead of prepending languages to the content.  If the
prepending is too cumbersome, you could probably rewrite the MultiTextField
to pull the language from the field name instead of the content (i.e.,
<field name="myField|en,fr">blah, blah</field> instead of
<field name="myField">en,fr|blah, blah</field> as currently designed).
This would make specifying the language much easier (especially at query
time, since you only have to specify the languages once instead of on each
term), and Solr could still search the same underlying field for all
languages.  Same general idea, though.
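For illustration, the parsing difference between the two schemes could look
like this (a rough sketch with hypothetical helper names, not the book's
actual code): under the current design the language list is split off every
field value, while under the field-name variant it is split off the field
name once.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch (helper names are mine): where the language list
// lives under each scheme.
public class LanguageKeyParser {

    // Current design: languages ride in the field value,
    // e.g. "en,fr|blah, blah" -> [en, fr]
    public static List<String> languagesFromValue(String value) {
        int delim = value.indexOf('|');
        if (delim < 0) return Collections.emptyList();
        return Arrays.asList(value.substring(0, delim).split(","));
    }

    // Alternative design: languages ride in the field name,
    // e.g. "myField|en,fr" -> [en, fr]
    public static List<String> languagesFromFieldName(String fieldName) {
        int delim = fieldName.indexOf('|');
        if (delim < 0) return Collections.emptyList();
        return Arrays.asList(fieldName.substring(delim + 1).split(","));
    }
}
```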

In terms of your ThreadLocal cache idea... that sounds really scary to me.
Analyzers cache their TokenStreamComponents in a ThreadLocal, depending on
the internal ReuseStrategy, and I'm skeptical that you'll be able to pull
this off cleanly.  It would really be hacking around the Lucene APIs even
if you could.

-Trey


On Mon, Oct 28, 2013 at 5:15 PM, Jack Krupansky <jack@basetechnology.com> wrote:

> Consider an update processor - it can operate on any field and has access
> to all fields.
>
> You could have one update processor to combine all the fields to process,
> into a temporary, dummy field. Then run a language detection update
> processor on the combined field. Then process the results and place in the
> desired field. And finally remove any temporary fields.
>
> -- Jack Krupansky
> -----Original Message----- From: David Anthony Troiano
> Sent: Monday, October 28, 2013 4:47 PM
> To: solr-user@lucene.apache.org
> Subject: Single multilingual field analyzed based on other field values
>
>
> Hello,
>
> First some background...
>
> I am indexing a multilingual document set where documents themselves can
> contain multiple languages.  The language(s) within my documents are known
> ahead of time.  I have tried separate fields per language, and due to the
> poor query performance I'm seeing with that approach (many languages /
> fields), I'm trying to create a single multilingual field.
>
> One approach to this problem is given in Section 14.6.4 of the new Solr
> in Action book
> (https://docs.google.com/a/basistech.com/file/d/0B3NlE_uL0pqwR0hGV0M1QXBmZm8/edit).
> The approach is to take the document content field and prepend it with
> the list of contained languages followed by a special delimiter.  A new
> field type is defined that maps languages to sub field types, and the new
> type's tokenizer then runs all of the sub field type analyzers over the
> field and merges the results, adjusts offsets for the prepended data, etc.
>
> Due to the tokenizer complexity incurred, I'd like to pursue a more
> flexible approach, which is to run the various language-specific analyzers
> not based on prepended codes, but instead based on other field values
> (i.e., a language field).
>
> I don't see a straightforward way to do this, mostly because a field
> analyzer doesn't have access to the rest of the document.  On the flip
> side, an UpdateRequestProcessor would have access to the document but
> doesn't really give a path to wind up where I want to be (single field with
> different analyzers run dynamically).
>
> Finally, my question: is it possible to thread cache document language(s)
> during UpdateRequestProcessor execution (where we have access to the full
> document), so that the analyzer can then read from the cache to determine
> which analyzer(s) to run?  More specifically, if a document is run through
> its URP chain on thread T, will its analyzer(s) also run on thread T and
> will no other documents be run through the URP on that thread in the
> interim?
>
> Thanks,
> Dave
>
