lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Passing arguments to analyzers
Date Tue, 17 Jul 2007 15:40:41 GMT
On 7/17/07, Doğacan Güney <dogacan@gmail.com> wrote:
> Hi,
>
> On 7/17/07, Yonik Seeley <yonik@apache.org> wrote:
> > On 7/17/07, Doğacan Güney <dogacan@gmail.com> wrote:
> > > Hi all,
> > >
> > > Is there a way to pass arguments to analyzers per document? Let's say
> > > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > > can stem more than one language but (obviously) it needs to know the
> > > language of the document it is working on. So what I need is to
> > > specify the language per document (actually per field).
> > >
> > > Here is an example:
> > > <doc>
> > >    <field name="....
> > >     .....
> > >     <field name="foo" lang="en">My spam egg bars baz.</field>
> > > </doc>
> > >
> > > Is something like this possible with Solr?
> >
> > You can pass extra args to a factory in the field-type definition, but
> > that means you would need a separate field-type per language.
>
> Thanks for the answer.
>
> Your suggestion would work for this particular use case, but IMHO
> there are other use cases out there that can benefit (for example, one
> may process the whole document and add parameters for each field based
> on document-level analysis) from this.
>
> Would this be a useful feature for Solr? I would actually like to work
> on it if others consider this as a useful add-on. It seems simple to
> accomplish and it would probably be a good introduction to Solr
> internals.

With respect to passing more info to the analyzer at runtime to alter
its behavior: analyzers are singletons per field-type, and
Analyzer.tokenStream(String fieldName, Reader reader) is called to
analyze a particular value.  That signature carries only the field
name and the value, so there isn't really a good place to pass in
extra info.

During XML parsing, we *could* build up a Map of the parameters we
don't know about, but then the question is what to do with them.  One
hackish solution would be to store them in a thread-local where your
analyzer could check it.  Perhaps a custom request processor could do
that task.
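
A rough sketch of the thread-local idea, in plain Java with no Solr or
Lucene dependencies.  All names here (FieldParams, LanguageAwareAnalyzer,
indexField) are hypothetical stand-ins, not existing Solr APIs: indexField
plays the role of whatever code would see the unknown XML attributes, and
LanguageAwareAnalyzer stands in for a shared analyzer whose method
signature has no room for extra arguments.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ThreadLocalParamsSketch {

    /** Hypothetical holder for per-field parameters; not part of Solr or Lucene. */
    static final class FieldParams {
        // One map per thread, so concurrent indexing threads do not see each other's values.
        private static final ThreadLocal<Map<String, String>> CURRENT =
                ThreadLocal.withInitial(Collections::emptyMap);

        static void set(Map<String, String> params) {
            // Defensive copy so later changes by the caller do not leak in.
            CURRENT.set(Collections.unmodifiableMap(new HashMap<>(params)));
        }

        static String get(String key, String defaultValue) {
            return CURRENT.get().getOrDefault(key, defaultValue);
        }

        static void clear() {
            CURRENT.remove();
        }
    }

    /** Stand-in for a singleton analyzer: it only receives the field name and the value. */
    static final class LanguageAwareAnalyzer {
        String analyze(String fieldName, String value) {
            // The extra argument arrives out of band, via the thread-local.
            String lang = FieldParams.get("lang", "unknown");
            return fieldName + "[" + lang + "]: " + value.toLowerCase(Locale.ROOT);
        }
    }

    // Shared instance, mirroring "analyzers are singletons per field-type".
    static final LanguageAwareAnalyzer SHARED = new LanguageAwareAnalyzer();

    /** Hypothetical hook: publish the unknown attributes, analyze, then always clean up. */
    static String indexField(String name, Map<String, String> extraAttrs, String value) {
        FieldParams.set(extraAttrs);
        try {
            return SHARED.analyze(name, value);
        } finally {
            // Clear so a pooled thread does not carry stale parameters into the next field.
            FieldParams.clear();
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("lang", "en");
        // prints "foo[en]: my spam egg bars baz."
        System.out.println(indexField("foo", attrs, "My spam egg bars baz."));
    }
}
```

The try/finally is the important part: since the analyzer is shared and
threads are typically reused, a value left in the thread-local would
silently apply to the next field analyzed on that thread.  This also only
works if analysis happens on the same thread that set the parameters.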

It seems there does need to be some kind of framework more aligned
with parsing documents (word docs, pdf, etc), for adding metadata to
fields at runtime (how does UIMA or Tika fit into this?), and for
mapping the fields+metadata to Solr/Lucene document fields.

-Yonik