lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: No Analyzer, tokenizer or stemmer works at Solr
Date Fri, 08 Jan 2010 14:53:53 GMT
Somewhere, you have to create the document XML you
send to SOLR. Just add the calculated data to
your new field there...
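
A rough sketch of what that could look like, assuming a Python client that
builds the XML update message itself. The field names price_category and
title_md5 are made up for illustration and would have to be declared in your
schema.xml; the 500 threshold and the md5 idea are taken from the examples
earlier in this thread. Treat it as one possible approach, not the canonical one.

```python
import hashlib
import xml.etree.ElementTree as ET

# Threshold from the thread's example: a price higher than 500 is "expensive".
EXPENSIVE_THRESHOLD = 500


def build_add_xml(doc):
    """Return a Solr <add><doc>...</doc></add> update message for one record."""
    fields = dict(doc)

    # Derived values are computed here, in the code that populates the
    # document, not inside an Analyzer. Both field names are hypothetical.
    fields["price_category"] = (
        "expensive" if doc["price"] > EXPENSIVE_THRESHOLD else "cheap"
    )
    fields["title_md5"] = hashlib.md5(doc["title"].encode("utf-8")).hexdigest()

    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml(
    {"id": "1", "title": "The Good, the Bad and the Ugly", "price": 650}
)
print(xml_message)
```

The resulting string is what you would POST to the update handler. Whether the
extra fields are indexed, stored, or both is then purely a schema decision.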

HTH
Erick

On Fri, Jan 8, 2010 at 9:30 AM, MitchK <mitch91@web.de> wrote:

>
> Okay, you're right. It really would be cleaner to do such things in the
> code that populates the documents sent to Solr.
>
> Is there a way to prepare a document the described way with Lucene/Solr,
> before I analyze it?
> My use case is to categorize several documents in an automatic way, which
> includes that I have to "create" data from the given input doing some
> information retrieval.
>
> The problem is that I am really new to Solr and Lucene, as you can see, and I
> do not know whether there are classes that fit my needs.
>
> Any idea?
>
>
> Erick Erickson wrote:
> >
> > Well, I'd approach either of these use cases
> > by simply performing my computations on
> > the input and storing the result in another
> > (non-indexed unless I wanted to search it)
> > field. This wouldn't happen in the Analyzer,
> > but in the code that populated the document
> > fields.....
> >
> > Which is a much cleaner solution IMO than creating
> > some sort of "index this but store that" capability.
> > The purpose of analysis is to produce *searchable*
> > tokens after all.
> >
> > But we're getting into angels dancing on pins here. Do
> > you actually have a use case you're trying to implement
> > or is this mostly theoretical?
> >
> > Erick
> >
> > On Thu, Jan 7, 2010 at 2:08 PM, MitchK <mitch91@web.de> wrote:
> >
> >>
> >> The difference between stored and indexed is clear now.
> >>
> >> You are right, if you are responding only to "normal users".
> >>
> >> Use case:
> >> You have a stored field "The good, the bad and the ugly".
> >> And you have a really fantastic analyzer, which does some magic to this
> >> movie title.
> >> Let's say the analyzer translates the title into md5 or into another
> >> abstract expression.
> >> Instead of applying the same magical function on the client's side again
> >> and again, the client only needs to take the prepared data from your
> >> response.
> >>
> >> Another use case could be:
> >> Imagine you have two categories, cheap and expensive, and your document
> >> has a title, a label, an owner and a price field.
> >> Imagine you analyze, index and store them as you normally do, and
> >> afterwards you want to set whether the document belongs to the expensive
> >> item group or not.
> >> If the price of the item is higher than 500$, it belongs to the expensive
> >> ones, otherwise not.
> >> I think this would be a job for a special analyzer, and this only makes
> >> sense if I also store the analyzed data.
> >>
> >> I think information retrieval is a really interesting use case.
> >>
> >>
> >> Erick Erickson wrote:
> >> >
> >> > What is your use case for "responding sometimes with the indexed value"?
> >> > Other than reconstructing a field that hasn't been stored, I can't think
> >> > of one.
> >> >
> >> > I still think you're missing the point. Indexing and storing are
> >> > orthogonal operations that have (almost) nothing to do with each
> >> > other, for all that they happen at the same time on the same field.
> >> >
> >> > You never search against the stored data in a field. You *always*
> >> > search against the indexed data.
> >> >
> >> > Contrariwise, you never display the indexed form to the user, you
> >> > *always* show the stored data (unless you come up with
> >> > a really interesting use case).
> >> >
> >> > Step back and consider what happens when you index data:
> >> > it gets broken up in all kinds of ways. Stop words are removed,
> >> > case may change, etc. It makes no sense to
> >> > then display this data to a user. Take, say, a movie title
> >> > "The Good, The Bad, and The Ugly". Remove stopwords and
> >> > punctuation, lowercase it, and you index three tokens:
> >> > "good", "bad", "ugly". Even if you reconstruct this field,
> >> > the user would see "good bad ugly". Bad, very bad.
> >> >
> >> > Yet I want to display the original title to the user in
> >> > response to searching on "ugly", so I need the
> >> > original, unanalyzed data.
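
To make the split concrete, here is a toy model in Python. It is only a sketch
for illustration, not how Lucene is implemented: the stop word list, the
ToyIndex class and its OR-style search are all assumptions. The point it shows
is that searching runs against the analyzed tokens while the response is built
from the untouched stored text.

```python
import re
from collections import defaultdict

# Assumed stop word list, just large enough for the example title.
STOPWORDS = {"the", "and", "a", "an", "of"}


def analyze(text):
    """Toy analyzer: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.inverted = defaultdict(set)  # token -> doc ids ("indexed" side)
        self.stored = {}                  # doc id -> original text ("stored" side)

    def add(self, doc_id, text):
        # Indexing: only the analyzed tokens go into the inverted index.
        for token in analyze(text):
            self.inverted[token].add(doc_id)
        # Storing: the original text is kept verbatim, untouched by analysis.
        self.stored[doc_id] = text

    def search(self, query):
        # Match against indexed tokens, answer with stored values.
        hits = set()
        for token in analyze(query):
            hits |= self.inverted.get(token, set())
        return [self.stored[doc_id] for doc_id in sorted(hits)]


index = ToyIndex()
index.add(1, "The Good, The Bad, and The Ugly")
print(analyze("The Good, The Bad, and The Ugly"))  # ['good', 'bad', 'ugly']
print(index.search("ugly"))  # ['The Good, The Bad, and The Ugly']
```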
> >> >
> >> > Perhaps it would help to think of it this way.
> >> > 1> take some data and index it in f1
> >> >     but do NOT store it in f1. Store it in f2
> >> >     but do NOT index it in f2.
> >> > 2> take that same data, index AND store
> >> >     it in f3.
> >> >
> >> > <1> is almost entirely equivalent to <2>
> >> > in terms of index resources.
> >> >
> >> > Practically though, <1> is harder to use,
> >> > because you have to remember
> >> > to use f1 for searching and f2 for getting
> >> > the raw data.
> >> >
> >> > HTH
> >> > Erick
> >> >
> >> > On Thu, Jan 7, 2010 at 12:11 PM, MitchK <mitch91@web.de> wrote:
> >> >
> >> >>
> >> >> Thank you, Ryan. I will have a look at the Lucene material and Luke.
> >> >>
> >> >> I think I got it. :)
> >> >>
> >> >> Sometimes there will be a need to return both the original value and
> >> >> the indexed version of the value.
> >> >> How can I fulfill such needs? By using copyField on indexed-only fields?
> >> >>
> >> >>
> >> >>
> >> >> ryantxu wrote:
> >> >> >
> >> >> >
> >> >> > On Jan 7, 2010, at 10:50 AM, MitchK wrote:
> >> >> >
> >> >> >>
> >> >> >> Erick,
> >> >> >>
> >> >> >> you mean, everything is okay, but I do not see it?
> >> >> >>
> >> >> >>>> Internally for searching the analysis takes place and writes to the
> >> >> >>>> index in an inverted fashion, but the stored stuff is left alone.
> >> >> >>
> >> >> >> if I use an analyzer, Solr "stores" its output two ways?
> >> >> >> One public output, which is similar to the original input
> >> >> >> and one "hidden" or internal output, which is based on the
> >> >> >> analyzer's work?
> >> >> >> Did I understand that right?
> >> >> >
> >> >> > yes.
> >> >> >
> >> >> > indexed fields and stored fields are different.
> >> >> >
> >> >> > Solr results show stored fields in the results (however facets are
> >> >> > based on indexed fields)
> >> >> >
> >> >> > Take a look at Lucene in Action for a better description of what is
> >> >> > happening.  The best tool to get your head around what is happening is
> >> >> > probably luke (http://www.getopt.org/luke/)
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> If yes, I have got another problem:
> >> >> >> I don't want to waste any diskspace.
> >> >> >
> >> >> > You have control over what is stored and what is indexed -- how that
> >> >> > is configured is up to you.
> >> >> >
> >> >> > ryan
> >> >> >
> >> >> >
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html
> >> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >> >>
> >> >>
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27065305.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27076795.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
