Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Date: Fri, 30 Apr 2010 14:16:50 -0700 (PDT)
From: MitchK <mitch91@web.de>
To: java-user@lucene.apache.org
Message-ID: <1272662210449-768902.post@n3.nabble.com>
In-Reply-To: <E77D01D3-EB7D-4C8C-A817-03BE5A0E5D9E@apache.org>
References: <E6C119F7-C966-4280-94DD-3575BB3CE3F4@apache.org>
 <x2i3504767f1004290759u6de20a3ehfa4d46fb11a97954@mail.gmail.com>
 <h2r698f0de11004300500g71aba2b1g21378ba061b674c@mail.gmail.com>
 <E77D01D3-EB7D-4C8C-A817-03BE5A0E5D9E@apache.org>
Subject: Re: Relevancy Practices
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


I found your thread at the Solr-user-list. However, it seems like your topic
belongs more to Lucene in general?
I copy my posting from there, so that everything is accessible by one
thread.
--------------------------------------------------------------------------

I think the problems one has to solve are depending on the usecases one has
to deal with. 
It makes a difference whether I got much documents that are bloody similar
but with different contexts and I have to determine what query applies to
what context in what probability for which document - or if I have lots of
editorialy managed documents with relatively clear contexts, because they
offer human-created tags etc. 

I haven't made much experiences with Solr (and no experiences in a
productive environment). However, those experiences I have made show that
spliting the document's context in as small parts as possible is always a
good idea. 
I don't mean splitting in a sense of making the part's of a document
smaller. I mean that in a way of making it easier to decide which part of a
document is more important than another. 
e.g.: I got a social network and every user is able to create his or her own
blog - as a corporation I want to make them all searchable. It would be
beneficial for high-quality search, if I am able to extract the
introduction, the category  (maybe added by the author). 

According to this: If this is not done by people, or not well done enough,
than I need to do so algorithmically. 
e.g.: 
If I got a dictionary of person-names, than I could use the keepWordFilter
to create a field I can facet *and* boost on. 
Let's say the user writes about Paris Hilton, Barrack Obama or any other
well known person, than I can extract their names from the content in an
easy way - of course this could be done better, but that's not the point
here. 
If I search for "Obama's speech" all documents with "Obama" could get a
boost. 
The difference between the solution without this keepWordFilter-feature
would be, that Solr does not know that the most important word in this query
is "Obama". 

It is only a shortcut of some ideas on how one can improve the relevancy
with several features that Solr offers out-of-the-box. Some of them could be
improved with external NLP-tools. 

My biggest problem with relevancy is, that I can't work with metadata
computed on the fly or every hour out of the box (okay, you mentioned at the
discussion on the dev-list that it may be possible, however I answered that
the feature you talked about is not well documented, so that I don't know if
it fits my needs or how to use it). 

How to avoid over- or under-tuning? 
Easily: Testing every change I made on scoring-factors against a lot of
queries. If it looks good in 9 of 10 cases in a real good way, than the 10th
case runs against a really bad query or could be solved with a facet or...
there are a lot of ideas how to solve this. What I really want to say is:
Test as much as you can and try to realize what your changes really mean
(for example I can make a boost on the title of a document with a value of
1.000, every other field has got a boost-value between 1 and 10. I am
relatively sure that this meets the needs for some queries but works
catastrophal with the rest). 
It really helps to understand how Lucene's similarity works and what those
factors mean in reality to your existing data. Maybe you need to change the
smiliarity, because you don't want that the length of a document influences
the score of it. 

Just some thougths. I don't think that I tell you much new stuff, however,
if you got any questions or want to know more about this or that, please
ask. 
Unfortunately I can't go to the ApacheCon, but hopefully it helps to give a
good presentation. 

Kind regards 
- Mitch
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Relevancy-Practices-tp765363p768902.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org