lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
Date Fri, 10 Sep 2010 10:17:59 GMT
Hi Mark
I've played with Shingles recently in some auto-categorisation work where my 
starting assumption was that multi-word terms will hold more information value 
than individual words and that phrase queries on separate terms will not give 
these term combos their true reward (in terms of IDF) - or, if they did compute 
the true IDF, would require lots of disk IO to do so. Shingles present a 
conveniently pre-aggregated score for these combos.
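
To make the IDF point concrete, here is a rough sketch in Python. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)) and assumes that a phrase query scores with the sum of its terms' IDFs. The document frequencies are invented purely for illustration and do not come from any real index.

```python
import math

def classic_idf(doc_freq, num_docs):
    # Classic TF-IDF style inverse document frequency (assumed formula).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical numbers, for illustration only.
num_docs = 1_000_000
term_df = {"new": 200_000, "york": 20_000}   # each word counted on its own
shingle_df = 15_000                          # docs containing the shingle "new-york"

# A phrase query over separate terms only sees per-term statistics.
phrase_idf = sum(classic_idf(df, num_docs) for df in term_df.values())

# A shingle is a single indexed term, so its own docFreq is available directly.
shingle_idf = classic_idf(shingle_df, num_docs)

print("phrase (sum of term IDFs):", round(phrase_idf, 3))
print("shingle (own IDF):        ", round(shingle_idf, 3))
```

The two values generally differ, because the summed per-term figure says nothing about how often the words actually occur together.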
Looking at the results of MoreLikeThis queries based on a shingling analyzer, 
the results I saw generally seemed good, but I did not formally benchmark this 
against non-shingled indexes. Not everything was rosy in that I did see some 
tendency to over-reward certain rare shingles (e.g. a shared mention of "New 
Years Eve Party" pulled otherwise mostly unrelated news articles together). This 
led me to look at using the links in resulting documents to help identify 
clusters of on-topic and potentially off-topic results to tune these 
discrepancies out but that's another topic.
BTW, the Luke tool has a "Zipf" plugin that you may find useful in examining 
index term distributions in Lucene indexes.

Cheers
Mark


________________________________
From: Mark Bennett <mbennett@ideaeng.com>
To: java-dev@lucene.apache.org
Sent: Fri, 10 September, 2010 1:42:11
Subject: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

I want to boost the relevancy of some Question and Answer content. I'm using 
stop words, Dismax, and I'm already a fan of Phrase Boosting and have cranked 
that up a bit. But I'm considering using long Shingles to make use of some of 
the normally stopped out "junk words" in the content to help relevancy further.

Reminder: "Shingles" are artificial tokens created by gluing together adjacent 
words.
    Input text: This is a sentence
    Normal tokens: this, is, a, sentence  (without removing stop words)
    2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence
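
A minimal sketch of that gluing step in Python, for illustration only. The function name `shingles`, its parameters, and the "-" separator are assumptions made for this example; this is not Lucene's ShingleFilter.

```python
def shingles(tokens, min_size=2, max_size=3, sep="-"):
    """Return all shingles of min_size..max_size adjacent tokens, shortest first."""
    out = []
    for n in range(min_size, max_size + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out

# Lowercase and split on whitespace; stop words are deliberately kept.
tokens = "This is a sentence".lower().split()
print(shingles(tokens))
# ['this-is', 'is-a', 'a-sentence', 'this-is-a', 'is-a-sentence']
```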

A few questions on relevance and shingles:

1: How similar are the relevancy calculations between Shingles and exact 
phrases?

I've seen material saying that shingles can give better performance than normal 
phrase searching, and I'm assuming this is exact phrase (vs. allowing for phrase 
slop).

But do the relevancy calculations for normal exact phrase and Shingles wind up 
being *identical*, for the same documents and searches?  That would seem an 
unlikely coincidence, but possibly it could have been engineered to 
intentionally behave that way.

2: What's the latest on Shingles and Dismax?

The low-level, front-end tokenization in Dismax would seem to be a problem, 
but does the new parser stuff help with this?

3: I'm thinking of a minimum 3 word shingle, does anybody have comments on 
shingle length?

Eyeballing the 2 word shingles, they don't seem much better than stop words.  
Obviously my shingle field bypasses stop words.

But the 3 word shingles start to look more useful, expressing more intent, such 
as "how do i", "do i need" and "it work with", etc.

Have there been any Lucene/Solr studies specifically on shingle length?

and finally...

4: Is it useful to examine your token occurrences against a Power-Law log-log 
curve?

So, with either single words, or shingles, you do a histogram, and then plot the 
histogram in an X-Y graph, with both axes being logarithmic. Then see if the 
resulting graph follows (or diverges from) a straight line.  This "Long Tail" / 
Pareto / powerlaw mathematics was very popular a few years ago for looking at 
histograms of DVD rentals and human activities, and prior to the web, the power 
law and 80/20 rules have been observed in many other situations, both man made 
and natural.

Also of interest, when a distribution is expected to follow a power law line, but 
the actual data deviates from that theoretical line, then this might indicate 
some other factors at work, or so the theory goes.
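
As a rough sketch of that check in Python: rank the terms by count, take logs of both rank and count, fit a least-squares line, and use R^2 as a crude measure of how straight the curve is. Reading the histogram as a rank/frequency plot is one interpretation, and the function name `loglog_fit` and the toy text are made up for this example.

```python
import math
from collections import Counter

def loglog_fit(counts):
    """Fit log(count) against log(rank); return (slope, r_squared)."""
    freqs = sorted(counts, reverse=True)          # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # R^2 near 1 means the log-log points lie close to a straight line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared

# Toy usage: counts would normally come from the index's term statistics.
text = "how do i do this and how do i do that and why"
word_counts = Counter(text.split())
slope, r2 = loglog_fit(word_counts.values())
print("slope:", slope, "R^2:", r2)
```

A perfect power law gives a straight line whose slope is the negated exponent. A low R^2, or a visible bend in the plot, is the kind of deviation described above.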

So if users' searches follow any type of histogram with a hidden powerlaw line, 
then it makes sense to me that the source content might also follow a similar 
distribution.  Is the normal IDF ranking inspired by that type of curve?

And *if* word occurrences, in either searches or source documents, were expected 
to follow a power law distribution, then possibly shingles would follow such a 
curve as well.

Thinking that document text, like many other things in nature, might follow such 
a curve, I used the Lucene index to generate such a curve. And I did the same 
thing for 3 word tokens. The 2 curves do have different slopes, but neither is 
very straight.

So I was wondering if anybody else has looked at IDF curves (actually 
non-inverted document frequency curves) or raw word instance counts and power 
law graphs?  I haven't found a smoking gun in my online searches, but I'm 
thinking some of you would know this.


--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



      