lucene-dev mailing list archives

From Mark Bennett <mbenn...@ideaeng.com>
Subject Relevancy, Phrase Boosting, Shingles and Long Tail Curves
Date Fri, 10 Sep 2010 00:42:11 GMT
I want to boost the relevancy of some Question and Answer content. I'm using
stop words, Dismax, and I'm already a fan of Phrase Boosting and have
cranked that up a bit. But I'm considering using long Shingles to make use
of some of the normally stopped out "junk words" in the content to help
relevancy further.

Reminder: "Shingles" are artificial tokens created by gluing together
adjacent words.
    Input text: This is a sentence
    Normal tokens: this, is, a, sentence  (without removing stop words)
    2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence
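
A minimal sketch of the shingling described above, in plain Python rather than
through a Lucene analyzer. The function name, the whitespace tokenization, and
the "-" separator are assumptions for illustration only; they are not the
behavior of any particular Lucene filter.

```python
def shingles(text, min_size=2, max_size=3, sep="-"):
    """Glue adjacent lowercased words into shingles of min_size..max_size words."""
    tokens = text.lower().split()  # naive whitespace tokenizer, no stop word removal
    out = []
    for size in range(min_size, max_size + 1):
        # slide a window of `size` words across the token list
        for i in range(len(tokens) - size + 1):
            out.append(sep.join(tokens[i:i + size]))
    return out


print(shingles("This is a sentence"))
# ['this-is', 'is-a', 'a-sentence', 'this-is-a', 'is-a-sentence']
```

Passing min_size=3 would give only the 3 word shingles.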

A few questions on relevance and shingles:

1: How do the relevancy calculations compare between Shingles and exact
phrases?

I've seen material saying that shingles can give better performance than
normal phrase searching, and I'm assuming this means exact phrase searching
(vs. allowing for phrase slop).

But do the relevancy calculations for normal exact phrase and Shingles wind
up being *identical*, for the same documents and searches?  That would seem
an unlikely coincidence, but possibly it could have been engineered to
intentionally behave that way.

2: What's the latest on Shingles and Dismax?

The low-level tokenization at the front end of Dismax would seem to be a
problem, but does the new parser stuff help with this?

3: I'm thinking of a minimum 3 word shingle, does anybody have comments on
shingle length?

Eyeballing the 2 word shingles, they don't seem much better than stop
words.  Obviously my shingle field bypasses stop words.

But the 3 word shingles start to look more useful, expressing more intent,
such as "how do i", "do i need" and "it work with", etc.

Have there been any Lucene/Solr studies specifically on shingle length?

and finally...

4: Is it useful to examine your token occurrences against a Power-Law
log-log curve?

So, with either single words or shingles, you do a histogram, and then plot
the histogram in an X-Y graph, with both axes being logarithmic. Then see if
the resulting graph follows (or diverges from) a straight line.  This "Long
Tail" / Pareto / power law mathematics was very popular a few years ago for
looking at histograms of DVD rentals and human activities, and prior to the
web, the power law and 80/20 rules had been observed in many other
situations, both man made and natural.

Also of interest, when a distribution is expected to follow a power law,
but the actual data deviates from that theoretical line, then this might
indicate some other factors at work, or so the theory goes.
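
A rough sketch of that straight-line check, assuming the tokens (single words
or shingles) have already been dumped from the index into a plain Python list.
The function names are hypothetical, and the least-squares fit on the log-log
points is only one simple way to judge "straightness"; it is not a rigorous
power-law test.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, count) pairs, with the most frequent token at rank 1."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_fit(points):
    """Least-squares line through (log10 rank, log10 count).

    Returns (slope, intercept, r_squared). Needs at least two distinct ranks.
    """
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(count) for _, count in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r_squared near 1.0 means the log-log points lie close to a straight line
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic data that follows count = 1000 / rank exactly
ideal = [(rank, 1000.0 / rank) for rank in range(1, 51)]
slope, intercept, r_squared = loglog_fit(ideal)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

On real index data, the slope gives the power-law exponent estimate, and a low
r_squared (or a visible bend in the plot) is the kind of deviation described
above.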

So if users' searches follow any type of histogram with a hidden powerlaw
line, then it makes sense to me that the source content might also follow a
similar distribution.  Is the normal IDF ranking inspired by that type of
curve?
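
For reference, a sketch of the idf term as it is commonly documented for
Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)). The Python
function name and the sample numbers are made up for illustration, and the
formula should be checked against the Similarity class actually in use.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf = 1 + ln(num_docs / (doc_freq + 1)); assumed DefaultSimilarity form."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index of 1,000,000 docs: rarer terms get a higher idf
for doc_freq in (9, 999, 99999):
    print(doc_freq, round(classic_idf(doc_freq, 1000000), 3))
```

Under this form, if document frequency fell off as a power of rank, idf would
grow roughly linearly with the log of the rank, so idf can be read as a
log-scale view of the same curve.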

And *if* word occurrences, in either searches or source documents, were
expected to follow a power law distribution, then possibly shingles would
follow such a curve as well.

Thinking that document text, like many other things in nature, might follow
such a curve, I used the Lucene index to generate one from the single-word
tokens. And I did the same thing for 3 word shingles. The 2 curves do have
different slopes, but neither is very straight.

So I was wondering if anybody else has looked at IDF curves (actually
non-inverted document frequency curves) or raw word instance counts and
power law graphs?  I haven't found a smoking gun in my online searches, but
I'm thinking some of you would know this.


--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
