lucene-dev mailing list archives

From Lance Norskog <>
Subject Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
Date Sat, 11 Sep 2010 21:23:55 GMT
This sounds like a Mahout problem, not a Lucene problem - there are some
text analysis tools that might help you.

mark harwood wrote:
> >> What is the "best practices" formula for determining above-average
> >> correlations of adjacent terms?
> I gave this some thought in 
> I found the Jaccard coefficient favoured rare words too strongly and
> so went for a blend as shown below:
>     // coIncidenceDocCount = number of docs where termA and termB co-occur;
>     // termADocFreq / termBDocFreq = document frequencies of the two terms.
>     public float getScore()
>     {
>         float overallIntersectionPercent = coIncidenceDocCount
>                 / (float) (termADocFreq + termBDocFreq);
>         float termBIntersectionPercent = coIncidenceDocCount
>                 / (float) termBDocFreq;
>         // Using just the termB intersection favours common words as
>         // coincidents, e.g. "new" food:
>         //     return termBIntersectionPercent;
>         // Using just the overall intersection favours rare words as
>         // coincidents, e.g. "Szechuan" food:
>         //     return overallIntersectionPercent;
>         // So here we take an average of the two:
>         return (termBIntersectionPercent + overallIntersectionPercent) / 2;
>     }
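> For concreteness, here's how that blend behaves on some made-up counts
> (all numbers below are illustrative, not from a real index):
>
>     // Hypothetical counts for the pair ("new", "york") in a 1M-doc index:
>     int termADocFreq = 50000;        // docs containing "new"
>     int termBDocFreq = 12000;        // docs containing "york"
>     int coIncidenceDocCount = 11000; // docs containing the pair
>
>     float overall = coIncidenceDocCount / (float) (termADocFreq + termBDocFreq); // ~0.18
>     float termB = coIncidenceDocCount / (float) termBDocFreq;                    // ~0.92
>     float blended = (termB + overall) / 2;                                       // ~0.55
>
> The rare-but-genuine combination scores well on both components, so the
> blend keeps it without letting either failure mode dominate.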
> ------------------------------------------------------------------------
> *From:* Mark Bennett <>
> *To:*
> *Sent:* Fri, 10 September, 2010 18:44:31
> *Subject:* Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves
> Thanks Mark H,
> Maybe I'll look at MLT (More Like This) again.  I'll also check out Zipf.
> It's claimed that Question and Answer wording is different enough from
> generic text content that different techniques might be indicated.
> From what I remember:
> 1: Though nouns normally convey 60% of relevancy in general text, Q&A 
> content is skewed a bit more towards verbs.
> 2: Questions may contain more noise words (though perhaps in useful 
> groupings)
> 3: Vocabulary mismatch of Interrogative vs. declarative / narrative (Q 
> vs A)
> 4: Vocabulary mismatch of novices vs experts (Q vs A)
> It was item 2 that I was hoping to capitalize on with NGrams / Shingles.
> Still waiting for the relevancy math nerds to chime in about the 
> log-log and IDF stuff ... ;-)
> I was thinking a bit more about the math involved here....
> What is the "best practices" formula for determining above-average
> correlations of adjacent terms, beyond what random chance would give?
> So you notice that "white" and "house" appear next to each other more 
> than what chance distribution would explain, so you decide it's an 
> important NGram.
> The "noise floor" isn't too bad for the typical shopping cart items 
> calculation.
> You analyze the items present or not present in 1,000 shopping cart 
> receipts.
>     If grocery items were completely independent, then the "random"
>     level is just the odds of the 2 items multiplied together:
>         1,000 shopping carts
>         200 have cereal
>         250 have milk
>     chance of
>         cereal = 200/1,000 = 20%
>         milk = 250/1,000 = 25%
>     IF independent then
>         P(cereal AND milk) = P(cereal) * P(milk)
>         20% * 25% = 5%
>         So 50 carts likely to have both cereal and milk
>         And if MORE than 50 carts have cereal and milk, then it's 
> worth noting.
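> A minimal sketch of that arithmetic in Java (observedBoth is a made-up
> observation, just to show the comparison):
>
>     int totalCarts = 1000;
>     int cerealCarts = 200;   // carts containing cereal
>     int milkCarts = 250;     // carts containing milk
>     int observedBoth = 80;   // carts containing both (hypothetical)
>
>     double pCereal = cerealCarts / (double) totalCarts;  // 0.20
>     double pMilk = milkCarts / (double) totalCarts;      // 0.25
>     double expectedBoth = pCereal * pMilk * totalCarts;  // 50 carts under independence
>
>     // Lift > 1 means the pair co-occurs more often than chance predicts.
>     double lift = observedBoth / expectedBoth;           // 1.6: worth noting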
> The classic example is diapers and beer, which is a bit apocryphal and 
> NOT expected, but I like the breakfast cereal and milk example better 
> because it IS expected.
> Now back to word-A appearing directly before word-B, and finding the 
> base level number of times you'd expect just from random chance.
> Although Lucene/Luke gives you total word instances and document 
> counts, what you'd really want is the number of possible N-Grams, 
> which is affected by document boundaries, so it gets a little weird.
> Some other differences between the word-A word-B calculation vs milk 
> and cereal:
> 1: I want ordered pairs, "white" before "house"
> 2: A document is NOT like a shopping cart in that I DO care how many 
> times "white" appears before "house", whereas in the shopping carts I 
> only cared about present or not present, so document count is less 
> helpful here.
> I'm sure some companies and PhDs have super-secret formulas for this,
> but I'd be content to just compare it to baseline random chance.
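> Something like this is the baseline I have in mind (plain Java; the
> counts would come from the index, and all numbers are hypothetical):
>
>     // Each doc of n tokens contributes n-1 adjacent ordered pairs, so the
>     // number of possible bigram slots is totalTokens - numDocs.
>     long totalTokens = 5000000L;
>     long numDocs = 10000L;
>     long bigramSlots = totalTokens - numDocs;
>
>     long countWhite = 4000L;   // instances of "white" (not doc counts)
>     long countHouse = 6000L;   // instances of "house"
>     long observedPairs = 900L; // times "white" appears directly before "house"
>
>     // Under independence, the chance a slot starts with "white" and is
>     // followed by "house" is roughly P("white") * P("house"):
>     double expected = (countWhite / (double) totalTokens)
>                     * (countHouse / (double) totalTokens)
>                     * bigramSlots;   // ~4.8 pairs expected by chance
>
>     double lift = observedPairs / expected; // ~188x chance: keep "white house"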
> Mark B
> --
> Mark Bennett / New Idea Engineering, Inc. / 
> <>
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> On Fri, Sep 10, 2010 at 3:17 AM, mark harwood <> wrote:
>     Hi Mark
>     I've played with Shingles recently in some auto-categorisation
>     work where my starting assumption was that multi-word terms hold
>     more information value than individual words, and that phrase
>     queries on separate terms will not give these term combos their
>     true reward (in terms of IDF) - or, if they did compute the true
>     IDF, would require lots of disk IO to do so. Shingles present a
>     conveniently pre-aggregated score for these combos.
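>     To make the IDF point concrete, a back-of-envelope comparison using
>     Lucene's classic idf formula, 1 + ln(numDocs/(docFreq+1)), on
>     made-up doc frequencies:
>
>         double n = 1000000;                                 // docs in index
>         double idfNew = 1 + Math.log(n / (100000 + 1));     // df("new") = 100k     -> ~3.3
>         double idfYork = 1 + Math.log(n / (20000 + 1));     // df("york") = 20k     -> ~4.9
>         double idfShingle = 1 + Math.log(n / (15000 + 1));  // df("new york") = 15k -> ~5.2
>         // A phrase query sums the unigram idfs (~8.2), which bears no
>         // necessary relation to how rare the combination really is; the
>         // shingle's df is measured directly, so its idf reflects the
>         // true rarity of the combo.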
>     Looking at the results of MoreLikeThis queries based on a
>     shingling analyzer, the results I saw generally seemed good,
>     though I did not formally benchmark this against non-shingled
>     indexes. Not everything was rosy: I did see some tendency to
>     over-reward certain rare shingles (e.g. a shared mention of "New
>     Years Eve Party" pulled otherwise mostly unrelated news articles
>     together). This led me to look at using the links in resulting
>     documents to help identify clusters of on-topic and potentially
>     off-topic results to tune these discrepancies out, but that's
>     another topic.
>     BTW, the Luke tool has a "Zipf" plugin that you may find useful in
>     examining term distributions in Lucene indexes.
>     Cheers
>     Mark
>     ------------------------------------------------------------------------
>     *From:* Mark Bennett <>
>     *To:* <>
>     *Sent:* Fri, 10 September, 2010 1:42:11
>     *Subject:* Relevancy, Phrase Boosting, Shingles and Long Tail Curves
>     I want to boost the relevancy of some Question and Answer content.
>     I'm using stop words, Dismax, and I'm already a fan of Phrase
>     Boosting and have cranked that up a bit. But I'm considering using
>     long Shingles to make use of some of the normally stopped out
>     "junk words" in the content to help relevancy further.
>     Reminder: "Shingles" are artificial tokens created by gluing
>     together adjacent words.
>         Input text: This is a sentence
>         Normal tokens: this, is, a, sentence  (without removing stop
>     words)
>         2+3 word shingles: this-is, is-a, a-sentence, this-is-a,
>     is-a-sentence
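>     For reference, a minimal sketch of producing these with Lucene's
>     ShingleFilter (newer-API constructors shown; older versions take a
>     Version/Reader, and the default separator is a space rather than
>     the hyphens used above):
>
>         import java.io.StringReader;
>         import org.apache.lucene.analysis.core.WhitespaceTokenizer;
>         import org.apache.lucene.analysis.shingle.ShingleFilter;
>         import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
>         WhitespaceTokenizer tok = new WhitespaceTokenizer();
>         tok.setReader(new StringReader("this is a sentence"));
>         ShingleFilter shingles = new ShingleFilter(tok, 2, 3); // 2- and 3-word shingles
>         shingles.setOutputUnigrams(false);                     // drop the single words
>         CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
>         shingles.reset();
>         while (shingles.incrementToken()) {
>             System.out.println(term); // "this is", "this is a", "is a", ...
>         }
>         shingles.end();
>         shingles.close();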
>     A few questions on relevance and shingles:
>     1: How do the relevancy calculations compare between Shingles and
>     exact phrases?
>     I've seen material saying that shingles can give better performance
>     than normal phrase searching, and I'm assuming this means exact
>     phrases (vs. allowing for phrase slop).
>     But do the relevancy calculations for normal exact phrases and
>     Shingles wind up being *identical*, for the same documents and
>     searches?  That would seem an unlikely coincidence, but possibly it
>     could have been engineered to intentionally behave that way.
>     2: What's the latest on Shingles and Dismax?
>     The low-level front-end tokenization in Dismax would seem to be a
>     problem, but does the new parser stuff help with this?
>     3: I'm thinking of a minimum 3-word shingle; does anybody have
>     comments on shingle length?
>     Eyeballing the 2-word shingles, they don't seem much better than
>     stop words.  Obviously my shingle field bypasses stop word removal.
>     But the 3 word shingles start to look more useful, expressing more
>     intent, such as "how do i", "do i need" and "it work with", etc.
>     Have there been any Lucene/Solr studies specifically on shingle
>     length?
>     and finally...
>     4: Is it useful to examine your token occurrences against a
>     Power-Law log-log curve?
>     So, with either single words or shingles, you build a histogram,
>     then plot it on an X-Y graph with both axes logarithmic, and see
>     whether the resulting graph follows (or diverges from) a straight
>     line.  This "Long Tail" / Pareto / power-law mathematics was very
>     popular a few years ago for looking at histograms of DVD rentals
>     and human activities, and prior to the web, the power law and 80/20
>     rules had been observed in many other situations, both man-made and
>     natural.
>     Also of interest, when a distribution is expected to follow a
>     power line, but the actual data deviates from that theoretical
>     line, then this might indicate some other factors at work, or so
>     the theory goes.
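>     A quick way to test that (a sketch; the frequencies would come from
>     dumping index terms, e.g. via Luke):
>
>         // Fit log(freq) = a + b*log(rank) by least squares. A slope b
>         // near -1 is the classic Zipf signature; large residuals flag
>         // where the data leaves the straight line.
>         static double fitLogLogSlope(long[] freqsSortedDesc) {
>             int n = freqsSortedDesc.length;
>             double sx = 0, sy = 0, sxx = 0, sxy = 0;
>             for (int i = 0; i < n; i++) {
>                 double x = Math.log(i + 1);              // log rank
>                 double y = Math.log(freqsSortedDesc[i]); // log frequency
>                 sx += x; sy += y; sxx += x * x; sxy += x * y;
>             }
>             return (n * sxy - sx * sy) / (n * sxx - sx * sx);
>         }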
>     So if users' searches follow any type of histogram with a hidden
>     power-law line, then it makes sense to me that the source content
>     might also follow a similar distribution.  Is the normal IDF
>     ranking inspired by that type of curve?
>     And *if* word occurrences, in either searches or source documents,
>     were expected to follow a power law distribution, then possibly
>     shingles would follow such a curve as well.
>     Thinking that document text, like many other things in nature,
>     might follow such a curve, I used the Lucene index to generate such
>     a curve for single-word tokens, and did the same thing for 3-word
>     shingles.  The 2 curves do have different slopes, but neither is
>     very straight.
>     So I was wondering if anybody else has looked at IDF curves
>     (actually non-inverted document frequency curves) or raw word
>     instance counts and power law graphs?  I haven't found a smoking
>     gun in my online searches, but I'm thinking some of you would know
>     this.
>     --
>     Mark Bennett / New Idea Engineering, Inc. /
>     <>
>     Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
