lucene-dev mailing list archives

From mark harwood <markharw...@yahoo.co.uk>
Subject Highlighting - catering for all query types
Date Mon, 19 Oct 2009 10:29:52 GMT
I've been putting together some code to support highlighting of opaque query clauses (cached
filters, trie range, spatial, etc.), and it shows some promise.

This is not intended as a replacement for the existing highlighter(s), which deal with free text;
instead it concentrates on the hard-to-highlight clauses and has the benefit of working
in-line with the query process.
Summarisation is not a requirement here - I simply need to know whether a given query clause
matched a given result.

The approach I have come up with is to wrap query clauses with lightweight (processing and
RAM-wise) instrumenting objects in order to record which clauses matched.
The recorded matches are encoded as a byte in the document score which unfortunately requires
some loss of precision in the scores - more on this later.

The general approach for use looks like this:

        //Wrap *any* type of query object for highlight flagging and allocate a flag number
        //between 1 and 8 for the clauses of interest....
        FlagRecordingQuery frqA=new FlagRecordingQuery(new TermQuery(new Term("statusField","published")),1);
        FlagRecordingQuery frqB=new FlagRecordingQuery(new XyzLtd3rdPartyQuery("imageDataField",
                "unknown magic to find 'sunset'"),2);

        BooleanQuery bq=new BooleanQuery();
        bq.add(new BooleanClause(frqA,Occur.SHOULD));
        bq.add(new BooleanClause(frqB,Occur.SHOULD));

        //Parent query must be a FlagCombiningQuery to encode child match info in the doc
        //scores
        FlagCombiningQuery fcq=new FlagCombiningQuery(bq);

        //Run search
        TopDocs td = s.search(fcq,10);
        ScoreDoc[] sd = td.scoreDocs;
        for (ScoreDoc scoreDoc : sd)
        {
            float score=scoreDoc.score;

            //Check to see which flags are encoded in the score.
            if(FlagCombiningQuery.hasFlag(1, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 1 ");
            }
            if(FlagCombiningQuery.hasFlag(2, score))
            {
                System.out.println("woot! "+scoreDoc.doc+" matched clause 2 ");
            }
        }


The FlagRecordingQuery child clauses introduce themselves to the FlagCombiningQuery through
a thread local at "rewrite" time.
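
A minimal sketch of how that thread-local handshake could look. RecordedClause and FlagRegistry
are hypothetical stand-ins invented for illustration, not the actual FlagRecordingQuery or
FlagCombiningQuery code, and the drain-on-read behaviour is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a flag-recording clause (not the real FlagRecordingQuery).
final class RecordedClause {
    final int flag;

    RecordedClause(int flag) {
        this.flag = flag;
    }
}

// Hypothetical registry: each thread collects the clauses it meets while rewriting.
final class FlagRegistry {
    private static final ThreadLocal<List<RecordedClause>> CLAUSES =
            ThreadLocal.withInitial(ArrayList::new);

    // Called by a child clause when it is rewritten.
    static void register(RecordedClause clause) {
        CLAUSES.get().add(clause);
    }

    // Called by the root query after rewriting; clears the per-thread list so
    // the next search on this thread starts from an empty registry.
    static List<RecordedClause> drain() {
        List<RecordedClause> seen = new ArrayList<>(CLAUSES.get());
        CLAUSES.remove();
        return seen;
    }
}
```

Because the list is per thread, concurrent searches on different threads would not see each
other's clauses in this sketch.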
The FlagCombiningQuery at the root adjusts the scores as follows:

        static final float DEFAULT_MULTIPLIER=1000f;
        float multiplier=DEFAULT_MULTIPLIER;
    ....
        public float score() throws IOException
        {
            float score = delegateScorer.score();
            byte flags=0;
            int d=doc();
            //encode all matched child clauses into a "flags" byte.
            for (FlagRecordingQuery frq : thisThreadsFlags)
            {
                if(frq.matched(d))
                {
                    byte mask=flagMasks[frq.flag-1];
                    flags=setFlag(flags, mask);
                }
            }

            //Scale the score so that enough fractional digits survive the cast to int.
            int shiftedI=(int) (score*multiplier);
            //Shift int to make space for byte holding flags
            int iPlusSpaceForByte=shiftedI<<8;
            //Add match flags
            int iCombinedScoreAndFlags=iPlusSpaceForByte|flags;
            System.out.println("combined score="+iCombinedScoreAndFlags+" for doc#"+doc());
            return iCombinedScoreAndFlags;
        }
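
To make the round trip concrete, here is a self-contained sketch of the encode and decode steps.
FlagCodec and its methods are hypothetical; the hasFlag shown is one plausible implementation,
not the real FlagCombiningQuery.hasFlag. One assumption worth flagging: a Java float represents
every integer exactly only up to 2^24, so this sketch rejects combined values beyond that bound
rather than risk rounding in the low (flag) bits.

```java
// Hypothetical helper mirroring the bit layout above; not the original code.
final class FlagCodec {
    // Every int below this magnitude converts to float without rounding.
    static final int FLOAT_EXACT_LIMIT = 1 << 24;

    // Pack a scaled score and a flags byte into one float-valued score.
    static float encode(float score, float multiplier, int flags) {
        int scaled = (int) (score * multiplier);
        // Assumption of this sketch: refuse values whose low bits a float could round away.
        if (scaled < 0 || scaled >= (FLOAT_EXACT_LIMIT >> 8)) {
            throw new IllegalArgumentException("score out of encodable range: " + score);
        }
        return (scaled << 8) | (flags & 0xFF);
    }

    // True if the 1-based flag number (1..8) is set in the encoded score.
    static boolean hasFlag(int flag, float encoded) {
        return (((int) encoded) & (1 << (flag - 1))) != 0;
    }

    // Recover the truncated original score.
    static float decodeScore(float encoded, float multiplier) {
        return (((int) encoded) >> 8) / multiplier;
    }

    public static void main(String[] args) {
        float encoded = encode(1.2345f, 1000f, 0b0000_0011); // flags 1 and 2 set
        System.out.println("flag 1: " + hasFlag(1, encoded));
        System.out.println("flag 3: " + hasFlag(3, encoded));
        System.out.println("score:  " + decodeScore(encoded, 1000f));
    }
}
```

The decoded score is the original truncated to the digits kept by the multiplier, which is the
loss of precision mentioned earlier.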

The mechanism works but relies on original score values that:
a) Are not too big, i.e. do not lose significant digits when multiplied by "multiplier" and
then shifted left 8 bits.
b) Are not too similar, i.e. do not differ only in very small fractions (e.g. all scores falling
in the range 0.1234 to 0.1235).
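
Restriction (b) can be illustrated with a small sketch. ScoreQuantizer is a hypothetical name;
the truncation is the same cast used in score() above. With a multiplier of 1000, two scores
that differ only in the fourth decimal place end up as the same integer and can no longer be
ranked against each other:

```java
// Hypothetical helper showing how scaling plus truncation merges close scores.
final class ScoreQuantizer {
    // Same truncation as the scorer: digits beyond the multiplier's precision are dropped.
    static int quantize(float score, float multiplier) {
        return (int) (score * multiplier);
    }

    public static void main(String[] args) {
        // Both collapse to the same value with a multiplier of 1000.
        System.out.println(quantize(0.1234f, 1000f));
        System.out.println(quantize(0.1235f, 1000f));
        // A score further away stays distinguishable.
        System.out.println(quantize(0.1534f, 1000f));
    }
}
```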

To give an indication of the restrictions this imposes, here are the usable score ranges for
various settings of "multiplier":

multiplier   max score   fraction precision
==========   =========   ==================
10           838860      0.x
100          83886       0.xx
1000         8388        0.xxx
10000        838         0.xxxx
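
The "max score" figures look like 2^23 divided by the multiplier, i.e. the 23 bits that remain
in a non-negative 32-bit int once 8 bits are reserved for flags. That reading is an inference
from the numbers, not something stated above, and ScoreRange is a hypothetical name. The sketch
covers the int-overflow bound only:

```java
// Hypothetical helper reproducing the "max score" column from the bit budget.
final class ScoreRange {
    // 31 value bits in a non-negative int, minus 8 reserved for the flags byte.
    static final int SCORE_BITS = 31 - 8;

    // Largest whole-number score whose scaled value still fits in SCORE_BITS bits.
    static int maxScore(int multiplier) {
        return (1 << SCORE_BITS) / multiplier;
    }

    public static void main(String[] args) {
        for (int multiplier : new int[] {10, 100, 1000, 10000}) {
            System.out.println(multiplier + " -> " + maxScore(multiplier));
        }
    }
}
```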

I would imagine the majority of Lucene query results would still rank sensibly with a 1,000
or 10,000 multiplier.

However, all this potentially dangerous bit twiddling could of course be avoided if the Lucene
search API were expanded to include docid, score AND a completely separate field for recording
match flags.
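
As a rough idea of what such a result record might carry, here is a sketch. FlaggedHit is purely
hypothetical and not part of any Lucene API; it only shows the score staying untouched while the
flags travel alongside it:

```java
// Hypothetical result type: docid, unmodified score, and a separate flags byte.
final class FlaggedHit {
    final int doc;
    final float score;  // relevance score, left exactly as the scorer produced it
    final byte flags;   // bit (n-1) is set when clause n matched

    FlaggedHit(int doc, float score, byte flags) {
        this.doc = doc;
        this.score = score;
        this.flags = flags;
    }

    // True if the 1-based flag number (1..8) was recorded for this hit.
    boolean hasFlag(int flag) {
        return (flags & (1 << (flag - 1))) != 0;
    }
}
```

With a shape like this no multiplier would be needed, and neither restriction on score values
would apply.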


Thoughts?


      
