lucene-solr-user mailing list archives

From Bob Sandiford <bob.sandif...@sirsidynix.com>
Subject Determining Facet Values that match the Search Term(s) - suggestions?
Date Fri, 29 Apr 2011 19:04:57 GMT
Hi, all.

We've indexed various types of documents. One of the fields we have is Author; we are already
able to use it as a facet, choose one of its values, and narrow the results further by that value.

Now we've been given a use case that runs something like this as an example:

1)      Choose the 'Author Alphabetical' search and enter one or more search terms, for example 'Steel'

2)      Have a list of the Authors matching 'Steel' come up, each with a count of the number of
associated documents

3)      The user chooses one of those entries and then gets the document results where that Author
is present.

So - it's 'sideways' in that we essentially want to present the facets first, with no results,
let the user choose a facet, and then show the results.  And the facets we show have to match
the search term(s) entered in some fashion (based on our Analysis chain or on fuzzy search).

So - I know how to get a list of all the Author Facet values for documents where 'Steel' matches
in the Author field.  The problem is that Author is a multi-valued field, so the facet returns
not only the values that match 'Steel' but also every other Author value on those same
documents.
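
To make the problem concrete, here is a minimal Python sketch (plain Python, not Solr) that mimics faceting over a multi-valued field. The four documents, the `author` key, and the `facet_on_matches` helper are made up for illustration, and a case-insensitive substring test stands in for the real analysis chain.

```python
from collections import Counter

# Hypothetical mini-index: each document carries a multi-valued author list.
docs = [
    {"id": 1, "author": ["Steel, Danielle", "Smith, John"]},
    {"id": 2, "author": ["Steel, Danielle"]},
    {"id": 3, "author": ["Jones, Mary", "Steele, Richard"]},
    {"id": 4, "author": ["Brown, Anne"]},
]

def facet_on_matches(docs, term):
    """Count every author value on documents where any author contains term."""
    term = term.lower()
    counts = Counter()
    for doc in docs:
        # Substring test is an assumption standing in for the analyzed match.
        if any(term in value.lower() for value in doc["author"]):
            # The facet counts all values on the matching document,
            # not just the value that actually matched.
            counts.update(doc["author"])
    return dict(counts)

facets = facet_on_matches(docs, "steel")
print(facets)
```

In this toy data, "Smith, John" and "Jones, Mary" show up in the facet list even though neither contains 'Steel'; they only share a document with an author that does.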

I've come up with a really ugly approach that should work most of the time, but I'm hoping
someone has a better idea here...

I've read through the Facet Parameters and searched various other places, but haven't come
across anything like this.  (I can't use facet.prefix because I'm not looking for facets
that begin with the search term(s); I'm looking for facets that contain the search term(s).
They could show up anywhere in the value and, with the fuzzy handling, may not be exact matches anyway.)

Suggestions?



============================================================
For those masochists who want to know the approach I've come up with:

Search is something like this:

http://localhost:8983/solr/SD_ILS/select/?start=0&rows=5000&fl=JUNK&qt=standard&q=AUTHOR_boost:"mark twain"~30&facet=true&facet.mincount=1&facet.sort=index&facet.limit=-1&hl=true&hl.fl=AUTHOR_boost&hl.mergeContiguous=true&hl.snippets=5&facet.field=AUTHOR_facet&sort=id asc

Explanation:

1)      Searching for documents with the AUTHOR_boost field (our internal 'Author' field)
with search term "mark twain" with a proximity distance of "30" (somewhat arbitrary).

2)      Return facet values for the AUTHOR_facet field that have at least one document (AUTHOR_facet
has the same original content as AUTHOR_boost - it is just 'string' instead of 'text' to bypass
analysis)

3)      Return up to 5000 hits (this is one of the really kludgy bits), hoping that's enough
to span all the documents that would include "mark twain".  However, set "fl" (fields
to return) to a field that never exists, so we only get back empty <doc /> elements
in the XML.

4)      Also do highlighting on the AUTHOR_boost field, which tells us which value(s) the search
terms were found in

5)      Sort by the document id - just as a kind of random sort to try to get as many distinct
highlighting results as possible (i.e. we don't want any score-based ordering, which would cluster
the highlight values)
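
As a rough sketch, the request described in steps 1 to 5 could be assembled in Python like this. The helper name `build_author_facet_url` and its defaults are hypothetical; the base URL, field names, and parameter values are taken from the example request, and `hl.mergeContiguous` is Solr's documented spelling of the merge option.

```python
from urllib.parse import urlencode

# Base URL copied from the example request; host and core will differ per setup.
BASE_URL = "http://localhost:8983/solr/SD_ILS/select/"

def build_author_facet_url(terms, slop=30, rows=5000):
    """Build the select URL for the 'sideways' author search (hypothetical helper)."""
    params = [
        ("start", 0),
        ("rows", rows),                  # step 3: hope this spans all matching hits
        ("fl", "JUNK"),                  # step 3: non-existent field, so docs come back empty
        ("qt", "standard"),
        ("q", f'AUTHOR_boost:"{terms}"~{slop}'),  # step 1: proximity search
        ("facet", "true"),
        ("facet.mincount", 1),           # step 2: only values with at least one document
        ("facet.sort", "index"),
        ("facet.limit", -1),
        ("facet.field", "AUTHOR_facet"),
        ("hl", "true"),                  # step 4: highlight the analyzed author field
        ("hl.fl", "AUTHOR_boost"),
        ("hl.mergeContiguous", "true"),
        ("hl.snippets", 5),
        ("sort", "id asc"),              # step 5: avoid score-based clustering
    ]
    # urlencode percent-encodes the quotes and colon and turns spaces into '+'.
    return BASE_URL + "?" + urlencode(params)

url = build_author_facet_url("mark twain")
print(url)
```

Building the query string with `urlencode` also takes care of the spaces in "mark twain" and "id asc", which are easy to get wrong when the URL is pasted by hand.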

Do some post processing:

6)      Build a set of Strings from the highlighting results - removing the highlight <em>
and </em> elements.  Intent is that this becomes the set of 'mark twain' type Strings.

7)      Chug through the facet_fields list for AUTHOR_facet and keep only those entries which have
a match in the set of strings built from the highlighting results.

8)      Present that result back to the users along with the counts from the facet...

Really ugly.  But - will usually work...
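
A minimal Python sketch of post-processing steps 6 to 8, assuming the response has already been parsed into two plain dicts (facet value to count, and document id to highlighted snippets). The function names and the dict shapes are assumptions for illustration, and the sketch assumes each highlight snippet is the complete field value rather than a fragment of it.

```python
import re

EM_TAG = re.compile(r"</?em>")

def strip_em(snippet):
    """Step 6: remove the <em> and </em> highlight markers from one snippet."""
    return EM_TAG.sub("", snippet)

def matching_facets(facet_counts, highlighting, field="AUTHOR_boost"):
    """Steps 6-8: keep only facet values that also appear as a highlighted value.

    facet_counts: dict of facet value -> document count
    highlighting: dict of document id -> {field name: [snippets]}
    """
    matched = {
        strip_em(snippet)
        for per_doc in highlighting.values()
        for snippet in per_doc.get(field, [])
    }
    # Step 7: preserve facet order, drop values never seen in the highlights.
    return {value: count for value, count in facet_counts.items() if value in matched}

# Sample data copied from the response excerpt in this message.
facet_counts = {
    "Adams, Joseph.": 1,
    "Kern, Jerome, 1885-1945. Mark Twain suite.": 1,
    "Twain, Mark, 1835-1910": 212,
    "Twain, Mark, 1835-1910, Contributor": 3,
    "Twain, Mark, 1835-1910.": 244,
}
highlighting = {
    "ent://SD_ILS/0/SD_ILS:331": {
        "AUTHOR_boost": ["<em>Twain</em>, <em>Mark</em>, 1835-1910"]},
    "ent://SD_ILS/104/SD_ILS:104542": {
        "AUTHOR_boost": ["<em>Twain</em>, <em>Mark</em>, 1835-1910."]},
    "ent://SD_ILS/11/SD_ILS:11485": {
        "AUTHOR_boost": ["<em>Twain</em>, <em>Mark</em>, 1835-1910, Contributor"]},
    "ent://SD_ILS/482/SD_ILS:482038": {
        "AUTHOR_boost": ["Kern, Jerome, 1885-1945. <em>Mark</em> <em>Twain</em> suite."]},
}

result = matching_facets(facet_counts, highlighting)
print(result)
```

The match is an exact string comparison, so it only works because AUTHOR_facet and AUTHOR_boost hold the same original content; any value whose documents fall outside the returned rows is silently dropped.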

To help visualize this, here are some excerpts of the response:

<response>
     <result name="response" numFound="513" start="0">
       <doc />
       <doc />
       <doc />
       ...
       <doc />
     </result>
     <lst name="facet_counts">
       <lst name="facet_fields">
         <lst name="AUTHOR_facet">
           <int name="Adams, Joseph.">1</int>
           <int name="Addy, Wesley.">1</int>
           <int name="Albee, Josh.">1</int>
           <int name="Aldana, Raul.">1</int>
           ...
           <int name="Kern, Jerome, 1885-1945. Mark Twain suite.">1</int>
           ...
           <int name="Mark Twain Media.">1</int>
           ...
           <int name="Twain, Mark, 1835-1910">212</int>
           <int name="Twain, Mark, 1835-1910, Contributor">3</int>
           <int name="Twain, Mark, 1835-1910.">244</int>
           ...
         </lst>
       </lst>
     </lst>
     <lst name="highlighting">
       <lst name="ent://SD_ILS/0/SD_ILS:331">
         <arr name="AUTHOR_boost">
           <str><em>Twain</em>, <em>Mark</em>, 1835-1910</str>
         </arr>
       </lst>
       <lst name="ent://SD_ILS/0/SD_ILS:356">
         <arr name="AUTHOR_boost">
           <str><em>Twain</em>, <em>Mark</em>, 1835-1910</str>
         </arr>
       </lst>
       ...
       <lst name="ent://SD_ILS/104/SD_ILS:104542">
         <arr name="AUTHOR_boost">
           <str><em>Twain</em>, <em>Mark</em>, 1835-1910.</str>
         </arr>
       </lst>
       <lst name="ent://SD_ILS/11/SD_ILS:11485">
         <arr name="AUTHOR_boost">
           <str><em>Twain</em>, <em>Mark</em>, 1835-1910, Contributor</str>
         </arr>
       </lst>
       ...
       <lst name="ent://SD_ILS/482/SD_ILS:482038">
         <arr name="AUTHOR_boost">
           <str>Kern, Jerome, 1885-1945. <em>Mark</em> <em>Twain</em> suite.</str>
         </arr>
       </lst>
       ...
     </lst>
</response>
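
For completeness, a response shaped like the excerpt above could be read with Python's standard `xml.etree.ElementTree`, roughly as follows. The `parse_response` helper and the trimmed `SAMPLE` document are illustrative assumptions. The sketch handles the highlight markers both as nested `<em>` elements (as they appear in this excerpt) and as escaped text inside the `<str>` value.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed-down sample with the same structure as the excerpt above.
SAMPLE = """<response>
  <result name="response" numFound="513" start="0"><doc /></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="AUTHOR_facet">
        <int name="Mark Twain Media.">1</int>
        <int name="Twain, Mark, 1835-1910">212</int>
      </lst>
    </lst>
  </lst>
  <lst name="highlighting">
    <lst name="ent://SD_ILS/0/SD_ILS:331">
      <arr name="AUTHOR_boost">
        <str><em>Twain</em>, <em>Mark</em>, 1835-1910</str>
      </arr>
    </lst>
  </lst>
</response>"""

def parse_response(xml_text, facet_field="AUTHOR_facet", hl_field="AUTHOR_boost"):
    """Return (facet value -> count, set of highlighted values without <em>)."""
    root = ET.fromstring(xml_text)
    facet_path = (
        "./lst[@name='facet_counts']/lst[@name='facet_fields']"
        f"/lst[@name='{facet_field}']/int"
    )
    facets = {el.get("name"): int(el.text) for el in root.findall(facet_path)}

    hl_path = f"./lst[@name='highlighting']/lst/arr[@name='{hl_field}']/str"
    highlighted = set()
    for el in root.findall(hl_path):
        text = "".join(el.itertext())                     # flattens nested <em> elements
        highlighted.add(re.sub(r"</?em>", "", text))      # strips escaped <em> markers
    return facets, highlighted

facets, highlighted = parse_response(SAMPLE)
print(facets, highlighted)
```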


Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | Bob.Sandiford@sirsidynix.com
www.sirsidynix.com