lucene-solr-user mailing list archives

From Don Clore <don.cl...@5to1.com>
Subject expand synonyms without tokenizing stream?
Date Wed, 08 Jul 2009 17:09:10 GMT
I'm pretty new to solr; my apologies if this is a naive question, and my
apologies for the verbosity:
I'd like to take keywords in my documents, and expand them as synonyms; for
example, if the document gets annotated with a keyword of 'sf', I'd like
that to expand to 'San Francisco'.  (San Francisco,San Fran,SF is a line in
my synonyms.txt file).

But I also want to be able to display facets with counts for these keywords;
I'd like them to be suitable for display.

So, if I define the keywords field as 'text', I use the following pipeline
(from my schema.xml):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


Faceting on this field, I get the following counts (when I query
specifically for the single document in question):

      <lst name="Keywords">
        <int name="fran">1</int>
        <int name="francisco">1</int>
        <int name="san">1</int>
        <int name="sf">1</int>
      </lst>

I've also set up a copyField to a 'KeywordsString' field, which is
defined as "string", i.e.:

<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>

Faceting on *that* field (when querying for just this 1 document,
which has a keyword of 'sf'), results in:

      <lst name="KeywordsString">
        <int name="sf">1</int>
      </lst>

I guess what I'd like is the ability to stamp documents with keywords
like 'sf', 'san fran', 'san francisco', and 'mlb' (with a synonyms.txt
entry of mlb => Major League Baseball), and have all the documents
tagged with those synonym variants come back as:

      <lst name="KeywordsString">
        <int name="San Francisco">1</int>
        <int name="Major League Baseball">1</int>
      </lst>
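
To pin down the behavior I'm after, here is a rough Python sketch of the
mapping, done outside Solr. It is only an illustration of the desired
result, not a Solr configuration. The names (SYNONYM_GROUPS, build_lookup,
canonical_keyword, facet_counts) are made up, and treating the first entry
of each group as the display form is my own assumption:

```python
# Hypothetical stand-in for synonyms.txt: each group lists its variants.
# Assumption: the first entry of a group is the form to show in facets.
SYNONYM_GROUPS = [
    ["San Francisco", "San Fran", "SF"],
    ["Major League Baseball", "mlb"],
]


def build_lookup(groups):
    """Map every variant (lowercased, whitespace collapsed) to its display form."""
    lookup = {}
    for group in groups:
        display = group[0]
        for variant in group:
            lookup[" ".join(variant.lower().split())] = display
    return lookup


def canonical_keyword(keyword, lookup):
    """Look up the whole keyword as one unit; it is never split into words."""
    key = " ".join(keyword.lower().split())
    # Keywords with no synonym entry are returned unchanged.
    return lookup.get(key, keyword)


def facet_counts(docs_keywords, lookup):
    """Count documents per display form, at most once per document."""
    counts = {}
    for keywords in docs_keywords:
        for form in {canonical_keyword(k, lookup) for k in keywords}:
            counts[form] = counts.get(form, 0) + 1
    return counts


LOOKUP = build_lookup(SYNONYM_GROUPS)
```

With this, a single document carrying the keywords 'sf' and 'mlb' would
count once under 'San Francisco' and once under 'Major League Baseball',
which is the facet output I'd like Solr to give me.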


But I don't know how to define a processing pipeline that expands
synonyms without tokenizing them; tokenizing breaks 'San Francisco' into
'san' and 'francisco' and presents those as separate facets.

Thanks for any help,

Don
