lucene-solr-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Using dismax to find multiple terms across multiple fields
Date Thu, 30 Nov 2006 20:02:05 GMT
On 11/30/06, Stephanie Belton <solr@zizou.net> wrote:
> I am using Solr to index and search documents in Russian. I have successfully set up
> the RussianAnalyzer but found that it eliminates some tokens such as numbers.

You can get better control (and avoid having numbers removed)
by using TokenFilters instead of analyzers.

You might be able to use the Porter stemmer for Russian (but I don't
know how it compares to the one you are using):

    <filter class="solr.SnowballPorterFilterFactory" language="Russian" />

Here is a portion of the code from RussianAnalyzer.java:
    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new RussianLetterTokenizer(reader, charset);
        result = new RussianLowerCaseFilter(result, charset);
        result = new StopFilter(result, stopSet);
        result = new RussianStemFilter(result, charset);
        return result;
    }

You could easily create FilterFactories for these Russian-specific
filters, and then gain the ability to use them just like the other
factories included in Solr.

It's probably the RussianLetterTokenizer that is throwing away numbers.
Assuming Russian uses normal whitespace, you might be able to use the
WhitespaceTokenizer instead.
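
To illustrate the difference, here is a rough sketch in plain JDK code. It is
not the Lucene tokenizers themselves: TokenizerSketch and its two methods are
hypothetical names, and the sketch simply assumes that a letter tokenizer ends
a token at any non-letter character while a whitespace tokenizer ends a token
only at whitespace. The Russian word is written with Unicode escapes so the
file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: plain JDK code, not the Lucene tokenizer classes.
class TokenizerSketch {

    // Letter-only splitting: any non-letter character ends the current token,
    // so a run of digits such as "1970" never becomes a token.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Whitespace splitting: only whitespace ends a token, so digits survive.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "1970" followed by the Russian word from the question, in Unicode escapes.
        String query = "1970 \u0422\u0430\u0448\u043A\u0435\u043D\u0442\u0430";
        System.out.println(letterTokens(query));      // one token: the word only
        System.out.println(whitespaceTokens(query));  // two tokens: "1970" and the word
    }
}
```

If the real tokenizer behaves like letterTokens, that would explain why 1970
never reaches the index for the stemmed fields.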


> I would also like the search to only return ads where every single term of the query
> was found across my 3 fields (title, body, location). I can't seem to get this to work. When
> I do a search for '1970', it works fine and returns 2 ads containing 1970. If I search for
> 'Ташкент' I get 3 results incl. one with Russian stemming (Ташкента). But when
> I do a search for '1970 Ташкента' it seems to ignore 1970 and give me the same results
> as only looking for 'Ташкент'. I got it to display the debug info and 1970 seems to
> be ignored in the matching:

You are including the Russian stemmed fields in the dismax query, and
the analysis of those fields discards numbers, hence 1970 is ignored,
right?  Either querying only the literals, or fixing the stemmed text
to not discard numbers may help (or get you further along at least).
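
A toy model of what the parsedquery in your debug output seems to show, in
plain JDK code. This is not Solr's query parser: DismaxClauseSketch, analyze,
clause and disjunction are hypothetical names, stemming and boosts are left
out, and "letters only" is just an assumed stand-in for the analysis of the
*_ru_RU fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: it mimics the shape of the parsedquery, not Solr's parser.
class DismaxClauseSketch {

    // Lowercase the text, optionally drop everything that is not a letter,
    // then split on whitespace.
    static List<String> analyze(String text, boolean lettersOnly) {
        String normalized = text.toLowerCase(Locale.ROOT);
        if (lettersOnly) {
            normalized = normalized.replaceAll("[^\\p{L}]+", " ");
        }
        List<String> tokens = new ArrayList<>();
        for (String part : normalized.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // One clause per field: a term if one token is left, a phrase if several,
    // and null if the analysis left nothing at all.
    static String clause(String field, List<String> tokens) {
        if (tokens.isEmpty()) {
            return null;
        }
        if (tokens.size() == 1) {
            return field + ":" + tokens.get(0);
        }
        return field + ":\"" + String.join(" ", tokens) + "\"";
    }

    // Collect the alternatives of the disjunction for one quoted query string.
    static List<String> disjunction(String query, String[] stemmedFields,
                                    String[] literalFields) {
        List<String> clauses = new ArrayList<>();
        for (String field : stemmedFields) {
            String c = clause(field, analyze(query, true));
            if (c != null) {
                clauses.add(c);
            }
        }
        for (String field : literalFields) {
            String c = clause(field, analyze(query, false));
            if (c != null) {
                clauses.add(c);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // "1970" followed by the Russian word, written with Unicode escapes.
        String query = "1970 \u0442\u0430\u0448\u043A\u0435\u043D\u0442";
        String[] stemmed = {"body_ru_RU", "title_ru_RU", "location_ru_RU"};
        String[] literal = {"body_literal", "title_literal", "location_literal"};
        System.out.println(String.join(" | ", disjunction(query, stemmed, literal)));
    }
}
```

Because the clauses are alternatives, a document that matches only the
single-term clause on a stemmed field still matches the whole query, which is
consistent with the explain output showing only location_ru_RU and body_ru_RU
weights.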


-Yonik


> <lst name="debug">
>  <str name="rawquerystring">"1970 Ташкент"</str>
>  <str name="querystring">"1970 Ташкент"</str>
>  <str name="parsedquery">+DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 |
> body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970
> ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01)
> DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100
> | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4
> | title_literal:"1970 ташкент"~100^1.5)~0.01)</str>
>  <str name="parsedquery_toString">+(body_ru_RU:ташкент^0.8 | body_literal:"1970
> ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5
> | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01 (body_ru_RU:ташкент^0.8
> | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | location_literal:"1970
> ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01</str>
>  <lst name="explain">
>   <str name="id=€#26;੥,internal_docid=4">
> 0.7263521 = (MATCH) sum of:
>   0.36317605 = (MATCH) max plus 0.01 times others of:
>     0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
>       0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
>         0.4 = boost
>         4.4965076 = idf(docFreq=2)
>         0.044906225 = queryNorm
>       4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
>         1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
>         4.4965076 = idf(docFreq=2)
>         1.0 = fieldNorm(field=location_ru_RU, doc=4)
>   0.36317605 = (MATCH) max plus 0.01 times others of:
>     0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
>       0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
>         0.4 = boost
>         4.4965076 = idf(docFreq=2)
>         0.044906225 = queryNorm
>       4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
>         1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
>         4.4965076 = idf(docFreq=2)
>         1.0 = fieldNorm(field=location_ru_RU, doc=4)
> </str>
>   <str name="id=€#26;ી,internal_docid=9">
> 0.7263521 = (MATCH) sum of:
>   0.36317605 = (MATCH) max plus 0.01 times others of:
>     0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
>       0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
>         0.4 = boost
>         4.4965076 = idf(docFreq=2)
>         0.044906225 = queryNorm
>       4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
>         1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
>         4.4965076 = idf(docFreq=2)
>         1.0 = fieldNorm(field=location_ru_RU, doc=9)
>   0.36317605 = (MATCH) max plus 0.01 times others of:
>     0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
>       0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
>         0.4 = boost
>         4.4965076 = idf(docFreq=2)
>         0.044906225 = queryNorm
>       4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
>         1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
>         4.4965076 = idf(docFreq=2)
>         1.0 = fieldNorm(field=location_ru_RU, doc=9)
> </str>
>   <str name="id=€#26;੕,internal_docid=2">
> 0.43162674 = (MATCH) sum of:
>   0.21581337 = (MATCH) max plus 0.01 times others of:
>     0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
>       0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
>         0.8 = boost
>         4.901973 = idf(docFreq=1)
>         0.044906225 = queryNorm
>       1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
>         1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
>         4.901973 = idf(docFreq=1)
>         0.25 = fieldNorm(field=body_ru_RU, doc=2)
>   0.21581337 = (MATCH) max plus 0.01 times others of:
>     0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
>       0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
>         0.8 = boost
>         4.901973 = idf(docFreq=1)
>         0.044906225 = queryNorm
>       1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
>         1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
>         4.901973 = idf(docFreq=1)
>         0.25 = fieldNorm(field=body_ru_RU, doc=2)
> </str>
>  </lst>