lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Search for ISBN-like identifiers
Date Thu, 05 Jan 2017 17:38:47 GMT
bq: How does the left side correlate with the right side?...

You've got it right: the left is the index-time analysis and the right
is the query-time analysis.

bq: the contents I see In the column text represents the _stored_
value of the field text, right...

Correct

bq: ...are only the tokenized values stored for search....

I'll be a bit pedantic here since "stored" is overloaded ;)...

The _indexed_ tokens, i.e. the tokens you search against are all
that's searchable. For instance let's say you have "running" in your
text and are stemming. "run" is all that gets into the searchable
portion of your index.

There's no really convenient way to find the tokens associated with a
doc; the inverted index structure doesn't lend itself well to
reconstructing a doc that way. Luke _can_ do this. It's a lossy
process, as you'll see. It can also be quite lengthy.

bq: One more thing which confuses me:

Oh boy. All I can offer here is it's less confusing than it was in
"the bad old days". Wildcards are tricky to handle. Here's a writeup:
https://lucidworks.com/blog/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

The short form is that wildcards are handled "specially" and much of
the analysis chain will be skipped; exactly which parts are skipped
depends on the particular class. Your trailing wildcard example makes
sense to a human, but it turns out to be hard to generalize.
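
To make that concrete, here is a rough Python model of what you're
seeing. It is a simplified sketch, not Lucene's actual code: analyze()
and prefix_query_matches() are made-up helpers, and the regex split only
approximates what StandardTokenizer does to this particular value.

```python
import re

def analyze(text):
    """Rough stand-in for StandardTokenizer + LowerCaseFilter (assumption):
    split on anything that is not a letter or digit, then lowercase."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]

# Index time: both stored forms of the ISBN go through analysis.
# The hyphenated form becomes five separate terms; the compact form stays whole.
indexed_terms = set(analyze("978-3-8052-5094-8")) | set(analyze("9783805250948"))

def prefix_query_matches(raw_query, terms):
    """Model of a trailing-wildcard query: the text before '*' is NOT
    tokenized; it is compared as a single prefix against each indexed term."""
    prefix = raw_query.rstrip("*").lower()
    return any(term.startswith(prefix) for term in terms)

print(sorted(indexed_terms))
print(prefix_query_matches("978-3-8052-5094-8*", indexed_terms))  # False
print(prefix_query_matches("9783805250948*", indexed_terms))      # True
```

No single indexed term starts with "978-3-8052-5094-8", so the
hyphenated wildcard query finds nothing, while the compact form matches
the one term that was indexed whole.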

Two possibilities for you to consider, especially since ISBNs are regular:
1> WordDelimiterFilterFactory is designed for this kind of thing. You
can do things like "catenateNumbers" so what'd be searchable would be
both "978-3-8052-5094-8" and "9783805250948".

2> do the above yourself in the ETL process. Then just use a
multiValued String field.
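
For 2>, the ETL step could look something like the sketch below. The
helper name isbn_variants and the field name isbn_ss are made up for
illustration; use whatever multiValued string field your schema defines.

```python
def isbn_variants(raw):
    """Return the forms of an ISBN to index: as given, plus a compact
    form with hyphens and spaces removed (if that differs)."""
    cleaned = raw.strip()
    compact = cleaned.replace("-", "").replace(" ", "")
    variants = [cleaned]
    if compact != cleaned:
        variants.append(compact)
    return variants

# Hypothetical document as it might be handed to the indexing client.
doc = {
    "id": "book-1",
    "isbn_ss": isbn_variants("978-3-8052-5094-8"),  # multiValued string field (assumed name)
}
print(doc["isbn_ss"])
```

Since a string field is not tokenized, each variant is indexed as one
term, so a trailing wildcard on either form has a whole term to match.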

Best,
Erick

On Thu, Jan 5, 2017 at 2:08 AM, Sebastian Riemer <s.riemer@littera.eu> wrote:
> Hi folks,
>
>
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general
> text field, or to configure the analyser on that field, so that a
> search for the hyphenated ISBN returns exactly the matching document?
>
> Long version:
> I've defined a field "text" of type "text_general", to which I copy all
> my other fields, so that I can do a "quick search" where I set q=text
>
> The definition of the type text_general is like this:
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I now face the problem that searching for a book with
> text:978-3-8052-5094-8* does not return the single result I expect.
> However, searching for text:9783805250948* instead returns a result.
> Note that I am adding a wildcard at the end automatically, to further
> broaden the result set. Note also that it does not seem to matter
> whether I put backslashes in front of the hyphens or not (to be exact,
> when sending via SolrJ from my application, I put in the backslashes,
> but I don't see a difference when using SolrAdmin, as I guess SolrAdmin
> automatically inserts backslashes if needed?)
>
> When storing ISBNs, I store them twice, once with hyphens
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase
> search on either of those values also returns the single document.
>
> I learned that the StandardTokenizer splits up values from fields at
> index time, and I've also learned that I can use the SolrAdmin analysis
> screen and debugQuery to help understand what is going on. From the
> analysis screen I see that the value 9783805250948 at index time and
> 9783805250948* at query time both lead to an unchanged value
> 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and
> 978-3-8052-5094-8* for "Field Value (Query)", I can see how the ISBN is
> tokenized into 5 parts. Again, the values match on both sides (Index
> and Query).
>
> How does the left side correlate with the right side? My guess: the
> left side means, "Values stored in field text will be tokenized while
> indexing as shown here on the left". The right side means, "When
> querying on the field text, I'll tokenize the entered value like this,
> and see if I find something in the index". Is this correct?
>
> Another question: when querying and investigating the single document
> in SolrAdmin, the contents I see in the column text represent the
> _stored_ value of the field text, right?
> And am I correct that this has nothing to do with what is actually
> stored in the index for searching?
>
> When storing the value 978-3-8052-5094-8, are only the tokenized values
> stored for search, or is the "whole word" also stored? Is there a way
> to actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I
> guess the value as a whole must also be stored in the index for
> searching?
>
> One more thing which confuses me:
> Searching for text: 978-3-8052-5094-8 gives me 72 results, because it
> leads to searching for
> "parsedquery_toString":"text:978 text:3 text:8052 text:5094 text:8",
> but searching for text: 978-3-8052-5094-8* gives me 0 results; this
> leads to "parsedquery_toString":"text:978-3-8052-5094-8*",
>
> Why is the appended wildcard changing the behaviour so radically? I'd
> rather expect to get something like
> "parsedquery_toString":"text:978 text:3 text:8052 text:5094 text:8*",
> and thus even more results.
>
> Btw., I've found and read an interesting blog post about storing ISBNs
> and the like here:
> http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
> However, I already store my ISBN in a separate field of type string as
> well, which works fine when I use this field for searching.
>
> Best regards, sorry for the enormously long question and thank you for listening.
>
> Sebastian
