lucene-solr-user mailing list archives

From Josh Lincoln <josh.linc...@gmail.com>
Subject Re: Search for ISBN-like identifiers
Date Thu, 05 Jan 2017 18:57:00 GMT
Sebastian,
You may want to try adding autoGeneratePhraseQueries="true" to the
fieldtype.
With that setting, a query for 978-3-8052-5094-8 will behave just like "978
3 8052 5094 8" (with the quotes)
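
As a rough sketch of where the attribute goes, this is the text_general definition from the original question (quoted further down) with only autoGeneratePhraseQueries added. Everything else is copied as posted, and I have not tested it against your setup:

```xml
<!-- Sketch only: the fieldType from the question, plus autoGeneratePhraseQueries="true" -->
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With debugQuery on, the parsed query for 978-3-8052-5094-8 should then show up as a phrase query rather than five separate term clauses.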

A few notes about autoGeneratePhraseQueries:
a) it used to be set to true by default, but that was changed several years
ago
b) it does NOT require a reindex, so it is very easy to test
c) it is apparently not recommended for non-whitespace-delimited languages (CJK,
etc.), but maybe that's not an issue in your use case
d) I'm unsure how it'll impact wildcard queries on that field. E.g. will
978-3-8052* match 978-3-8052-5094-8? At the very least, partial ISBNs (e.g.
978-3-8052) would match the full ISBN without needing the wildcard. I'm
just not sure what happens if the user includes the wildcard.

Josh

On Thu, Jan 5, 2017 at 1:41 PM Sebastian Riemer <s.riemer@littera.eu> wrote:

> Thank you very much for taking the time to help me!
>
> I'll definitely have a look at the link you've posted.
>
> @ShawnHeisey Thanks too for shedding light on the wildcard behaviour!
>
> Allow me one further question:
> - Assuming that I define a separate field for storing the ISBNs, using the
> awesome analyzer provided by Mr. Bill Dueber. How do I get that field
> copied into my general text field, which is used by my QuickSearch-Input?
> Won't that field be processed again by the analyser defined on the text
> field?
> - Should I alternatively add more fields to the q parameter? So far I
> have always set q=text:<whatever_I_want_to_search_here>, but I guess one
> could try something like
> q=text:<whatever_i_want_to_search>+isbnspeciallookupfield:<whatever_i_want_to_search>
>
> I don't really know about that last idea though, since the searches are
> probably OR-combined, which is not what I want.
>
> A third option would of course be to decide in my application which field
> to query in Solr. I.e. everything that matches a regex for only numbers
> and hyphens with length 13 -> don't query on field text, instead use
> field isbnspeciallookupfield
>
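
(Inline note on the copy question above, as a minimal sketch. It assumes the dedicated field is called isbnspeciallookupfield, as in the example query; adjust to your real field name. A copyField hands over the original input value, not the tokens produced by the source field's analyzer, so the copy is indeed analyzed again by whatever chain is defined on text.)

```xml
<!-- Sketch only: copies the raw ISBN input into the catch-all field.
     The destination field "text" applies its own text_general analysis,
     independent of the analyzer on isbnspeciallookupfield. -->
<copyField source="isbnspeciallookupfield" dest="text"/>
```
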
>
> Many thanks again, and have a nice day!
> Sebastian
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik.hatcher@gmail.com]
> Sent: Thursday, January 5, 2017 19:10
> To: solr-user@lucene.apache.org
> Subject: Re: Search for ISBN-like identifiers
>
> Sebastian -
>
> There’s some precedent out there for ISBNs.  Bill Dueber and the
> UMICH/code4lib folks have done amazing work; check it out here -
>
>         https://github.com/mlibrary/umich_solr_library_filters
>
>   - Erik
>
>
> > On Jan 5, 2017, at 5:08 AM, Sebastian Riemer <s.riemer@littera.eu>
> wrote:
> >
> > Hi folks,
> >
> >
> > TL;DR: Is there an easy way to copy ISBNs with hyphens to the general
> text field, or to configure the analyser on that field, so that a
> search for the hyphenated ISBN returns exactly the matching document?
> >
> > Long version:
> > I've defined a field "text" of type "text_general", where I copy all
> > my other fields to, to be able to do a "quick search" where I set
> > q=text
> >
> > The definition of the type text_general is like this:
> >
> > <fieldType name="text_general" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >      </analyzer>
> >      <analyzer type="query">
> >        <tokenizer class="solr.StandardTokenizerFactory"/>
> >        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >      </analyzer>
> >    </fieldType>
> >
> > I now face the problem, that searching for a book with
> > text:978-3-8052-5094-8* does not return the single result I expect.
> > However searching for text:9783805250948* instead returns a result.
> > Note, that I am adding a wildcard at the end automatically, to further
> > broaden the resultset. Note also, that it does not seem to matter
> > whether I put backslashes in front of the hyphen or not (to be exact,
> > when sending via SolrJ from my application, I put in the backslashes,
> > but I don't see a difference when using SolrAdmin as I guess SolrAdmin
> > automatically inserts backslashes if needed?)
> >
> > When storing ISBNs, I store them twice, once with hyphens
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search
> on either of those values also returns the single document.
> >
> > I learned that the StandardTokenizer splits up values from fields at
> index time, and I've also learned that I can use the solrAdmin analysis and
> the debugQuery to help understand what is going on. From the analysis
> screen I see that the value 9783805250948 at index-time and
> 9783805250948* at query-time both lead to an unchanged value 9783805250948 at
> the end.
> > When given the value 978-3-8052-5094-8 for "Field Value (Index)" and
> 978-3-8052-5094-8* for "Field Value (Query)"  I can see how the ISBN is
> tokenized into 5 parts. Again, the values match on both sides (Index and
> Query).
> >
> > How does the left side correlate with the right side? My guess: The left
> side means, "Values stored in field text will be tokenized while indexing
> as shown here on the left". The right side means, "When querying on the
> field text, I'll tokenize the entered value like this, and see if I find
> something on the index" Is this correct?
> >
> > Another question: when querying and investigating the single document in
> solrAdmin, the contents I see in the column text represent the _stored_
> value of the field text, right?
> > And am I correct that this has nothing to do with what is
> actually stored in the index for searching?
> >
> > When storing the value 978-3-8052-5094-8, are only the tokenized values
> stored for search, or is the "whole word" also stored? Is there a way to
> actually see all the values which are stored for search?
> > When searching text:" 978-3-8052-5094-8" I get the single result, so I
> guess the value as a whole must also be stored in the index for searching?
> >
> > One more thing which confuses me:
> > Searching for text: 978-3-8052-5094-8 gives me 72 results, because it
> > leads to searching for "parsedquery_toString":"text:978 text:3
> > text:8052 text:5094 text:8", but searching for text:
> > 978-3-8052-5094-8* gives me 0 results, this leads to
> > "parsedquery_toString":"text:978-3-8052-5094-8*",
> >
> > Why is the appended wildcard changing the behaviour so radically? I'd
> rather expect to get something like "parsedquery_toString":"text:978 text:3
> text:8052 text:5094 text:8*",  and thus even more results.
> >
> > Btw. I've found and read an interesting blog post about storing ISBNs and
> the like here:
> http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
> However, I already store my ISBN also in a separate field, of type string,
> which works fine when I use this field for searching.
> >
> > Best regards, sorry for the enormously long question and thank you for
> listening.
> >
> > Sebastian
>
>
