lucene-general mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Wildcard searches????
Date Fri, 05 Feb 2010 22:44:47 GMT
Fuad,

I think that you took Niclas's requirements backwards.  He wants a reverse
wildcard search, where the wildcard is in the document and the search query
is more specific.

You are correct that leading wildcard is critical here.
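
As a rough illustration of the two-step approach quoted further down the
thread (retrieve candidates by shared tokens, then apply each document's
pattern to the original, untokenized query), here is a minimal sketch. It is
plain Python rather than Lucene or SOLR code; the function names, the
tokenizer rules, and document 3 are assumptions made up for illustration.

```python
import re


def tokenize(user_agent):
    """Split on '+', '/' and whitespace, lowercasing each piece (assumed rules)."""
    return {t.lower() for t in re.split(r"[+/\s]+", user_agent) if t}


def build_index(docs):
    """Map each token to the ids of the documents whose patterns contain it."""
    index = {}
    for doc_id, patterns in docs.items():
        for pattern in patterns:
            for token in tokenize(pattern):
                index.setdefault(token, set()).add(doc_id)
    return index


def search(docs, index, user_agent):
    """Return ids of documents with a stored pattern that prefixes user_agent."""
    # Step 1: candidate retrieval by shared tokens (may include mismatches).
    candidates = set()
    for token in tokenize(user_agent):
        candidates |= index.get(token, set())
    # Step 2: post-filter with the actual "reverse wildcard" test,
    # a case-sensitive prefix check against the original query string.
    return sorted(
        doc_id
        for doc_id in candidates
        if any(user_agent.startswith(p) for p in docs[doc_id])
    )


docs = {
    1: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    2: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    3: ["Mozilla/4.0+Nokia"],  # hypothetical document, retrieved but filtered out
}
index = build_index(docs)

ua_a = (
    "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
    "+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0"
)
ua_b = ua_a.replace("JP8.4.1", "JP9.5.1")

print(search(docs, index, ua_a))  # [1, 2]
print(search(docs, index, ua_b))  # [1, 2]
```

Note that step 1 only finds a document if its pattern shares at least one
whole token with the query, so a shortened form such as "Moz" would never
become a candidate in this sketch.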

On Fri, Feb 5, 2010 at 2:25 PM, Digy <digydigy@gmail.com> wrote:

> http://en.wikipedia.org/wiki/Crossposting
>
> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: Saturday, February 06, 2010 12:12 AM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Hi Fuad and thanks for your reply!
>
> I now know that my first post took the wrong approach; I should not have
> the wildcard included in my index.
>
> However, I can't do as you suggest and have the full user agent in the
> index; avoiding that is the whole idea, actually.
>
> The reason can be explained like this: device manufacturers are literally
> spitting out new devices and updates all the time, which generates new user
> agents that are very similar; perhaps only a small version number differs.
> So what I need is to have a "generalization" of the user agent in my
> index, to only have the start of the user agent without including the
> version numbers.
> This way my index is always "up to date" even if users with new
> version numbers access my search service. The version number isn't
> significant in my app but instead causes me problems....
>
> Example:
>
> I have 2 indexed documents where each document's useragents field is partial:
> <doc>
>        <id>1</id>
>        <useragents>
>        Firefox
>            Mozilla/4.0+SonyEricsson
>        </useragents>
> </doc>
> <doc>
>        <id>2</id>
>        <useragents>
>        Firefox
>            Mozilla/4.0+SonyEricsson
>        </useragents>
> </doc>
>
> User A searches my app with a user agent such as:
>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> The search app will display both document 1 and 2, because his user agent
> starts exactly with the user agent pattern in my documents.
>
>
> User B searches my app with a user agent such as (please note that this user
> agent differs near the end from User A's: JP9.5.1 instead of JP8.4.1):
>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
>
> The search app will also display both document 1 and 2, because his user
> agent starts exactly with the user agent pattern in my documents,
> even though the version number of the Java platform differs between users A
> and B.
>
> If we now have a different index with FULL user agents, only User A would
> have documents returned; none of the documents' user agents would match User
> B's user agent because of the "silly" version number!!
>
> <doc>
>        <id>1</id>
>        <useragents>
>        Firefox
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>        </useragents>
> </doc>
> <doc>
>        <id>2</id>
>        <useragents>
>        Firefox
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>        </useragents>
> </doc>
>
> Can you see my problem?
> So the basic thing is whether I can somehow do a query saying that a match
> should take place if the user's useragent starts with the value of a
> document's useragent.
>
> In theory, a "startsWith" function / logic is easy enough to
> implement in C# / T-SQL, but how on earth should I do this in Solr /
> Lucene?????
>
> Regards
>
> Niclas
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: 05 February 2010 22:49
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Niclas,
>
> I looked at your initial post; you are creating a document with the field
> value "abc*", which has nothing to do with a "wildcard query"!
>
> Of course, query [useragents:abcdefghijklm] will return no results, and
> [q=useragents:abc] no results, but [q=useragents:abc*] will return
> something.
>
> text_rev is a SOLR type specifically for _leading_ wildcard queries; you don't
> need it (you don't need _leading_ wildcard queries).
>
> At indexing time, instead of
> <doc>
> <useragents>
>                Firefox*
>                Mozilla/4.0*
> </useragents>
> </doc>
>
>
> You should index
> <doc>
> <useragents>
>
>  Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> </useragents>
> </doc>
>
> And also, you need to choose the SOLR type properly: for instance, textTight or
> textgen, or even a non-tokenized string!
>
>
> And, query [q=useragents:moz*] will return this document (even if this
> field is nontokenized).
>
>
> -Fuad
>
>
> P.S. Don't use * when you create a Lucene document; use it as part of the query.
>
>
>
>
> > -----Original Message-----
> > From: Niclas Rothman [mailto:niro@lechill.com]
> > Sent: February-05-10 4:44 PM
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Ted, I'm using SOLR, but I can't figure out what fieldtype I should
> > use to get a query like this to work:
> >
> >
> > q=useragents: abcdefghijklm
> >
> >
> > where I have in my index one document with value "abc" in field
> > "useragents"
> >
> > That query results in 0 hits.
> >
> > If I issue this I get 1 hit, of course (exact match):
> >
> > q=useragents: Mozilla
> >
> >
> > My document definition in SOLR looks like:
> >
> > <fields>
> >     <field name="id" type="tint" indexed="true" stored="true"
> > required="true" />
> >     <field name="useragents" type="text_rev" indexed="true"
> > stored="true" required="false" multiValued="true" />
> > </fields>
> >
> > Any clue?
> >
> > Nic
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 21:18
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > This is quite close.  You will have to break down the user agent that is
> > your query into the same kinds of pieces as you did for your index.  Lucene
> > will only do exact matching of terms during searching (wildcard queries are
> > handled by exploding the term into all possible variants).
> >
> > Regarding the field type, you will probably have to customize that a fair
> > bit to make +'s be separators and such.  If you use SOLR to index and query
> > your data, then it will make sure that your separation into tokens is
> > compatible unless you are using shortened forms like you mention here.
> >
> > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <niro@lechill.com> wrote:
> >
> > > Hi again Ted and many thanks for your efforts.
> > > Ok, just to be sure that we fully understand each other:
> > >
> > > In my index I will store partial useragents without any wildcards *, e.g.
> > >
> > > Fire    (for Firefox)
> > > Inte    (Internet Explorer)
> > > Moz     (Mozilla)
> > >
> > >
> > > When I search my index at runtime for Media objects that are compatible
> > > with a useragent, e.g.:
> > >
> > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > >
> > > Hopefully Lucene / Solr will serve me all Media objects that partially
> > > match my full user agent string, and perhaps also some mismatches. To be
> > > absolutely sure that I only show Media objects that are compatible, I will
> > > have to loop through the result set in my program to do a final test and
> > > exclude any mismatches.
> > >
> > > Is this what you are saying, Ted: that I can't do the whole process in
> > > Solr / Lucene, and that I need to do the final test in my program (C#)?
> > >
> > > Also, I'm using Solr 1.4; what fieldtype would you recommend for the
> > > useragent (tokenized)?
> > >
> > > Okay, let's see what you have to say about this.
> > > Please bear with me, I'm all new to Lucene and Solr!!
> > >
> > > Regards
> > > Niclas
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 20:43
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Yes.  I think you have it.
> > >
> > > To explain in a bit more detail, I think that you should store a tokenized
> > > form of the user agents and should query using a tokenized form of your
> > > user agent.  This will retrieve documents that have partial matches to the
> > > user agent of interest.  Many of these matches, however, may not meet the
> > > requirements of the wildcard expression in the documents.  As such, you
> > > will need to look at each retrieved document and retrieve the wildcard
> > > expression from each one in turn to test if the original (untokenized)
> > > query satisfies the wildcard.
> > >
> > > If your wildcards are all of a positive nature as your example is, then
> > > this should work pretty well.
> > >
> > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <niro@lechill.com> wrote:
> > >
> > > > Hi Ted and thanks for all your efforts.
> > > > Listen, I'm a little bit lost here trying to understand what you are
> > > > trying to tell me :-)
> > > >
> > > > 1. I store my useragents in a field that is tokenized.
> > > > 2. Then when I search, you are saying that I should "scan" down the
> > > > matches via a SOLR function, or what?
> > > > Are you referring to these functions in SOLR?
> > > >
> > > > http://wiki.apache.org/solr/FunctionQuery
> > > >
> > > >
> > > > Sorry for not grasping this immediately!
> > > >
> > > > Regards Niclas
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > Sent: 05 February 2010 17:44
> > > > To: general@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Tokenize your user agent strings, then store the tokenized form
> > > > separately from the wildcard.  At retrieval time, scan down the matches
> > > > and apply the wildcard from each document to your original query.  The
> > > > SOLR function query might be useful for this, as would a custom hit
> > > > collector.
> > > >
> > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <niro@lechill.com> wrote:
> > > >
> > > > > Hi there, I'm facing a problem and would like to ask the community for
> > > > > some help.
> > > > >
> > > > > In my index I store browser useragent values as "wildcarded" / partial,
> > > > > which means that an indexed document should only be shown to an end
> > > > > user if his browser's useragent matches a wildcarded useragent in my
> > > > > document.
> > > > >
> > > > > So what I have is actually a "reversed" matching: the wildcards are in
> > > > > my document and NOT in my actual query.
> > > > > Does anyone know if this "setup" is possible, e.g. to execute a query
> > > > > in the style of:
> > > > >
> > > > > useragents:
> > > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > >
> > > > > In this example I would have a hit because Mozilla/4.0* matches the
> > > > > useragent.
> > > > >
> > > > > <doc>
> > > > > <useragents>
> > > > >                Firefox*
> > > > >                Mozilla/4.0*
> > > > > </useragents>
> > > > > </doc>
> > > > >
> > > > >
> > > > > Regards
> > > > > Niclas
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>
>
>
>


-- 
Ted Dunning, CTO
DeepDyve
