lucene-general mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Wildcard searches????
Date Sat, 06 Feb 2010 04:14:06 GMT
Hi Niclas,


"generalization" of the user agent "without including the versions numbers"...

How will you separate Mozilla/5.0 (Browser) from Mozilla/5.0 (Googlebot)?
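
To make the question above concrete, here is a minimal Python sketch (not part of the original thread) of what a prefix-only "generalization" does to two very different clients. The helper name generalize and both sample user agent strings are illustrative assumptions.

```python
def generalize(user_agent: str) -> str:
    """Keep only the leading product token (hypothetical helper)."""
    # Everything after the first space, including version details, is dropped.
    return user_agent.split(" ", 1)[0]


# Illustrative sample strings, not taken from the thread.
browser = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) "
           "Gecko/20091221 Firefox/3.5.7")
crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Both clients collapse to the same generalized value.
same_prefix = generalize(browser) == generalize(crawler)
print(generalize(browser), generalize(crawler), same_prefix)
```

Once both strings are reduced to the same leading token, nothing in the index can tell the browser from the crawler, which is the concern raised above.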

And, going to the root of the problem... why do you use SOLR in such a way? Is it a search service
showing different content depending on browser type (WAP vs. HTML)???

If it is, you are implementing the so-called "business use case" improperly...

Search Engine Results Pages (SERP) should not depend on the User-Agent HTTP request header.

But the raw TCP output may depend on it, and that is not the SOLR/Lucene layer; it is an upper layer...
The Tomcat Servlet Container, for instance, may generate different output depending on whether the
client is a mobile device (WAP) or a browser (Mozilla compatible)...
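
As a rough illustration of that layering (a sketch only, not part of the original thread), the Python below keeps the search step independent of the user agent and lets a separate presentation step pick the markup. All names (search, is_mobile, render) and the detection rule are hypothetical assumptions; real device detection is considerably more involved.

```python
def search(query):
    """Stand-in for the search layer: results do not depend on the user agent."""
    return [{"id": 1, "title": "First hit"}, {"id": 2, "title": "Second hit"}]


def is_mobile(user_agent):
    """Crude illustrative check, assumed for this sketch only."""
    return "MIDP" in user_agent or "UP.Link" in user_agent


def render(results, user_agent):
    """Presentation layer: same results, different markup per client type."""
    if is_mobile(user_agent):
        items = "".join(f"<p>{r['title']}</p>" for r in results)
        return f"<wml><card>{items}</card></wml>"
    items = "".join(f"<li>{r['title']}</li>" for r in results)
    return f"<html><body><ul>{items}</ul></body></html>"


phone = "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1"
desktop = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/3.5.7"
print(render(search("anything"), phone))
print(render(search("anything"), desktop))
```

The point of the split is that the index never needs to store user agents at all in this arrangement; only the rendering step looks at the header.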

I don't know the specifics of your use case... as Ted mentioned, it's much better to post SOLR-specific
questions to solr-user@lucene.apache.org...


-Fuad
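
For readers following the quoted question about "startsWith" matching, here is a minimal Python sketch (not part of the original thread) of one way to think about it: since the stored patterns contain no wildcards, the incoming user agent can be expanded into all of its leading substrings, and each one looked up as an exact value. The dictionary below is only a toy stand-in for an index, the helper names are hypothetical, and document 3 with the "Opera" pattern is an invented example. How such an expansion would be expressed in an actual Solr or Lucene query is left open here.

```python
def prefixes(user_agent, min_len=3):
    """All leading substrings of user_agent with at least min_len characters."""
    return [user_agent[:n] for n in range(min_len, len(user_agent) + 1)]


# Toy stand-in for an index: stored pattern -> ids of documents carrying it.
# Documents 1 and 2 mirror the quoted example; document 3 is invented.
index = {
    "Firefox": {1, 2},
    "Mozilla/4.0+SonyEricsson": {1, 2},
    "Opera": {3},
}


def search(user_agent):
    """Return ids of documents whose stored pattern is a prefix of user_agent."""
    hits = set()
    for candidate in prefixes(user_agent):
        # Exact lookup only: no wildcard is needed on either side.
        hits |= index.get(candidate, set())
    return hits


ua_a = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/"
        "MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0")
ua_b = ua_a.replace("JP8.4.1", "JP9.5.1")  # same device, newer Java platform

print(search(ua_a))
print(search(ua_b))
```

Both user agents reach the same documents because the differing version number sits far behind the stored pattern. The cost is one lookup per character of the user agent, so very long strings mean many candidate values.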



> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 6:12 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Hi Fuad and thanks for your reply!
> 
> I now know that my first post took the wrong approach; I should not have
> included the wildcard in my index.
> 
> However, I can't do as you suggest and store the full user agent in the
> index; avoiding that is the whole idea, actually.
> 
> The reason is this: device manufacturers are literally
> spitting out new devices and updates all the time, which generates new
> user agents that are very similar; perhaps only a small version number
> differs.
> So what I need is a "generalization" of the user agent in my
> index: only the start of the user agent, without the
> version numbers.
> This way my index is always "up to date" even when users with new
> version numbers access my search service. The version number isn’t
> significant in my app; it only causes me problems....
> 
> Example:
> 
> I have 2 indexed documents whose useragents fields hold partial
> user agents:
> <doc>
> 	<id>1</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricsson
> 	</useragents>
> </doc>
> <doc>
> 	<id>2</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricsson
> 	</useragents>
> </doc>
> 
> User A searches my app with a user agent like:
> 
> 	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 
> The search app will display both documents 1 and 2, because his user
> agent starts with the user agent pattern stored in my documents.
> 
> 
> User B searches my app with a user agent like this (please note that it
> differs from User A's near the end: JP9.5.1 instead of
> JP8.4.1):
> 
> 	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
> 
> The search app will also display both documents 1 and 2, because his user
> agent starts with the user agent pattern stored in my documents,
> even though the version number of the Java platform differs between users A
> and B.
> 
> If we now have a different index with FULL user agents, only User A
> would get documents returned; none of the documents' user agents match
> User B's user agent because of the "silly" version number!!
> 
> <doc>
> 	<id>1</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 	</useragents>
> </doc>
> <doc>
> 	<id>2</id>
> 	<useragents>
>       	Firefox
>             Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> 	</useragents>
> </doc>
> 
> Can you see my problem?
> So the basic question is whether I can somehow write a query saying that a match
> should take place if the user's user agent starts with the value of a
> document's useragents field.
> 
> In theory, a startsWith function / logic is easy enough to
> implement in C# / T-SQL, but how on earth should I do this in Solr /
> Lucene?????
> 
> Regards
> 
> Niclas
> 
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: 05 February 2010 22:49
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
> 
> Niclas,
> 
> I looked at your initial post: you are creating a document with the field
> value "abc*", which has nothing to do with a "wildcard query"!
> 
> Of course, query [useragents:abcdefghijklm] will return no results, and
> [q=useragents:abc] no results, but [q=useragents:abc*] will return
> something.
> 
> text_rev is a SOLR type specifically for _leading_ wildcard queries; you don't
> need it (you don't need _leading_ wildcard queries).
> 
> At indexing time, instead of
> <doc>
> <useragents>
>                 Firefox*
>                 Mozilla/4.0*
> </useragents>
> </doc>
> 
> 
> You should index
> <doc>
> <useragents>
> 	Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
> </useragents>
> </doc>
> 
> And you also need to choose the SOLR type properly; for instance, textTight
> or textgen, or even a non-tokenized string!
> 
> 
> And, query [q=useragents:moz*] will return this document (even if this
> field is nontokenized).
> 
> 
> -Fuad
> 
> 
> P.S. Don't use * when you create a Lucene document; use it as part of
> the query.
> 
> 
> 
> 
> > -----Original Message-----
> > From: Niclas Rothman [mailto:niro@lechill.com]
> > Sent: February-05-10 4:44 PM
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Ted, I'm using SOLR, but I can't figure out what fieldtype I should
> > use to get a query like this to work:
> >
> >
> > q=useragents: abcdefghijklm
> >
> >
> > where I have in my index one document with value "abc" in field
> > "useragents"
> >
> > That query results in 0 hits.
> >
> > If I issue this I get 1 hit, of course (exact match):
> >
> > q=useragents: Mozilla
> >
> >
> > My document definition in SOLR looks like:
> >
> > <fields>
> >     <field name="id" type="tint" indexed="true" stored="true"
> > required="true" />
> >     <field name="useragents" type="text_rev" indexed="true"
> > stored="true" required="false" multiValued="true" />
> > </fields>
> >
> > Any clue?
> >
> > Nic
> >
> >
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 21:18
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > This is quite close.  You will have to break down the user agent that is
> > your query into the same kinds of pieces as you did for your index. Lucene
> > will only do exact matching of terms during searching (wildcard queries are
> > handled by exploding the term into all possible variants).
> >
> > Regarding the field type, you will probably have to customize that a fair
> > bit to make +'s be separators and such.  If you use SOLR to index and query
> > your data, then it will make sure that your separation into tokens is
> > compatible unless you are using shortened forms like you mention here.
> >
> > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman <niro@lechill.com> wrote:
> >
> > > Hi again Ted and many thanks for your efforts.
> > > Ok, just to be sure that we fully understand each other:
> > >
> > > In my index I will store partial useragents without any wildcards *, e.g.
> > >
> > > Fire    (for Firefox)
> > > Inte    (Internet Explorer)
> > > Moz     (Mozilla)
> > >
> > >
> > > When I search my index at runtime for Media objects that are compatible
> > > with a useragent, e.g.:
> > >
> > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > >
> > > Hopefully lucene / solr will serve me all Media objects that partially
> > > match my full user agent string, and perhaps also some mismatches. To be
> > > absolutely sure that I only show Media objects that are compatible, I will
> > > have to loop through the result set in my program to do a final test and
> > > exclude any mismatches.
> > >
> > > Is this what you are saying, Ted: that I can't do the whole process in Solr /
> > > Lucene, and that I need to do the final test in my program (C#)?
> > >
> > > Also, I'm using Solr 1.4; what fieldtype would you recommend for the
> > > useragent (tokenized)?
> > >
> > > Okay, let's see what you have to say about this.
> > > Please bear with me, I'm all new to lucene and solr!!
> > >
> > > Regards
> > > Niclas
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 20:43
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Yes.  I think you have it.
> > >
> > > To explain in a bit more detail, I think that you should store a tokenized
> > > form of the user agents and should query using a tokenized form of your user
> > > agent.  This will retrieve documents that have partial matches to the user
> > > agent of interest.  Many of these matches, however, may not meet the
> > > requirements of the wildcard expression in the documents.  As such, you will
> > > need to look at each retrieved document and take the wildcard expression from
> > > each one in turn to test whether the original (untokenized) query satisfies the
> > > wildcard.
> > >
> > > If your wildcards are all of a positive nature, as your example is, then this
> > > should work pretty well.
> > >
> > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman <niro@lechill.com> wrote:
> > >
> > > > Hi Ted and thanks for all your efforts.
> > > > Listen, I'm a little bit lost here trying to understand what you are trying
> > > > to tell me :-)
> > > >
> > > > 1. I store my useragents in a field that is tokenized.
> > > > 2. Then when I search, you are saying that I should "scan" down the matches
> > > > via a SOLR function, or what?
> > > > Are you referring to these functions in SOLR?
> > > >
> > > > http://wiki.apache.org/solr/FunctionQuery
> > > >
> > > >
> > > > Sorry for not grasping this immediately!
> > > >
> > > > Regards Niclas
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > Sent: 05 February 2010 17:44
> > > > To: general@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Tokenize your user agent strings, then store the tokenized form separately
> > > > from the wildcard.  At retrieval time, scan down the matches and apply the
> > > > wildcard from each document to your original query.  The SOLR function query
> > > > might be useful for this, as would a custom hit collector.
> > > >
> > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman <niro@lechill.com> wrote:
> > > >
> > > > > Hi there, I'm facing a problem and would like to ask the community for some
> > > > > help.
> > > > >
> > > > > In my index I store browser useragent values as "wildcarded" / partial,
> > > > > meaning that an indexed document
> > > > > should only be shown to an end user if his browser's useragent matches a
> > > > > wildcarded useragent in my document.
> > > > >
> > > > > So what I have is actually a "reversed" matching: the wildcards are in my
> > > > > document and NOT in my actual query.
> > > > > Does anyone know if this "setup" is possible, e.g. to execute a query in
> > > > > the style of:
> > > > >
> > > > > useragents:
> > > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > >
> > > > > In this example I would have a hit because Mozilla/4.0* matches the
> > > > > useragent.
> > > > >
> > > > > <doc>
> > > > > <useragents>
> > > > >                Firefox*
> > > > >                Mozilla/4.0*
> > > > > </useragents>
> > > > > </doc>
> > > > >
> > > > >
> > > > > Regards
> > > > > Niclas
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> 



