From: "Fuad Efendi"
Delivered-To: mailing list general@lucene.apache.org
Subject: RE: Wildcard searches????
Date: Sat, 6 Feb 2010 00:14:06 -0400
Message-ID: <072201caa6e2$d4f13e00$7ed3ba00$@ca>
In-Reply-To: <2D088220C451EC44B520D000D22502B0645A0FBD1A@pacha.corp.lechillmobile.com>

Hi Niclas,

"generalization" of the user agent "without including the version
numbers"... How will you separate Mozilla/5.0 (Browser) from Mozilla/5.0
(Googlebot)?

And, going to the root of the problem... why do you use SOLR in such a way?
Is it a search service showing different content depending on browser type
(WAP vs. HTML)???

If it is, you are implementing the so-called "business use case"
improperly... Search Engine Results Pages (SERP) should not depend on the
User-Agent HTTP request header.

But the raw TCP output may depend on it, and that is not the SOLR/Lucene
layer; it is an upper layer... The Tomcat servlet container, for instance,
may generate different output depending on whether the client is a mobile
device (WAP) or a browser (Mozilla compatible)...

I don't know your use case specifics... as Ted mentioned, it's much
better to post SOLR-specific questions to solr-user@lucene.apache.org...

-Fuad

> -----Original Message-----
> From: Niclas Rothman [mailto:niro@lechill.com]
> Sent: February-05-10 6:12 PM
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Hi Fuad and thanks for your reply!
>
> I know now that my first post took the wrong approach: I should not have
> the wildcard included in my index.
>
> However, I can't do as you suggest and have the full user agent in the
> index; avoiding that is the whole idea, actually.
>
> The reason can be explained like this: device manufacturers are literally
> spitting out new devices and updates all the time, which generates new
> user agents that are very similar; perhaps only a small version number
> differs.
> So what I need is a "generalization" of the user agent in my
> index, with only the start of the user agent and without the
> version numbers.
> This way my index is always "up to date" even if users with new
> version numbers access my search service. In my app the version numbers
> aren't significant; they only cause me problems....
>
> Example:
>
> I have 2 indexed documents where the documents' useragent fields are
> partial:
>
> 1
>
> Firefox
> Mozilla/4.0+SonyEricsson
>
> 2
>
> Firefox
> Mozilla/4.0+SonyEricsson
>
> User A searches my app with a user agent such as:
>
> Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> The search app will display both document 1 and 2, because his user
> agent starts exactly as the user agent pattern in my document does.
>
> User B searches my app with a user agent such as (please note that this
> user agent differs from User A's near the end: JP9.5.1 instead of
> JP8.4.1):
>
> Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP9.5.1+UP.Link/6.3.1.20.0
>
> The search app will also display both document 1 and 2, because his user
> agent starts exactly as the user agent pattern in my document does,
> even though the version number of the Java platform differs between
> User A and User B.
>
> If we now have a different index with FULL user agents, only User A
> would have documents returned; none of the documents' user agents would
> match User B's user agent because of the "silly" version number!!
>
> 1
>
> Firefox
> Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> 2
>
> Firefox
> Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> Can you see my problem?
> So the basic question is whether I can somehow do a query saying that a
> match should take place if the user's useragent starts with the value of
> a document's useragent.
>
> In theory, a "startsWith" function / logic is easy enough to
> implement in C# / T-SQL, but how on earth should I do this in Solr /
> Lucene?????
>
> Regards
>
> Niclas
>
>
> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: 05 February 2010 22:49
> To: general@lucene.apache.org
> Cc: java-user@lucene.apache.org
> Subject: RE: Wildcard searches????
>
> Niclas,
>
> I looked at your initial post; you are creating a document with the field
> value "abc*"
> - nothing related to a "wildcard query"!
>
> Of course, the query [useragents:abcdefghijklm] will return no results, and
> [q=useragents:abc] no results, but [q=useragents:abc*] will return
> something.
>
> text_nav is a specific SOLR type for _leading_ wildcard queries; you don't
> need it (you don't need _leading_ wildcard queries).
>
> At indexing time, instead of
>
> Firefox*
> Mozilla/4.0*
>
> you should index
>
> Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP8.4.1+UP.Link/6.3.1.20.0
>
> Also, you need to choose the SOLR type properly; for instance, textTight
> or textgen, or even a non-tokenized string!
>
> And the query [q=useragents:moz*] will return this document (even if this
> field is non-tokenized).
>
> -Fuad
>
> P.S. Don't use * when you create a Lucene document; use it as part of
> the query.
>
>
> > -----Original Message-----
> > From: Niclas Rothman [mailto:niro@lechill.com]
> > Sent: February-05-10 4:44 PM
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: RE: Wildcard searches????
> >
> > Ted, I'm using SOLR, but I can't figure out what fieldtype I should
> > use to get a query like this to work:
> >
> > q=useragents: abcdefghijklm
> >
> > where I have in my index one document with value "abc" in field
> > "useragents"
> >
> > That query results in 0 hits.
> >
> > If I issue this I get 1 hit, of course (exact match):
> >
> > q=useragents: Mozilla
> >
> > My document definition in SOLR looks like:
> >
> > required="true" />
> > stored="true" required="false" multiValued="true" />
> >
> > Any clue?
> >
> > Nic
> >
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: 05 February 2010 21:18
> > To: general@lucene.apache.org
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Wildcard searches????
> >
> > This is quite close.
> > You will have to break down the user agent that is
> > your query into the same kinds of pieces as you did for your index.
> > Lucene will only do exact matching of terms during searching (wildcard
> > queries are handled by expanding the term into all matching variants).
> >
> > Regarding the field type, you will probably have to customize that a
> > fair bit to make +'s be separators and such.  If you use SOLR to index
> > and query your data, then it will make sure that your separation into
> > tokens is compatible unless you are using shortened forms like you
> > mention here.
> >
> > On Fri, Feb 5, 2010 at 12:03 PM, Niclas Rothman
> > wrote:
> >
> > > Hi again Ted and many thanks for your efforts.
> > > Ok, just to be sure that we fully understand each other:
> > >
> > > In my index I will store partial useragents without any wildcards (*),
> > > e.g.
> > >
> > > Fire (for Firefox)
> > > Inte (Internet Explorer)
> > > Moz (Mozilla)
> > >
> > > At runtime I search my index for Media objects that are
> > > compatible with a useragent, e.g.:
> > >
> > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > >
> > > Hopefully Lucene / Solr will serve me all Media objects that
> > > partially match my full user agent string, and perhaps also some
> > > mismatches.  To be absolutely sure that I only show Media objects that
> > > are compatible, I will have to loop through the result set in my
> > > program to do a final test and exclude any mismatches.
> > >
> > > Is this what you are saying, Ted: that I can't do the whole process in
> > > Solr / Lucene, and that I need to do the final test in my program (C#)?
> > >
> > > Also, I'm using Solr 1.4; what fieldtype would you recommend for
> > > the useragent (tokenized)?
> > >
> > > Okay, let's see what you have to say about this.
> > > Please bear with me, I'm all new to Lucene and Solr!!
> > >
> > > Regards
> > > Niclas
> > >
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: 05 February 2010 20:43
> > > To: general@lucene.apache.org
> > > Cc: java-user@lucene.apache.org
> > > Subject: Re: Wildcard searches????
> > >
> > > Yes.  I think you have it.
> > >
> > > To explain in a bit more detail, I think that you should store a
> > > tokenized form of the user agents and should query using a tokenized
> > > form of your user agent.  This will retrieve documents that have
> > > partial matches to the user agent of interest.  Many of these matches,
> > > however, may not meet the requirements of the wildcard expression in
> > > the documents.  As such, you will need to look at each retrieved
> > > document in turn, retrieve its wildcard expression, and test whether
> > > the original (untokenized) query satisfies the wildcard.
> > >
> > > If your wildcards are all of a positive nature as your example is,
> > > then this should work pretty well.
> > >
> > > On Fri, Feb 5, 2010 at 9:09 AM, Niclas Rothman
> > > wrote:
> > >
> > > > Hi Ted and thanks for all your efforts.
> > > > Listen, I'm a little bit lost here trying to understand what you are
> > > > trying to tell me :-)
> > > >
> > > > 1. I store my useragents in a field that is tokenized.
> > > > 2. Then when I search, you are saying that I should "scan" down
> > > > the matches via a SOLR function, or what?
> > > > Are you referring to these functions in SOLR?
> > > >
> > > > http://wiki.apache.org/solr/FunctionQuery
> > > >
> > > >
> > > > Sorry for not grasping this immediately!
> > > >
> > > > Regards Niclas
> > > >
> > > > -----Original Message-----
> > > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > > Sent: 05 February 2010 17:44
> > > > To: general@lucene.apache.org
> > > > Cc: java-user@lucene.apache.org
> > > > Subject: Re: Wildcard searches????
> > > >
> > > > Tokenize your user agent strings, then store the tokenized form
> > > > separately from the wild card.  At retrieval time, scan down the
> > > > matches and apply the wildcard from each document to your original
> > > > query.  The SOLR function query might be useful for this, as would
> > > > a custom hit collector.
> > > >
> > > > On Fri, Feb 5, 2010 at 7:57 AM, Niclas Rothman
> > > > wrote:
> > > >
> > > > > Hi there, I'm facing a problem and would like to ask the community
> > > > > for some help.
> > > > >
> > > > > In my index I store browser useragent values as "wildcarded" /
> > > > > partial, meaning that an indexed document
> > > > > should only be shown to an end user if his browser's useragent
> > > > > matches a wildcarded useragent in my document.
> > > > >
> > > > > So what I have is actually a "reversed" matching: the wildcards
> > > > > are in my document and NOT in my actual query.
> > > > > Does anyone know if this "setup" is possible, e.g. to execute a
> > > > > query in the style of:
> > > > >
> > > > > useragents:
> > > > > "Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/JP-8.4.1+UP.Link/6.3.1.20.0"
> > > > >
> > > > > In this example I would have a hit because Mozilla/4.0* matches
> > > > > the useragent.
> > > > >
> > > > > Firefox*
> > > > > Mozilla/4.0*
> > > > >
> > > > > Regards
> > > > > Niclas
> > > > >
> > > >
> > > > --
> > > > Ted Dunning, CTO
> > > > DeepDyve
> > > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
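
For readers following the thread, here is a minimal sketch of the two-step
approach Ted describes: retrieve candidates by shared tokens, then check each
candidate's stored prefix against the original, untokenized user agent. It is
plain Python with an in-memory dict standing in for the Solr index, so it
illustrates the logic only and uses no Solr or Lucene API. The names
tokenize, build_index and compatible_docs, the choice of "+" and "/" as
separators, and document 3 are assumptions made for illustration; documents
1 and 2 and both user agents are taken from the messages above.

```python
import re
from collections import defaultdict


def tokenize(user_agent):
    """Split on '+', '/' and whitespace, lowercasing each piece (assumed rule)."""
    return [t.lower() for t in re.split(r"[+/\s]+", user_agent) if t]


def build_index(docs):
    """Map each token of every stored prefix to the ids of the documents holding it."""
    index = defaultdict(set)
    for doc_id, prefixes in docs.items():
        for prefix in prefixes:
            for token in tokenize(prefix):
                index[token].add(doc_id)
    return index


def compatible_docs(user_agent, docs, index):
    """Step 1: collect candidates sharing any token with the query.
    Step 2: keep only documents with a stored prefix that the full,
    untokenized user agent starts with (case-sensitive)."""
    candidates = set()
    for token in tokenize(user_agent):
        candidates |= index.get(token, set())
    return sorted(
        doc_id
        for doc_id in candidates
        if any(user_agent.startswith(prefix) for prefix in docs[doc_id])
    )


# Documents 1 and 2 come from the thread; document 3 is a hypothetical
# near miss that shares tokens with the query but is not a prefix of it.
DOCS = {
    1: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    2: ["Firefox", "Mozilla/4.0+SonyEricsson"],
    3: ["Mozilla/4.0+Nokia"],
}
INDEX = build_index(DOCS)

UA_BASE = ("Mozilla/4.0+SonyEricssonC905v/R1DE+Browser/NetFront/3.4"
           "+Profile/MIDP-2.1+Configuration/CLDC-1.1+JavaPlatform/")
USER_A = UA_BASE + "JP8.4.1+UP.Link/6.3.1.20.0"
USER_B = UA_BASE + "JP9.5.1+UP.Link/6.3.1.20.0"

print(compatible_docs(USER_A, DOCS, INDEX))  # [1, 2]
print(compatible_docs(USER_B, DOCS, INDEX))  # [1, 2]
```

Document 3 is retrieved as a candidate in step 1 because it shares the tokens
"mozilla" and "4.0" with the query, and it is dropped in step 2. That is the
kind of mismatch the final test is meant to exclude. Note also that the last
token of a stored prefix ("sonyericsson") never matches the query token
"sonyericssonc905v", so candidate retrieval here relies on the earlier,
complete tokens.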