From: "Uwe Schindler"
Reply-To: java-user@lucene.apache.org
Subject: RE: Lucene vs Glimpse
Date: Tue, 5 Feb 2013 10:43:03 +0100
Message-ID: <03f501ce0385$31f7cc20$95e76460$@thetaphi.de>

Glimpse seems to use something similar to StandardAnalyzer, so I would give it a try.
For program code this should work quite well. To make the "auto-phrases" work (which might be a good idea here, too), enable this feature in the query parser (I am referring to Jack's comment about auto-phrase).

You don't really need to worry about fields, either. The general approach for this type of search is:
- Create one field (indexed+stored) with the document ID (e.g. file name)
- One field (stored) with the document title (if applicable)
- One analyzed-only field (no storing needed, unless you want highlighting) called "content" that gets the whole text of your program code

After that you can query the Lucene index using a correctly configured query parser with default field "content", analyzer=StandardAnalyzer and auto-phrases enabled. The stored fields are only needed to present the search results; they are just the metadata you display to the user after the search.

That's all you need, you should give it a try! Your issue was just a configuration issue. That's quite a common use case. Maybe you should buy the book "Lucene in Action, 2nd edition" to learn more about correct text analysis and about common techniques for indexing your data.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Mathias Dahl [mailto:mathias.dahl@gmail.com]
> Sent: Tuesday, February 05, 2013 10:26 AM
> To: java-user
> Subject: Re: Lucene vs Glimpse
>
> Jack,
>
> What you say sounds hopeful, but it also sounds like quite some work to
> define/select the correct analyzer for each type of programming language
> (we use SQL, PL/SQL, Java and C# mainly). Compare that to what I do now,
> which is just to throw all files at Glimpse, and it makes them searchable in a
> very good way (it sounds like I am trying to sell Glimpse here or trying to bash
> Lucene, but that is not my intent).
>
> What got me started thinking about this is that I got different query results
> for the same files using the Lucene demo examples and Glimpse.
> To be specific, it was this piece of code that Lucene did not find for
> me:
>
> ...
> import com.sun.org.apache.xerces.internal.parsers.DOMParser;
> ...
>
> Using Glimpse I get a hit on a file with that content by searching for "xerces".
> With Lucene I did not. So I changed the example code to use the
> ClassicAnalyzer, which I interpreted as doing what I wanted (i.e.
> "split" at punctuation). That did not work either (I also changed the analyzer
> in the search example program). I am sure it is possible to make the above
> work, but then I started thinking: if the above should work, will I get a
> match for a string like "someObjectInstance.someMethod()"? If I understand
> it correctly, the way to support searches like that is to really try to parse the
> Java language and put the necessary information in special "fields" in the
> index. But things kind of start to grow here if you think about what kind of
> searches people want to do (people do not want to think, I have noticed,
> they want to search like they do on Google, and I cannot even teach my
> developer colleagues to use regexps...). I would need to have separate
> analyzers (I guess) for different languages and take all these small details,
> when it comes to how people want to search, into account.
>
> Or is there some other clever way to do what I want? I was thinking that
> maybe I could do what Glimpse does on a high level (described here, btw:
> http://webglimpse.net/pubs/glimpse.pdf), and do some kind of combination
> of an index search and a search through the files.
>
> I hope this made things at least a little bit clearer ;) Again, I am seeing it from
> the perspective of a Glimpse user where the searches most people use "just
> work" (but due to licensing I don't think we can continue to use it).
>
> Thanks!
>
> /Mathias
>
> On Mon, Feb 4, 2013 at 9:31 PM, Jack Krupansky wrote:
> > Generally, all of your example queries should work fine with Lucene,
> > provided that you carefully choose your analyzer, or even use the
> > StandardAnalyzer. The special characters like underscore and dot
> > generally get treated as spaces and the resulting sequence of terms
> > would match as a phrase. It won't be a 100% solution, but it should do
> > reasonably well.
> >
> > Is there a query that was failing to match reasonably for you?
> >
> > -- Jack Krupansky
> >
> > -----Original Message-----
> > From: Mathias Dahl
> > Sent: Monday, February 04, 2013 1:01 PM
> > To: java-user@lucene.apache.org
> > Subject: Lucene vs Glimpse
> >
> >
> > Hi,
> >
> > I have hacked together a small web front end to the Glimpse text
> > indexing engine (see http://webglimpse.net/ for information). I am
> > very happy with how Glimpse indexes and searches data. If I understand
> > it correctly it uses a combination of an index and searching directly
> > in the files themselves, like grep or other tools. The problem is that I
> > discovered it is not open source and now that I want to extend the use
> > from private to company-wide I will run into license problems/costs.
> >
> > So, I decided to try out Lucene. I tried the examples and changed them
> > a bit to use another analyzer. But when I started to think about it I
> > realized that I will not be able to build something like Glimpse. At
> > least not easily.
> >
> > Why? I will try to explain:
> >
> > As stated above, Glimpse uses a combination of index and in-file
> > search.
> > This makes it very powerful in the sense that I can get hits
> > for things that are not necessarily being indexed as terms. Let's say
> > I have a file with this content:
> >
> > ...
> > import foo.bar.baz;
> > ...
> >
> > With Glimpse, and without telling it how to index the content, I can
> > find the above file using a search string like "foo" or "bar" but
> > also, and this is important, using foo.bar.baz.
> >
> > Another example:
> >
> > We have a lot of PL/SQL source code, and often you can find code like this:
> >
> > ...
> > My_Nice_API.Some_Method
> > ...
> >
> > Here too, Glimpse is almost magic since it combines index and normal
> > search. I can find the file above using "My_Nice_API" or
> > "My_Nice_API.Some_Method".
> >
> > In a sense I can have my cake and eat it too.
> >
> > If I want to do similar "free" search stuff with Lucene I think I have
> > to create analyzers for the different kinds of source code files, with
> > fields for this and that. Quite an undertaking.
> >
> > Does anyone understand my point here and am I correct in that it would
> > be hard to implement something as "free" as with Glimpse? I am not
> > trying to criticize, just understand how Lucene (and Glimpse) works.
> >
> > Oh, yes, Glimpse has one big drawback: it only supports search strings
> > up to 32 characters.
> >
> > Thanks!
> >
> > /Mathias
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
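
To illustrate the auto-phrase idea discussed in this thread, here is a minimal, self-contained sketch in plain Java. It does not use Lucene: ToyPhraseIndex, tokenize, add and search are hypothetical names invented for this example, and the tokenizer rests on the assumption that text is split at every character that is not a letter or digit. Content and query go through the same tokenizer, and a multi-term query only matches when its terms occur at consecutive positions, which is roughly what a phrase query does. Which characters a real Lucene analyzer splits at depends on the analyzer and version, so it is worth checking the tokens it actually produces for dotted names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy positional index for illustration only; NOT Lucene. All names are hypothetical. */
final class ToyPhraseIndex {
    // term -> (document id -> positions of that term in the document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Assumption: split at every character that is not a letter or digit, then lowercase. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Indexes the content of one document (for example one source file). */
    void add(String docId, String content) {
        List<String> terms = tokenize(content);
        for (int pos = 0; pos < terms.size(); pos++) {
            postings.computeIfAbsent(terms.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Tokenizes the query like the content and requires its terms at consecutive positions. */
    Set<String> search(String query) {
        List<String> terms = tokenize(query);
        Set<String> hits = new TreeSet<>();
        if (terms.isEmpty()) {
            return hits;
        }
        Map<String, List<Integer>> first = postings.getOrDefault(terms.get(0), Map.of());
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (phraseAt(e.getKey(), terms, start)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    /** True if terms 1..n-1 follow directly after the first term found at position start. */
    private boolean phraseAt(String docId, List<String> terms, int start) {
        for (int i = 1; i < terms.size(); i++) {
            List<Integer> positions = postings.getOrDefault(terms.get(i), Map.of())
                                              .getOrDefault(docId, List.of());
            if (!positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyPhraseIndex index = new ToyPhraseIndex();
        index.add("Parser.java", "import com.sun.org.apache.xerces.internal.parsers.DOMParser;");
        index.add("api.sql", "My_Nice_API.Some_Method");
        System.out.println(index.search("xerces"));
        System.out.println(index.search("My_Nice_API.Some_Method"));
        System.out.println(index.search("Some_Method.My_Nice_API")); // terms in the wrong order
    }
}
```

With this toy tokenizer, a search for "xerces" is a single-term lookup and a search for "My_Nice_API.Some_Method" becomes a five-term phrase, so neither needs language-specific fields. The same terms in a different order do not match, because the positions are no longer consecutive.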