lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene vs Glimpse
Date Tue, 05 Feb 2013 09:43:03 GMT
Glimpse seems to use something similar to StandardAnalyzer, so I would give it a try. For
program code this should work quite well. To make the "auto-phrases" work (which might be
a good idea here, too), enable this feature in the query parser (I am referring to Jack's
comment about auto-phrases).

You don't really need to worry about fields, either. The general approach for this type
of search is:
- Create one field (indexed+stored) with the document ID (e.g. file name)
- One field (stored) with the document title (if applicable)
- One analyzed-only field (no storing needed, unless you want highlighting) called "content"
that is getting the whole text of your program code


After that you can query the Lucene index using a correctly configured query parser with
default field "content", analyzer=StandardAnalyzer and auto-phrases enabled. The stored fields
are only needed to "present the search results"; they are just the metadata you display to the
user after the search.
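
To make the idea concrete, here is a toy sketch in plain Java of the mechanism described
above: one ID, one stored title, and one analyzed "content" field, queried with auto-phrases.
It is not Lucene code. The class ToyCodeSearch and its methods are made up for illustration,
the analysis step simply assumes that every non-alphanumeric ASCII character acts as a
separator, and the whole query string is treated as a single phrase. Real analyzers apply
their own tokenization rules, so check what your chosen analyzer does with dots and
underscores on your own files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only, NOT the Lucene API. All names are hypothetical.
class ToyCodeSearch {
    // Stored metadata: document ID (e.g. file name) -> title, used only for display.
    private final Map<String, String> titles = new HashMap<>();
    // Analyzed "content" field: document ID -> sequence of terms in document order.
    private final Map<String, List<String>> content = new HashMap<>();

    // Assumed analysis: lowercase, then split at every non-alphanumeric ASCII character.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    void add(String id, String title, String text) {
        titles.put(id, title);
        content.put(id, analyze(text));
    }

    String title(String id) {
        return titles.get(id);
    }

    // "Auto-phrase": if the query analyzes to several terms, they must occur
    // consecutively and in the same order in the document's term sequence.
    Set<String> search(String query) {
        List<String> phrase = analyze(query);
        Set<String> hits = new TreeSet<>();
        if (phrase.isEmpty()) {
            return hits;
        }
        for (Map.Entry<String, List<String>> e : content.entrySet()) {
            if (Collections.indexOfSubList(e.getValue(), phrase) >= 0) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyCodeSearch index = new ToyCodeSearch();
        index.add("Parser.java", "Parser",
                "import com.sun.org.apache.xerces.internal.parsers.DOMParser;");
        index.add("api.sql", "API", "My_Nice_API.Some_Method");
        for (String q : new String[] {"xerces", "apache.xerces", "My_Nice_API.Some_Method"}) {
            System.out.println(q + " -> " + index.search(q));
        }
    }
}
```

The point of the sketch is that no per-language analyzer is involved: the same separator
rule is applied to the files and to the query, and the phrase match does the rest.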

That's all you need, so you should give it a try! Your issue was just a configuration issue,
and this is quite a common use case. Maybe you should buy the book "Lucene in Action, 2nd edition"
to learn more about correct text analysis and about common techniques for indexing your data.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Mathias Dahl [mailto:mathias.dahl@gmail.com]
> Sent: Tuesday, February 05, 2013 10:26 AM
> To: java-user
> Subject: Re: Lucene vs Glimpse
> 
> Jack,
> 
> What you say sounds hopeful, but it also sounds like quite some work to
> define/select the correct analyzer for each type of programming language
> (we use SQL, PL/SQL, Java and C# mainly). Compare that to what I do now,
> which is just to throw all files at Glimpse, which makes them searchable in a
> very good way (it sounds like I am trying to sell Glimpse here or bash
> Lucene, but that is not my intent).
> 
> What got me started thinking about this is that I got different query results
> for the same files using the Lucene demo examples and Glimpse.
> To be specific, it was this piece of code that Lucene did not find for
> me:
> 
> ...
> import com.sun.org.apache.xerces.internal.parsers.DOMParser;
> ...
> 
> Using Glimpse I get a hit on a file with that content by searching for "xerces".
> With Lucene I did not. So I changed the example code to use the
> ClassicAnalyzer which I interpreted as doing what I wanted (i.e.
> "split" at punctuation). That did not work either (I also changed the analyzer
> in the search example program). I am sure it is possible to make the above
> work, but then I started thinking that if the above should work, will I get a
> match for a string like "someObjectInstance.someMethod()"? If I understand
> it correctly the way to support searches like that is to really try to parse the
> Java language and put the necessary information in special "fields" in the
> index. But things kind of start to grow here, if you think about what kind of
> searches people want to do (people do not want to think, I have noticed,
> they want to search like they do on Google, and I cannot even teach my
> developer colleagues to use regexps...). I would need to have separate
> analyzers (I guess) for different languages and take all these small details,
> when it comes to how people want to search, into account.
> 
> Or is there some other clever way to do what I want? I was thinking that
> maybe I could do what Glimpse does on a high level (described here, btw:
> http://webglimpse.net/pubs/glimpse.pdf), and do some kind of combination
> of an index search and a search through the files.
> 
> I hope this made things at least a little bit clearer ;) Again, I am seeing it from
> the perspective of a Glimpse user where the searches most people use "just
> work" (but due to licensing I don't think we can continue to use it).
> 
> Thanks!
> 
> /Mathias
> 
> On Mon, Feb 4, 2013 at 9:31 PM, Jack Krupansky
> <jack@basetechnology.com> wrote:
> > Generally, all of your example queries should work fine with Lucene,
> > provided that you carefully choose your analyzer, or even use the
> > StandardAnalyzer. The special characters like underscore and dot
> > generally get treated as spaces and the resulting sequence of terms
> > would match as a phrase. It won't be a 100% solution, but it should do
> > reasonably well.
> >
> > Is there a query that was failing to match reasonably for you?
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Mathias Dahl
> > Sent: Monday, February 04, 2013 1:01 PM
> > To: java-user@lucene.apache.org
> > Subject: Lucene vs Glimpse
> >
> >
> > Hi,
> >
> > I have hacked together a small web front end to the Glimpse text
> > indexing engine (see http://webglimpse.net/ for information). I am
> > very happy with how Glimpse indexes and searches data. If I understand
> > it correctly it uses a combination of an index and searching directly
> > in the files themselves, as grep or other tools do. The problem is that I
> > discovered it is not open source and now that I want to extend the use
> > from private to company wide I will run into license problems/costs.
> >
> > So, I decided to try out Lucene. I tried the examples and changed them
> > a bit to use another analyzer. But when I started to think about it I
> > realized that I will not be able to build something like Glimpse. At
> > least not easily.
> >
> > Why? I will try to explain:
> >
> > As stated above, Glimpse uses a combination of index and in-file
> > search. This makes it very powerful in the sense that I can get hits
> > for things that are not necessarily indexed as terms. Let's say
> > I have a file with this content:
> >
> > ...
> > import foo.bar.baz;
> > ...
> >
> > With Glimpse, and without telling it how to index the content I can
> > find the above file using a search string like "foo" or "bar" but
> > also, and this is important, using foo.bar.baz.
> >
> > Another example:
> >
> > We have a lot of PL/SQL source code, and often you can find code like this:
> >
> > ...
> > My_Nice_API.Some_Method
> > ...
> >
> > Here too, Glimpse is almost magic since it combines index and normal
> > search. I can find the file above using "My_Nice_API" or
> > "My_Nice_API.Some_Method".
> >
> > In a sense I can have the cake and eat it too.
> >
> > If I want to do similar "free" search stuff with Lucene I think I have
> > to create analyzers for the different kinds of source code files, with
> > fields for this and that. Quite an undertaking.
> >
> > Does anyone understand my point here and am I correct in that it would
> > be hard to implement something as "free" as with Glimpse? I am not
> > trying to criticize, just understand how Lucene (and Glimpse) works.
> >
> > Oh, yes, Glimpse has one big drawback: it only supports search strings
> > up to 32 characters.
> >
> > Thanks!
> >
> > /Mathias
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> 

