lucene-java-user mailing list archives

From Mathias Dahl <mathias.d...@gmail.com>
Subject Re: Lucene vs Glimpse
Date Tue, 05 Feb 2013 09:26:21 GMT
Jack,

What you say sounds hopeful, but it also sounds like quite some work
to define/select the correct analyzer for each type of programming
language (we use SQL, PL/SQL, Java and C# mainly). Compare that to what I
do now, which is just to throw all files at Glimpse, which makes them
searchable in a very good way (it sounds like I am trying to sell
Glimpse here or bash Lucene, but that is not my intent).

What got me started thinking about this is that I got different query
results for the same files using the Lucene demo examples and Glimpse.
To be specific, it was this piece of code that Lucene did not find for
me:

...
import com.sun.org.apache.xerces.internal.parsers.DOMParser;
...

Using Glimpse I get a hit on a file with that content by searching for
"xerces". With Lucene I did not. So I changed the example code to use
the ClassicAnalyzer which I interpreted as doing what I wanted (i.e.
"split" at punctuation). That did not work either (I also changed the
analyzer in the search example program). I am sure it is possible to
make the above work, but then I started thinking that if the above
should work, will I get a match for a string like
"someObjectInstance.someMethod()"? If I understand it correctly the
way to support searches like that is to really try to parse the Java
language and put the necessary information in special "fields" in the
index. But things kind of start to grow here, if you think about what
kind of searches people want to do (people do not want to think, I
have noticed, they want to search like they do on Google, and I cannot
even teach my developer colleagues to use regexps...). I would need to
have separate analyzers (I guess) for different languages and take all
these small details, when it comes to how people want to search, into
account.
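
To make the "split at punctuation" idea concrete, here is a minimal
plain-Java sketch of the tokenization I had in mind. It is not a Lucene
analyzer and does not use any Lucene API; the class name
PunctuationSplitter and the choice to lowercase every token are my own
assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT part of Lucene: illustrates splitting source
// text at every character that is not a letter or a digit.
final class PunctuationSplitter {

    // Returns the lowercased pieces of text found between separators
    // (dots, underscores, parentheses, whitespace, and so on).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(
            "import com.sun.org.apache.xerces.internal.parsers.DOMParser;"));
        System.out.println(tokens("someObjectInstance.someMethod()"));
        System.out.println(tokens("My_Nice_API.Some_Method"));
    }
}
```

With terms produced like this, "xerces" would be a term of its own, and
a query such as "someObjectInstance.someMethod()" would have to be split
the same way and matched as a sequence of terms.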

Or is there some other clever way to do what I want? I was thinking
that maybe I could do what Glimpse does on a high level (described
here, btw: http://webglimpse.net/pubs/glimpse.pdf), and do some kind
of combination of an index search and a search through the files.
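
A rough sketch of that combination, in plain Java, could look like the
following. It is only loosely modelled on the high-level idea (a coarse
index narrows down the candidate files, then a literal scan confirms
the hit) and is not how Glimpse itself is implemented. The class
TwoPhaseSearch, the in-memory map standing in for files on disk, and
the case-insensitive matching are all assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical two-phase search: a token index selects candidate files,
// then a literal substring scan decides which candidates really match.
final class TwoPhaseSearch {
    // File name -> content. A real tool would read the files from disk.
    private final Map<String, String> files = new LinkedHashMap<>();
    // Lowercased token -> names of the files that contain it.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Splits at every non-letter, non-digit character and lowercases.
    private static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    void add(String name, String content) {
        files.put(name, content);
        for (String token : split(content)) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(name);
        }
    }

    List<String> search(String query) {
        // Phase 1: keep only files that contain every token of the query.
        Set<String> candidates = null;
        for (String token : split(query)) {
            Set<String> hits =
                index.getOrDefault(token, Collections.<String>emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(hits);
            } else {
                candidates.retainAll(hits);
            }
        }
        if (candidates == null) {
            // The query had no indexable tokens, so scan every file.
            candidates = new TreeSet<>(files.keySet());
        }
        // Phase 2: scan the candidates for the query as a literal string.
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (String name : candidates) {
            if (files.get(name).toLowerCase(Locale.ROOT).contains(needle)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TwoPhaseSearch s = new TwoPhaseSearch();
        s.add("Parser.java",
              "import com.sun.org.apache.xerces.internal.parsers.DOMParser;");
        s.add("api.sql", "My_Nice_API.Some_Method;");
        System.out.println(s.search("xerces"));
        System.out.println(s.search("My_Nice_API.Some_Method"));
    }
}
```

One limitation of this sketch: the first phase only knows whole tokens,
so a partial word such as "xerc" would find nothing unless the index
lookup itself were made fuzzier.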

I hope this made things at least a little bit clearer ;) Again, I am
seeing it from the perspective of a Glimpse user where the searches
most people use "just work" (but due to licensing I don't think we can
continue to use it).

Thanks!

/Mathias

On Mon, Feb 4, 2013 at 9:31 PM, Jack Krupansky <jack@basetechnology.com> wrote:
> Generally, all of your example queries should work fine with Lucene,
> provided that you carefully choose your analyzer, or even use the
> StandardAnalyzer. The special characters like underscore and dot generally
> get treated as spaces and the resulting sequence of terms would match as a
> phrase. It won't be a 100% solution, but it should do reasonably well.
>
> Is there a query that was failing to match reasonably for you?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Mathias Dahl
> Sent: Monday, February 04, 2013 1:01 PM
> To: java-user@lucene.apache.org
> Subject: Lucene vs Glimpse
>
>
> Hi,
>
> I have hacked together a small web front end to the Glimpse text
> indexing engine (see http://webglimpse.net/ for information). I am
> very happy with how Glimpse indexes and searches data. If I understand
> it correctly it uses a combination of an index and searching directly
> in the files themselves, the way grep and similar tools do. The problem is that I
> discovered it is not open source and now that I want to extend the use
> from private to company wide I will run into license problems/costs.
>
> So, I decided to try out Lucene. I tried the examples and changed them
> a bit to use another analyzer. But when I started to think about it I
> realized that I will not be able to build something like Glimpse. At
> least not easily.
>
> Why? I will try to explain:
>
> As stated above, Glimpse uses a combination of index and in-file
> search. This makes it very powerful in the sense that I can get hits
> for things that are not necessarily indexed as terms. Let's say
> I have a file with this content:
>
> ...
> import foo.bar.baz;
> ...
>
> With Glimpse, and without telling it how to index the content I can
> find the above file using a search string like "foo" or "bar" but
> also, and this is important, using foo.bar.baz.
>
> Another example:
>
> We have a lot of PL/SQL source code, and often you can find code like this:
>
> ...
> My_Nice_API.Some_Method
> ...
>
> Here too, Glimpse is almost magic since it combines index and normal
> search. I can find the file above using "My_Nice_API" or
> "My_Nice_API.Some_Method".
>
> In a sense I can have the cake and eat it too.
>
> If I want to do similar "free" search stuff with Lucene I think I have
> to create analyzers for the different kinds of source code files, with
> fields for this and that. Quite an undertaking.
>
> Does anyone understand my point here and am I correct in that it would
> be hard to implement something as "free" as with Glimpse? I am not
> trying to criticize, just understand how Lucene (and Glimpse) work.
>
> Oh, yes, Glimpse has one big drawback: it only supports search strings
> up to 32 characters.
>
> Thanks!
>
> /Mathias
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>