Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of uwe@thetaphi.de designates
 188.138.97.18 as permitted sender)
From: "Uwe Schindler" <uwe@thetaphi.de>
To: <java-user@lucene.apache.org>
References: 
 <CABrcCQ5s4EgLhyaQrgsA32jGnkFyXHH1xht4q_y7CHHD3UybRQ@mail.gmail.com>
 <CFE5807412604483838C8EAAB3DCDCE7@JackKrupansky>
 <CABrcCQ6xEJFnjSvM6rGKhDZv2O2xi87bWASC0Ef76LaeOPhO-w@mail.gmail.com>
In-Reply-To: 
 <CABrcCQ6xEJFnjSvM6rGKhDZv2O2xi87bWASC0Ef76LaeOPhO-w@mail.gmail.com>
Subject: RE: Lucene vs Glimpse
Date: Tue, 5 Feb 2013 10:43:03 +0100
Message-ID: <03f501ce0385$31f7cc20$95e76460$@thetaphi.de>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="utf-8"
Content-Transfer-Encoding: quoted-printable
thread-index: AQH1Je1StCzbDAQRcbmMUAReqb07cQIcg6KvAYjLVWqX/5J2wA==
Content-Language: de

Glimpse seems to use something similar to StandardAnalyzer, so I would
give it a try. For program code this should work quite well. To make the
"auto-phrases" work (which might be a good idea here, too), enable this
feature in the query parser (I am referring to Jack's comment about
auto-phrases).
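
To illustrate what auto-phrases mean here, the following is a minimal
sketch in plain Java, not Lucene code. It assumes a simplified analyzer
that lowercases and splits at every character that is not a letter or
digit; the class AutoPhraseSketch and its methods analyze and toQuery are
hypothetical names. How a real analyzer treats '.' and '_' depends on the
analyzer and its version, so that is worth checking against your own
files. In Lucene's classic QueryParser the corresponding switch should be
setAutoGeneratePhraseQueries(true).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative sketch only: none of this is Lucene code. */
class AutoPhraseSketch {

    /** Simplified stand-in for an analyzer: lowercase, split at non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    /**
     * Describes the query built for one whitespace-separated chunk of user input.
     * With autoPhrase on, a chunk that analyzes to several terms becomes a phrase;
     * with it off, this sketch simply ORs the terms together.
     */
    static String toQuery(String chunk, boolean autoPhrase) {
        List<String> terms = analyze(chunk);
        if (terms.isEmpty()) {
            return "";
        }
        if (terms.size() == 1) {
            return terms.get(0);
        }
        if (autoPhrase) {
            return "\"" + String.join(" ", terms) + "\"";
        }
        return String.join(" OR ", terms);
    }
}
```

With auto-phrases on, a user who types foo.bar.baz only gets files where
the three terms occur next to each other and in that order, which is
close to what a literal search would return.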

You don’t really need to worry about fields, either. The general
approach for this type of search is:
- Create one field (indexed+stored) with the document ID (e.g. file name)
- One field (stored) with the document title (if applicable)
- One analyzed-only field (no storing needed, unless you want
  highlighting) called "content" that gets the whole text of your
  program code


After that you can query the Lucene index using a correctly configured
query parser with default field "content", analyzer=StandardAnalyzer
and auto-phrases enabled. The stored fields are only needed to present
the search results; they are just the metadata you display to the user
after the search.
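
The layout and query flow described above can be sketched with a tiny
in-memory index in plain Java. This is an illustration under stated
assumptions, not Lucene code: TinyCodeIndex and its methods are
hypothetical names, the analyzer is a simplified one that splits at every
non-alphanumeric character, and every query is treated as a single
phrase. The ID and title are only stored, while "content" is only
inverted (term -> document -> positions) and the raw text is thrown away.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative in-memory sketch only: none of these types are Lucene classes. */
class TinyCodeIndex {

    /** Stored metadata (docNo -> {id, title}), kept only to present hits. */
    private final List<String[]> stored = new ArrayList<>();

    /** Inverted "content" field: term -> docNo -> positions. No text is stored. */
    private final Map<String, Map<Integer, List<Integer>>> content = new HashMap<>();

    /** Simplified stand-in for an analyzer: lowercase, split at non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    /** Adds one file: id and title are stored, the text is analyzed and inverted. */
    void add(String id, String title, String text) {
        int docNo = stored.size();
        stored.add(new String[] {id, title});
        List<String> terms = analyze(text);
        for (int pos = 0; pos < terms.size(); pos++) {
            content.computeIfAbsent(terms.get(pos), t -> new HashMap<>())
                   .computeIfAbsent(docNo, d -> new ArrayList<>())
                   .add(pos);
        }
    }

    /** Analyzes the query, matches its terms as one phrase, returns stored ids of hits. */
    List<String> search(String query) {
        List<String> terms = analyze(query);
        List<String> hits = new ArrayList<>();
        if (terms.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first =
                content.getOrDefault(terms.get(0), Collections.emptyMap());
        for (int docNo = 0; docNo < stored.size(); docNo++) {
            List<Integer> starts = first.get(docNo);
            if (starts == null) {
                continue;
            }
            for (int start : starts) {
                if (phraseAt(terms, docNo, start)) {
                    hits.add(stored.get(docNo)[0]);
                    break;
                }
            }
        }
        return hits;
    }

    /** True if terms 1..n-1 follow the first term at consecutive positions. */
    private boolean phraseAt(List<String> terms, int docNo, int start) {
        for (int i = 1; i < terms.size(); i++) {
            Map<Integer, List<Integer>> postings = content.get(terms.get(i));
            if (postings == null) {
                return false;
            }
            List<Integer> positions = postings.get(docNo);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions a file containing the xerces import line from
the quoted mail is found by searching for xerces, and a file containing
My_Nice_API.Some_Method is found by either My_Nice_API or the full
string, without any language-specific fields.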

That's all you need, so you should give it a try! Your issue was just a
configuration issue. That’s quite a common use case. Maybe you should
buy the book "Lucene in Action, 2nd edition" to learn more about correct
text analysis and about common techniques for indexing your data.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Mathias Dahl [mailto:mathias.dahl@gmail.com]
> Sent: Tuesday, February 05, 2013 10:26 AM
> To: java-user
> Subject: Re: Lucene vs Glimpse
>
> Jack,
>
> What you say sounds hopeful, but it also sounds like quite some work to
> define/select the correct analyzer for each type of programming language
> (we use SQL, PL/SQL, Java and C# mainly). Compared to what I do now,
> which is just to throw all files at Glimpse and it makes them searchable in a
> very good way (it sounds like I am trying to sell Glimpse here or trying to bash
> Lucene, but that is not my intent).
>
> What got me started thinking about this is that I got different query results
> for the same files using the Lucene demo examples and Glimpse.
> To be specific, it was this piece of code that Lucene did not find for
> me:
>
> ...
> import com.sun.org.apache.xerces.internal.parsers.DOMParser;
> ...
>
> Using Glimpse I get a hit on a file with that content by searching for "xerces".
> With Lucene I did not. So I changed the example code to use the
> ClassicAnalyzer which I interpreted as doing what I wanted (i.e.
> "split" at punctuation). That did not work either (I also changed the analyzer
> in the search example program). I am sure it is possible to make the above
> work, but then I started thinking that if the above should work, will I get a
> match for a string like "someObjectInstance.someMethod()"? If I understand
> it correctly the way to support searches like that is to really try to parse the
> Java language and put the necessary information in special "fields" in the
> index. But things kind of start to grow here, if you think about what kind of
> searches people want to do (people do not want to think, I have noticed,
> they want to search like they do on Google, and I cannot even teach my
> developer colleagues to use regexps...) I would need to have separate
> analyzers (I guess) for different languages and take all these small details,
> when it comes to how people want to search, into account.
>
> Or is there some other clever way to do what I want? I was thinking that
> maybe I could do what Glimpse does on a high level (described here, btw:
> http://webglimpse.net/pubs/glimpse.pdf), and do some kind of combination
> of an index search and a search through the files.
>
> I hope this made things at least a little bit clearer ;) Again, I am seeing it from
> the perspective of a Glimpse user where the searches most people use "just
> work" (but due to licensing I don't think we can continue to use it).
>
> Thanks!
>
> /Mathias
>
> On Mon, Feb 4, 2013 at 9:31 PM, Jack Krupansky
> <jack@basetechnology.com> wrote:
> > Generally, all of your example queries should work fine with Lucene,
> > provided that you carefully choose your analyzer, or even use the
> > StandardAnalyzer. The special characters like underscore and dot
> > generally get treated as spaces and the resulting sequence of terms
> > would match as a phrase. It won't be a 100% solution, but it should do
> > reasonably well.
> >
> > Is there a query that was failing to match reasonably for you?
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Mathias Dahl
> > Sent: Monday, February 04, 2013 1:01 PM
> > To: java-user@lucene.apache.org
> > Subject: Lucene vs Glimpse
> >
> >
> > Hi,
> >
> > I have hacked together a small web front end to the Glimpse text
> > indexing engine (see http://webglimpse.net/ for information). I am
> > very happy with how Glimpse indexes and searches data. If I understand
> > it correctly it uses a combination of an index and searching directly
> > in the files themselves as grep or other tools. The problem is that I
> > discovered it is not open source and now that I want to extend the use
> > from private to company wide I will run into license problems/costs.
> >
> > So, I decided to try out Lucene. I tried the examples and changed them
> > a bit to use another analyzer. But when I started to think about it I
> > realized that I will not be able to build something like Glimpse. At
> > least not easily.
> >
> > Why? I will try to explain:
> >
> > As stated above, Glimpse uses a combination of index and in-file
> > search. This makes it very powerful in the sense that I can get hits
> > for things that are not necessarily being indexed as terms. Let's say
> > I have a file with this content:
> >
> > ...
> > import foo.bar.baz;
> > ...
> >
> > With Glimpse, and without telling it how to index the content I can
> > find the above file using a search string like "foo" or "bar" but
> > also, and this is important, using foo.bar.baz.
> >
> > Another example:
> >
> > We have a lot of PL/SQL source code, and often you can find code like this:
> >
> > ...
> > My_Nice_API.Some_Method
> > ...
> >
> > Here too, Glimpse is almost magic since it combines index and normal
> > search. I can find the file above using "My_Nice_API" or
> > "My_Nice_API.Some_Method".
> >
> > In a sense I can have the cake and eat it too.
> >
> > If I want to do similar "free" search stuff with Lucene I think I have
> > to create analyzers for the different kinds of source code files, with
> > fields for this and that. Quite an undertaking.
> >
> > Does anyone understand my point here and am I correct in that it would
> > be hard to implement something as "free" as with Glimpse? I am not
> > trying to criticize, just understand how Lucene (and Glimpse) works.
> >
> > Oh, yes, Glimpse has one big drawback: it only supports search strings
> > up to 32 characters.
> >
> > Thanks!
> >
> > /Mathias
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org