Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Date: Thu, 20 Jan 2005 14:25:46 -0800
Message-ID: 
 <E3381E0825F1954D953A4E347017BB1C031405CE@reh001-1.REX001.ExchangeByRegister.com>
Thread-Topic: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Thread-Index: AcT/PMFL7Knss+aEQyqzE6ltG3W++AAAhNWA
From: "Chuck Williams" <chuck@manawiz.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>,
	"jian chen" <chenjian1227@gmail.com>

Like any other field, A.I. is only elusive until you master it.  There
are plenty of companies using A.I. techniques in various IR applications
successfully. LSI in particular has been around a long time and is well
understood.

Chuck

  > -----Original Message-----
  > From: jian chen [mailto:chenjian1227@gmail.com]
  > Sent: Thursday, January 20, 2005 2:10 PM
  > To: Lucene Users List
  > Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture,
etc!
  >=20
  > Hi,
  >=20
  > One thing to point out. I think Lucene is not using LSI as the
  > underlying retrieval model. It uses vector space model and also
  > proximity based retrieval.
  >=20
  > Personally, I don't know much about LSI and I don't think the fancy
  > stuff like LSI is workable in industry. I believe we are far away
from
  > the era of artificial intelligence and using any elusive way to do
  > information retrieval.
  >=20
  > Cheers,
  >=20
  > Jian
  >=20
  >=20
  > On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore
<owen@backspaces.net>
  > wrote:
  > > Hi .. I'm new to the list so forgive a dumb question or two as I
get
  > > started.
  > >
  > > We're in the midst of converting a small collection (1200-1500
  > > currently) of scientific literature to be easily
searchable/navigable.
  > > We'll likely provide both a text query interface as well as a
  > graphical
  > > way to search and discover.
  > >
  > > Our initial approach will be vector based, looking at Latent
Semantic
  > > Indexing (LSI) as a potential tool, although if that's not needed,
  > > we'll stop at reasonably simple stemming with a weighted document
term
  > > matrix (DTM).  (Bear in mind I couldn't even pronounce most of
these
  > > concepts last week, so go easy if I'm incoherent!)
  > >
  > > It looks to me that Lucene has a quite well factored architecture.
I
  > > should at the very least be able to use the analyzer and stemmer
to
  > > create a good starting point in the project.  I'd also like to
leave a
  > > nice architecture behind in case we or others end up experimenting
  > > with, or extending, the system.
  > >
  > > So a couple of questions:
  > >
  > > 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
  > > apparently produces non-word stems .. i.e. not really human
readable.
  > > (Example: generate, generates, generated, generating  -> generat)
  > > Although in typical queries this is not important because the
result
  > of
  > > the search is a document list, it *would* be important if we use
the
  > > stems within a graphical navigation interface.
  > >      So the question is: Is there a way to have the stemmer
produce
  > > english
  > >      base forms of the words being stemmed?
  > >
  > > 2 - We're probably using Lucene in ways it was not designed for,
such
  > > as DTM/LSI and graphical clustering and navigation.  Naturally
we'll
  > > provide code for these parts that are not in Lucene.
  > >      But the question arises: is this kinda dumb?!  Has anyone
  > stretched
  > > Lucene's
  > >      design center with positive results?  Are we barking up the
wrong
  > > tree?
  > >
  > > 3 - A nit on hyphenation: Our collection is scientific so has many
  > > hyphenated words.  I'm wondering about your experiences with
  > > hyphenation.  In our collection, things like self-organization,
  > > power-law, space-time, small-world, agent-based, etc. occur often,
for
  > > example.
  > >      So the question is: Do folks break up hyphenated words?  If
not,
  > do
  > > you stem the
  > >      parts and glue them back together?  Do you apply stoplists to
the
  > > parts?
  > >
  > > Thanks for any help and pointers you can fling along,
  > >
  > > Owen    http://backspaces.net/    http://redfish.com/
  > >
  > >
---------------------------------------------------------------------
  > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > > For additional commands, e-mail:
lucene-user-help@jakarta.apache.org
  > >
  > >
  >=20
  >
---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org