Subject: RE: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Date: Thu, 20 Jan 2005 14:25:46 -0800
From: "Chuck Williams"
To: "Lucene Users List", "jian chen"

Like any other field, A.I. is only elusive until you master it. There are plenty of companies using A.I. techniques in various IR applications successfully.
LSI in particular has been around a long time and is well understood.

Chuck

> -----Original Message-----
> From: jian chen [mailto:chenjian1227@gmail.com]
> Sent: Thursday, January 20, 2005 2:10 PM
> To: Lucene Users List
> Subject: Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
>
> Hi,
>
> One thing to point out. I think Lucene is not using LSI as the
> underlying retrieval model. It uses vector space model and also
> proximity based retrieval.
>
> Personally, I don't know much about LSI and I don't think the fancy
> stuff like LSI is workable in industry. I believe we are far away from
> the era of artificial intelligence and using any elusive way to do
> information retrieval.
>
> Cheers,
>
> Jian
>
>
> On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore wrote:
> > Hi .. I'm new to the list so forgive a dumb question or two as I get
> > started.
> >
> > We're in the midst of converting a small collection (1200-1500
> > currently) of scientific literature to be easily searchable/navigable.
> > We'll likely provide both a text query interface as well as a
> > graphical way to search and discover.
> >
> > Our initial approach will be vector based, looking at Latent Semantic
> > Indexing (LSI) as a potential tool, although if that's not needed,
> > we'll stop at reasonably simple stemming with a weighted document term
> > matrix (DTM). (Bear in mind I couldn't even pronounce most of these
> > concepts last week, so go easy if I'm incoherent!)
> >
> > It looks to me that Lucene has a quite well factored architecture. I
> > should at the very least be able to use the analyzer and stemmer to
> > create a good starting point in the project. I'd also like to leave a
> > nice architecture behind in case we or others end up experimenting
> > with, or extending, the system.
> >
> > So a couple of questions:
> >
> > 1 - I'm a bit concerned that reasonable stemming (Porter/Snowball)
> > apparently produces non-word stems .. i.e.
> > not really human readable.
> > (Example: generate, generates, generated, generating -> generat)
> > Although in typical queries this is not important because the result
> > of the search is a document list, it *would* be important if we use
> > the stems within a graphical navigation interface.
> > So the question is: Is there a way to have the stemmer produce
> > English base forms of the words being stemmed?
> >
> > 2 - We're probably using Lucene in ways it was not designed for, such
> > as DTM/LSI and graphical clustering and navigation. Naturally we'll
> > provide code for these parts that are not in Lucene.
> > But the question arises: is this kinda dumb?! Has anyone stretched
> > Lucene's design center with positive results? Are we barking up the
> > wrong tree?
> >
> > 3 - A nit on hyphenation: Our collection is scientific so has many
> > hyphenated words. I'm wondering about your experiences with
> > hyphenation. In our collection, things like self-organization,
> > power-law, space-time, small-world, agent-based, etc. occur often, for
> > example.
> > So the question is: Do folks break up hyphenated words? If not, do
> > you stem the parts and glue them back together? Do you apply
> > stoplists to the parts?
> >
> > Thanks for any help and pointers you can fling along,
> >
> > Owen http://backspaces.net/ http://redfish.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org