From: Otis Gospodnetic
To: Lucene Users List
Subject: Re: Tags Screwing up Searches
Date: Mon, 21 Oct 2002 15:55:52 -0700 (PDT)
Message-ID: <20021021225552.70472.qmail@web12707.mail.yahoo.com>
In-Reply-To: <020201c2794a$e1016120$0201a8c0@netframe.com>

Yes, with StandardAnalyzer, I believe, since neither < nor > is an
alphabetical character.

Otis

--- Terry Steichen wrote:
> How should this be done (the translation, that is)? If it were left
> as '<' and '>', would Lucene parse it properly?
>
> Terry
>
> ----- Original Message -----
> From: "Otis Gospodnetic"
> To: "Lucene Users List"
> Sent: Monday, October 21, 2002 5:40 PM
> Subject: Re: Tags Screwing up Searches
>
>
> > Thanks for the update.
> > This all sounds right (no bugs). The problem is the code that you
> > have that translates those < and > characters.
> >
> > Otis
> >
> > --- Terry Steichen wrote:
> > > Otis,
> > >
> > > I discovered that the actual text I was dealing with had already
> > > had the '<' converted to '&lt;', and so forth. So the problem is
> > > that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
> > > Lucene recognizes the trailing semicolon ';' as a word separator,
> > > so it can find the term 'college', but it does not see the ending
> > > of 'soccer'. I did confirm that it *will* match on 'soccer&lt'
> > > just fine.
> > >
> > > I've proceeded to add a string substitution method which replaces
> > > '&lt;' with '    ' (four spaces, in order to hopefully keep the
> > > offsets straight). It appears to work, though I believe it slows
> > > down the indexing.
> > >
> > > I don't know enough about the inner design of Lucene to figure
> > > this out, but it seems logical that there would be a much more
> > > efficient way to handle this than string operations.
> > >
> > > Anyway, thought I'd bring you up to date.
> > >
> > > Regards,
> > >
> > > Terry
> > >
> > > PS: I've had no responses from the list, so perhaps this is a
> > > unique problem and doesn't justify a formal fix effort.
> > >
> > > ----- Original Message -----
> > > From: "Terry Steichen"
> > > To: "Lucene Users Group"
> > > Sent: Friday, October 18, 2002 11:39 AM
> > > Subject: Tags Screwing up Searches
> > >
> > >
> > > Some content I'm indexing contains certain HTML tags. What I find
> > > is that when a term I'm searching for touches one of these tags
> > > (which is fairly typical), the term isn't recognized and the
> > > search fails. For example, <b>College Soccer</b> doesn't match on
> > > either "college" or "soccer". I seem to recall someone else
> > > bringing up a similar problem with a word that ends a sentence
> > > (and is thus treated as if the period were part of the word), but
> > > I don't recall what the response was and I can't find that thread.
> > >
> > > Does anyone have some ideas on the best way to handle this?
> > > Filter out the tags in the process of creating the Document for
> > > indexing? Or through a modification to the Analyzer (I'm using
> > > the StandardAnalyzer)? Or something else?
> > >
> > > TIA,
> > >
> > > Terry
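
The string-substitution workaround described in the thread above (blanking
out tags with spaces so that character offsets stay put) could be sketched
roughly as follows. This is only an illustration in plain Java with no Lucene
dependency, not Terry's actual code: the class name EscapedTagBlanker, the
method blankTags, and the regular expression are all assumptions made up for
this example. It treats both raw tags and entity-escaped tags as text to be
replaced by an equal number of spaces before the string is handed to the
analyzer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: blanks out tags before indexing.
public class EscapedTagBlanker {

    // Assumed pattern: an entity-escaped tag such as &lt;b&gt; or &lt;/b&gt;,
    // or a raw tag such as <b> or </b>. A tag must start with a letter
    // (optionally preceded by '/'), so "a < b" is left alone.
    private static final Pattern TAG = Pattern.compile(
        "&lt;/?[A-Za-z].*?&gt;|</?[A-Za-z][^<>]*>");

    /** Replaces every matched tag with the same number of spaces. */
    public static String blankTags(String text) {
        Matcher m = TAG.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(text, last, m.start());
            // Pad with one space per tag character to preserve offsets.
            for (int i = m.start(); i < m.end(); i++) {
                out.append(' ');
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String cleaned = blankTags(raw);
        // Brackets make the padding visible when printed.
        System.out.println("[" + cleaned + "]");
    }
}
```

Because the output has the same length as the input, any offsets computed
on the cleaned text still point at the same characters in the original. The
single regex pass also avoids chaining several separate string replacements,
though whether that matters for indexing speed would need to be measured.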