From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Mon, 31 Mar 2008 10:54:52 -0400
Subject: Re: Tokenize on another character

Much clearer. Here's what I'd try. Index each category as its own
UN_TOKENIZED value of the same field, as follows for METAL MAN
(rough code against the Lucene 2.x API...):

    Document doc = new Document();
    doc.add(new Field("category", "GUITAR", Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("category", "ROCK", Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("category", "ROCK AND ROLL", Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("category", "METAL", Field.Store.NO, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);  // writer is an already-open IndexWriter

And similar for NOISE. Now, when you search for ROCK you should only get
METAL MAN.

I think you can compose your own analyzer that chains some filters
together to handle, say, lowercasing, removing punctuation, etc. Be sure
you use the same analyzer for your query parsing. Note that you can use
PerFieldAnalyzerWrapper to use a different analyzer for different fields
if that's necessary...

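To make the expected behaviour concrete, here is a small plain-Java sketch. It does not use Lucene at all, and the names KeywordExactMatch, terms and search are made up for illustration. It splits a keyword string on commas only and matches whole keywords, which is roughly the result that indexing each category UN_TOKENIZED should give:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; nothing in this class is Lucene API.
class KeywordExactMatch {

    // Split the concatenated keyword string on commas only, so a
    // multi-word keyword such as "ROCK AND ROLL" stays a single term.
    static Set<String> terms(String keywords) {
        Set<String> out = new LinkedHashSet<>();
        for (String raw : keywords.split(",")) {
            String term = raw.trim().toLowerCase(Locale.ROOT);
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    // Return the track titles whose keyword set contains the query
    // as a whole term (normalized the same way as the indexed terms).
    static List<String> search(Map<String, String> tracks, String query) {
        String wanted = query.trim().toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : tracks.entrySet()) {
            if (terms(e.getValue()).contains(wanted)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> tracks = new LinkedHashMap<>();
        tracks.put("METAL MAN", "GUITAR, ROCK, ROCK AND ROLL, METAL,");
        tracks.put("NOISE", "GUITAR, ROCK AND ROLL, METAL,");
        System.out.println(search(tracks, "ROCK"));
        System.out.println(search(tracks, "rock and roll"));
    }
}
```

Running main prints [METAL MAN] for ROCK and [METAL MAN, NOISE] for "rock and roll". The query is trimmed and lowercased exactly like the stored keywords, which is the same point as using one analyzer for both indexing and query parsing.
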
Best
Erick

On Mon, Mar 31, 2008 at 9:40 AM, Fiaz Khan wrote:

> Thanks Erick....
>
> Ok,..
>
> I have a track called METAL MAN, this has 4 categories assigned to it like
> so:
>
> GUITAR
> ROCK
> ROCK AND ROLL
> METAL
>
> I have another track called NOISE with the following 3 categories:
>
> GUITAR
> ROCK AND ROLL
> METAL
>
> When a user searches using the keyword ROCK, it is finding both when
> really it should only find METAL MAN.
>
> The reason for this is that, when I created the Lucene index, I
> concatenated all categories into one field called KEYWORDS like so:
> METAL MAN keywords are: GUITAR, ROCK, ROCK AND ROLL, METAL,
> NOISE keywords are: GUITAR, ROCK AND ROLL, METAL,
>
> So therefore, when tokenized using the StandardAnalyzer, they end up as
> METAL MAN keywords are: GUITAR, ROCK, ROCK, ROLL, METAL,
> NOISE keywords are: GUITAR, ROCK, ROLL, METAL,
> i.e. they are tokenized on the space char and words like AND are removed.
>
> What I would like is for the tokenizer to split on the comma and leave
> each keyword as is and not, for example, turn ROCK AND ROLL into ROCK,
> ROLL
>
> My attempts so far were to:
> Replace spaces with "another" char, e.g. ~.
> Strip spaces from the keyword. This broke the rest of the search
> engine, which doesn't need to work like this.
> Un-tokenize. The issue with this was that I could no longer do a partial
> string search, as the word ROCK was being picked up in the word FROCK.
>
> Am I going about this the wrong way?
> Caveat is that I am using the .net version (2.1). Hopefully it is possible
> with this version.
>
> Hope this explains it a bit better.
>
> On Mon, Mar 31, 2008 at 2:05 PM, Erick Erickson wrote:
> > I'm confused on the use case you're trying to implement,
> > could you add a bit more explanation?
> >
> > In particular, do you ever want ROCK to match
> > ROCK AND ROLL?
> > If you want both (that is, some searches match partial
> > keywords and some match entire keywords), I recommend
> > you create a second field in your document, KEYWORD_EXACT
> > or some such, and index it UN_TOKENIZED (storage is
> > optional). Also, you can index the KEYWORD field
> > as TOKENIZED. Then, when you want to match exactly,
> > you search against KEYWORD_EXACT; when you want to search
> > on any piece, search KEYWORD.
> >
> > If this is completely off base, could you post the use-cases
> > you're interested in?
> >
> > Best
> > Erick
> >
> >
> > On Mon, Mar 31, 2008 at 5:42 AM, wrote:
> >
> > > Hello
> > >
> > > I just joined the list and need some help.
> > >
> > > I have a database of music tracks. These tracks have been added to an
> > > index. They are classified using keywords, so a track can have up to
> > > 20 keywords assigned to it. I took the keywords and created a
> > > "keyword" FIELD which was not stored and tokenized. The problem is
> > > this... if a user searches for a specific keyword such as "ROCK", it
> > > is finding ROCK AND ROLL tracks as well as ROCK tracks. I realise
> > > this is due to the tokenization of the keyword FIELD. My question is
> > > this: how can I stop the analyser from tokenizing on the space
> > > character and instead tokenize on one I specify? That way, if I chose
> > > to tokenize on a comma, I could add a comma at the end of every
> > > keyword. Or have I gone about this the wrong way?
> > >
> > > Many thanks, any insight will be appreciated.
> > >
> > > Fiaz
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org