Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <359a92830803310754g530a3a64t4ebace2ad249e8f7@mail.gmail.com>
Date: Mon, 31 Mar 2008 10:54:52 -0400
From: "Erick Erickson" <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Tokenize on another character
In-Reply-To: <2f9136240803310640w211fb2d1jb56e926ca05205ab@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_10321_12578555.1206975292333"
References: 
 <20080331094306.CRMZ17393.aamtaout02-winn.ispmail.ntl.com@smtp.ntlworld.com>
	 <359a92830803310605q72801d05w73e3b4a76a0a38d4@mail.gmail.com>
	 <2f9136240803310640w211fb2d1jb56e926ca05205ab@mail.gmail.com>

------=_Part_10321_12578555.1206975292333
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Much clearer. Here's what I'd try.
Index UN_TOKENIZED as follows:

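for METAL MAN (sketch; "writer" is assumed to be an already-open IndexWriter)
Document doc = new Document();
// One un-tokenized field instance per category, all under the same field name.
doc.add(new Field("category", "GUITAR", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.add(new Field("category", "ROCK", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.add(new Field("category", "ROCK AND ROLL", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.add(new Field("category", "METAL", Field.Store.NO, Field.Index.UN_TOKENIZED));
writer.addDocument(doc);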


And similar for NOISE.

Now, when you search for ROCK you should only get METAL MAN, since
NOISE has no category that is exactly ROCK.
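
To make the difference concrete, here is a minimal plain-Java sketch. It
does not use any Lucene classes; CategoryIndexSketch, addExact and
addTokenized are names made up for illustration. It only models the idea:
when each category is stored as one exact term, only METAL MAN has the
term ROCK, whereas splitting on spaces and dropping AND makes both tracks
match.

```java
import java.util.*;

// Plain-Java model (NOT the Lucene API) of exact vs. tokenized category terms.
class CategoryIndexSketch {
    // term -> titles of the tracks that contain that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Un-tokenized: the whole category string is indexed as a single term.
    void addExact(String title, List<String> categories) {
        for (String c : categories) {
            postings.computeIfAbsent(c, k -> new TreeSet<>()).add(title);
        }
    }

    // Tokenized: split on whitespace and drop the stop word AND, which is
    // roughly what happened to the concatenated KEYWORDS field.
    void addTokenized(String title, List<String> categories) {
        for (String c : categories) {
            for (String tok : c.split("\\s+")) {
                if (!tok.equals("AND")) {
                    postings.computeIfAbsent(tok, k -> new TreeSet<>()).add(title);
                }
            }
        }
    }

    // Exact term lookup; returns an empty set when the term is unknown.
    Set<String> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> metalMan = Arrays.asList("GUITAR", "ROCK", "ROCK AND ROLL", "METAL");
        List<String> noise = Arrays.asList("GUITAR", "ROCK AND ROLL", "METAL");

        CategoryIndexSketch exact = new CategoryIndexSketch();
        exact.addExact("METAL MAN", metalMan);
        exact.addExact("NOISE", noise);

        CategoryIndexSketch tokenized = new CategoryIndexSketch();
        tokenized.addTokenized("METAL MAN", metalMan);
        tokenized.addTokenized("NOISE", noise);

        System.out.println(exact.search("ROCK"));          // [METAL MAN]
        System.out.println(tokenized.search("ROCK"));      // [METAL MAN, NOISE]
        System.out.println(exact.search("ROCK AND ROLL")); // [METAL MAN, NOISE]
    }
}
```

The trade-off is the one you already ran into: with exact terms, ROCK no
longer matches ROCK AND ROLL at all, so partial matching would need a
separate tokenized field.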

I think you can compose your own analyzer that chains some filters
together to handle, say, lowercasing, removing punctuation, etc.

Be sure you use the same analyzer for your query parsing. Note
that you can use PerFieldAnalyzerWrapper to use a different
analyzer for different fields if that's necessary...
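
As a rough illustration of the "chain some filters" idea, and of why the
same processing has to run at index time and at query time, here is a
plain-Java sketch. It is not a Lucene Analyzer; KeywordNormalizerSketch,
normalize and terms are hypothetical names. It splits a keyword list on
commas, then lowercases and strips punctuation while keeping each keyword
as one term.

```java
import java.util.*;

// Plain-Java stand-in (NOT Lucene classes) for a small normalizing filter chain.
class KeywordNormalizerSketch {
    // Lowercase, turn punctuation into spaces, trim, and collapse runs of
    // whitespace. The keyword stays a single term.
    static String normalize(String keyword) {
        return keyword.toLowerCase(Locale.ROOT)
                      .replaceAll("\\p{Punct}", " ")
                      .trim()
                      .replaceAll("\\s+", " ");
    }

    // Split a comma-separated keyword list and normalize each entry,
    // skipping entries that end up empty (e.g. after a trailing comma).
    static List<String> terms(String commaSeparated) {
        List<String> out = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String t = normalize(part);
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "Index time": the NOISE keywords as they were concatenated.
        List<String> indexed = terms("GUITAR, ROCK AND ROLL, METAL,");
        System.out.println(indexed); // [guitar, rock and roll, metal]

        // "Query time": run the user's input through the same normalize().
        System.out.println(indexed.contains(normalize("Rock and Roll!"))); // true
        System.out.println(indexed.contains(normalize("ROCK")));           // false
    }
}
```

If the query side skipped normalize(), "Rock and Roll!" would not equal
the indexed term, which is the mismatch the same-analyzer advice is meant
to avoid.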

Best
Erick

On Mon, Mar 31, 2008 at 9:40 AM, Fiaz Khan <fiaz.khan@ntlworld.com> wrote:

> Thanks Erick....
>
> Ok,..
>
> I have a track called METAL MAN, this has 4 categories assigned to it like
> so:
>
> GUITAR
> ROCK
> ROCK AND ROLL
> METAL
>
> I have another track called NOISE with the following 3 categories:
>
> GUITAR
> ROCK AND ROLL
> METAL
>
> When a user searches using the keyword ROCK, it is finding both when
> really it should only find METAL MAN.
>
> The reason for this is because... when i created the lucene index, i
> concatenated all categories into one field called KEYWORDS like so:
> METAL MAN keywords are: GUITAR, ROCK, ROCK AND ROLL, METAL,
> NOISE keywords are: GUITAR, ROCK AND ROLL, METAL,
>
> So therefore, when tokenized using the StandardAnalyzer, they end up as
> METAL MAN keywords are: GUITAR, ROCK, ROCK, ROLL, METAL,
> NOISE keywords are: GUITAR, ROCK, ROLL, METAL,
> i.e. they are tokenized on the space char and words like AND are removed.
>
> What i would like is for the tokenizer to split on the comma and leave
> each keyword as is and not, for example, turn ROCK AND ROLL into ROCK,
> ROLL
>
> My attempts so far were to replace spaces with "another" char, e.g. ~.
> Strip spaces from the keyword. This broke the rest of the search
> engine, which doesn't need to work like this.
> Un-tokenize: the issue with this was that i could no longer do a partial
> string search, as the word ROCK was being picked up in the word FROCK.
>
> Am i going about this the wrong way?
> Caveat is that i am using .net version (2.1). Hopefully it is possible
> with this version.
>
> Hope this explains it a bit better.
>
> On Mon, Mar 31, 2008 at 2:05 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
> > I'm confused on the use case you're trying to implement,
> >  could you add a bit more explanation?
> >
> >  In particular, do you ever want ROCK to match
> >  ROCK AND ROLL? If you want both, that is
> >  some searches match partial keywords and some
> >  match entire keywords, I recommend you create a
> >  second field in your document KEYWORD_EXACT or
> >  some such and index it UN_TOKENIZED (storage is
> >  optional). Also, you can index the KEYWORD field
> >  as TOKENIZED. Then, when you want to match exactly,
> >  you search against KEYWORD_EXACT; when you want to search
> >  on any piece, search KEYWORD.
> >
> >  If this is completely off base, could you post the use-cases
> >  you're interested in?
> >
> >  Best
> >  Erick
> >
> >
> >  On Mon, Mar 31, 2008 at 5:42 AM, <fiaz.khan@ntlworld.com> wrote:
> >
> >  > Hello
> >  >
> >  > I just joined the list and need some help.
> >  >
> >  > I have a database of music tracks.These tracks have been added to an
> >  > index. They are classified using keywords, so a track can have up to
> >  > 20 keywords assigned to them. I took the keywords and created a
> >  > "keyword" FIELD which was not stored and tokenized. The problem is
> >  > this... if a user searches for a specific keyword such as "ROCK", it
> >  > is finding as well as ROCK tracks, ROCK AND ROLL tracks. I realise
> >  > this is due to the tokenization of the keyword FIELD. My question is
> >  > this, how can i stop the analyser from tokenizing on the space
> >  > character and instead tokenize on one i specify. That way, if i chose
> >  > to tokenize on a comma, i could add a comma at the end of every
> >  > keyword. Or have i gone about this the wrong way?
> >  >
> >  > Many thanks, any insight will be appreciated.
> >  >
> >  > Fiaz
> >  >
> >  > ---------------------------------------------------------------------
> >  > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >  > For additional commands, e-mail: java-user-help@lucene.apache.org
> >  >
> >  >
> >
>

------=_Part_10321_12578555.1206975292333--