Date: Tue, 21 Jul 2015 10:10:52 +0100
Subject: Re: Analyzer for supporting hyphenated words
From: Alessandro Benedetti
To: java-user@lucene.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d0435c012f1f0fc051b5f074c

--f46d0435c012f1f0fc051b5f074c
Content-Type: text/plain; charset=UTF-8

Hi Diego,

let me try to help.

I find this a little confusing:

"For our customer it is important to find the word
- *wi-fi* by wi, *fi*, wifi, wi-fi
- jean-pierre by jean, pierre, jean-pierre, jean-*"

But:

"The (exact) query "*FD-A320-REC-SIM-1*" returns
FD-A320-REC-SIM-1
MIA-*FD-A320-REC-SIM-1*
SIN-FD-A320-REC-SIM-1

for our customer this is wrong because this exact phrase match
query should only return the single entry FD-A320-REC-SIM-1"

Note that the suffix "fi" in the first example is comparable to the
suffix "FD-A320-REC-SIM-1" in the second: if wi-fi must be found by "fi",
then MIA-FD-A320-REC-SIM-1 is found by "FD-A320-REC-SIM-1" for the same
reason.

To clarify your requirement: do you want the user to be able to surround
the query with "" to run it as a phrase query against a NOT tokenized
phrase?
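Because by default, a phrase query is tokenized like any other query, but
term positions affect the matching!

If I have identified your requirement correctly, we can think about a
solution!

Cheers

2015-07-17 9:41 GMT+01:00 Diego Socaceti:

> Hi all,
>
> i'm new to lucene and tried to write my own analyzer to support
> hyphenated words like wi-fi, jean-pierre, etc.
> For our customer it is important to find the word
> - wi-fi by wi, fi, wifi, wi-fi
> - jean-pierre by jean, pierre, jean-pierre, jean-*
>
>
> The analyzer:
>
> // Imports assume the Lucene 5.x package layout.
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.charfilter.MappingCharFilter;
> import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopAnalyzer;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.core.WhitespaceTokenizer;
> import org.apache.lucene.analysis.miscellaneous.LengthFilter;
> import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
>
> public class SupportHyphenatedWordsAnalyzer extends Analyzer {
>
>   protected NormalizeCharMap charConvertMap;
>
>   public SupportHyphenatedWordsAnalyzer() {
>     initCharConvertMap();
>   }
>
>   // Strip double quotes from the input before it is tokenized.
>   protected void initCharConvertMap() {
>     NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
>     builder.add("\"", "");
>     charConvertMap = builder.build();
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(final String fieldName) {
>
>     // Split on whitespace only, so a hyphenated word arrives as one token.
>     final Tokenizer src = new WhitespaceTokenizer();
>
>     // Emit the original token, its word and number parts,
>     // and the catenated word parts (wi-fi -> wi-fi, wi, fi, wifi).
>     TokenStream tok = new WordDelimiterFilter(src,
>         WordDelimiterFilter.PRESERVE_ORIGINAL
>             | WordDelimiterFilter.GENERATE_WORD_PARTS
>             | WordDelimiterFilter.GENERATE_NUMBER_PARTS
>             | WordDelimiterFilter.CATENATE_WORDS,
>         null);
>     tok = new LowerCaseFilter(tok);
>     tok = new LengthFilter(tok, 1, 255);
>     tok = new StopFilter(tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>
>     return new TokenStreamComponents(src, tok);
>   }
>
>   @Override
>   protected Reader initReader(String fieldName, Reader reader) {
>     return new MappingCharFilter(charConvertMap, reader);
>   }
> }
>
>
> The analyzer seems to work except for exact phrase match queries.
>
> e.g.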
> the following words are indexed
>
> FD-A320-REC-SIM-1
> FD-A320-REC-SIM-10
> FD-A320-REC-SIM-11
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
>
> The (exact) query "FD-A320-REC-SIM-1" returns
> FD-A320-REC-SIM-1
> MIA-FD-A320-REC-SIM-1
> SIN-FD-A320-REC-SIM-1
>
> for our customer this is wrong because this exact phrase match
> query should only return the single entry FD-A320-REC-SIM-1
>
> Do you have any ideas or tips, how we have to change our current
> analyzer to support this requirement???
>
>
> Thanks and Kind regards
> Diego
>

--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

--f46d0435c012f1f0fc051b5f074c--
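
The point in the reply about phrase queries being tokenized, with term positions deciding the match, can be illustrated with a small sketch. This is plain Java and does not call Lucene: the class `PhraseMatchSketch` and its methods `tokenize`, `phraseMatches` and `exactMatches` are hypothetical names, and the tokenizer here (split on whitespace and hyphens, then lowercase) is only a simplified stand-in for the analyzer in the quoted message. Under those assumptions it reproduces the reported behaviour: the phrase matches wherever its tokens occur at consecutive positions, so the MIA- and SIN- entries match too, while the -10 and -11 entries do not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PhraseMatchSketch {

    // Simplified stand-in for the analyzer (an assumption, not Lucene code):
    // split on whitespace and hyphens, then lowercase each part.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Phrase semantics: the query tokens must occur at consecutive
    // positions somewhere in the document's token list.
    static boolean phraseMatches(String document, String phrase) {
        List<String> docTokens = tokenize(document);
        List<String> queryTokens = tokenize(phrase);
        if (queryTokens.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(docTokens, queryTokens) >= 0;
    }

    // "Exact" semantics as the customer seems to expect them: the whole
    // value is compared as one untokenized term (case-insensitive here).
    static boolean exactMatches(String document, String query) {
        return document.equalsIgnoreCase(query);
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList(
                "FD-A320-REC-SIM-1",
                "FD-A320-REC-SIM-10",
                "FD-A320-REC-SIM-11",
                "MIA-FD-A320-REC-SIM-1",
                "SIN-FD-A320-REC-SIM-1");
        String query = "FD-A320-REC-SIM-1";
        for (String doc : indexed) {
            System.out.printf("%-24s phrase=%-5b exact=%b%n",
                    doc, phraseMatches(doc, query), exactMatches(doc, query));
        }
    }
}
```

In Lucene terms, the `exactMatches` behaviour would roughly correspond to querying a copy of the value that is indexed without tokenization (for example through `KeywordTokenizer` or a `StringField`). That mapping is an assumption about one possible direction, not something the thread itself settles.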