lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Hungarian notation analyzer and phrase queries
Date Wed, 13 Apr 2005 01:01:33 GMT

This question is very similar to a recent/current thread sith the subject
"How to include a multi-word synonym to a word when indexing?" ...

http://www.mail-archive.com/java-user@lucene.apache.org/msg00546.html

...As Erik points out in that thread, when dealing with a dictionary of
"singleword" => ["multi" "word"], and ["multi" "word"] => "singleword"
synonyms a very good/simple approach is to use an analyzer that allways
normalizes down to the single word version (as a single token)

This allows you to leave the position incriment alone, and get Span/Phrase
queries to work just fine.

In your case, if the source text being indexed contains allready contains
only the "SingleWord" forms in "HungarianNotation", you don't need to
start out with a dictionary -- you can build one while indexing using a
TokenFilter (ie: if a token starts with capital, and contains capital,
split on capitals and log to dictionary file).  Then at search time, use a
different TokenFilter that looks up token sequences in your dictionary
and replaces them with the single token version.

If you have mixed usage in the source text being indexed, you may have to
do two passes to build a dictionary, then index.



: Date: Tue, 12 Apr 2005 17:42:23 -0600
: From: Paul Smith <PSmith@tenfold.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: Hungarian notation analyzer and phrase queries
:
: I am writing a document management system for my company, and many of
: our feature names are in Hungarian notation (PowerQuery,
: TransactionManager, etc.). This can make it hard to find some things
: with a default analyzer.
:
: I'd like to be able to index text like "Use PowerQuery for advanced
: searches" and be able to find it with "use power query for advanced
: searches". Note the space between power and query.
:
: I have written a custom analyzer to tokenize PowerQuery into  'power',
: 'query, and 'powerquery' and change the position increment to 0, but I
: don't quite get the desired behavior. The phrase query "use power query
: for advanced searches" does not match, but "use query for advanced
: searches", and "use power for advanced searches" do.
:
: Any ideas?
:
: I noticed that Dave at tropo.com has written a JavaDocAnalyzer that has
: the same problem. Go to searchmorph.com and search for "An instance of
: HashMap has two parameters" and "An instance of Hash Map has two
: parameters"
:
: I realize that with my custom analyzer I can find it without using a
: phrase query, but it would be nice.
:
: Thanks,
: Paul
:
: http://www.tenfold.com
:
: **********************************************************************
: This email and any files transmitted with it are confidential and
: intended solely for the use of the individual or entity to whom they
: are addressed. If you have received this email in error please notify
: the TenFold Postmaster (postmaster@tenfold.com).
: **********************************************************************
:
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message