lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Philip Brown <...@us.ibm.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Sun, 03 Sep 2006 19:05:08 GMT

Just as you, I would PREFER not to change any of the base Lucene code -- and
I imagine there is still some way to do what I want (possibly by extending
some other existing class) with what is already available.  

Regarding point 0) -- You are right in that if I add "test phrase" to index
as UN_TOKENIZED, and then search on "test" or "phrase" individually, it will
not find them (unless they have been added separately by themselves) -- this
behavior is actually desirable in my case -- I am adding single keywords (or
phrases) to the document field (i.e. there is no sentence-type text), and I
want the search to return only results that have the keyword (or phrase)
that I added.  (Although at this point, I'm more concerned about being able
to return results when I search for an UN_TOKENIZED phrase that was added, I
actually would like to "normalize" a phrase (spaces) or a hyphenated word or
an underscored word to the same value -- e.g. MS-WORD or ms_WORd or "MS
Word" --> ms_word.

Regarding point 1) -- If there was a way to add the keywords (including
phrases with spaces) to the index, such that I could search for them using a
query returned by QueryParser.parse(<query_string>), I think this would suit
my needs.  I looked at (and ran) your Scratch example.  I almost think it
would work for my purpose, EXCEPT that, let's say I added 3 documents...

doc.add(new Field("keyword", "hyphenated-word", Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("keyword", "underscored-word", Field.Store.YES,
Field.Index.TOKENIZED));
doc.add(new Field("keyword", "phrase with spaces", Field.Store.YES,
Field.Index.TOKENIZED));

If I add them as TOKENIZED, a search on "phrase" would return 1 hit, which
is NOT really what I want.  I want a hit for "phrase with spaces", but not
"phrase" or "spaces".  


Perhaps you understand my situation a bit more now and could provide some
additional insight.  Basically, whatever is added to the keyword field is
what will be used in the search (plus some other field values).  

Thanks.



Erick Erickson wrote:
> 
> Disclaimer: Of course I'm not as familiar with your problem space as you
> are, so I may be way out in left field, but...
> 
> I *still* think you're making waaaaaay too much work for yourself and need
> to examine your assumptions.
> 
> 0> But when you index something UN_TOKENIZED as in your example, I don't
> think you'll  find the words "phrase" and "test" if you just search for
> them
> individually.
> 
> 1> "doesn't parse apart phrases". Why do you care? The "usual" way of
> handling this is to just go ahead and parse them apart, then submit your
> query with quotes embedded. Would that serve your needs? If not, I bet you
> could make a cool regex that would do this for you and use a
> PatternAnalyzer. If neither of those work, instead of hacking Lucene code,
> make your own tokenizer by overriding one of the Lucene tokenizers. See
> Lucene in Action for example, the section on synonyms as I remember.....
> It'll be waaaaay less work in the long run than having to try to stay in
> synch by hacking the Lucene code. Especially for the next poor soul who
> has
> to maintain it........
> 
> 2> doesn't parse/separate underscores and hyphens. Why not use a
> PatternAnalyzer? You can make it do most anything you want with a clever
> regex.
> 
> 3> the other thing that might be producing unexpected results is that the
> default is OR for QueryParser.......
> 
> Since no amount of documentation replaces a program, I've included one
> illustrating what I'm talking about. It's less likely we're talk past each
> other this way. All it may *really* demonstrate is that I don't understand
> what you're trying to do at all, but at least we'll know <G>...
> 
> Notice in particular that it finds the hypenated words, but doesn't find
> their individual parts. Also,  quoted phrasew are found, even though the
> stop words aren't in the index (examined via luke). I used Eclipse to
> compile/run it......
> 
> Best
> Erick
> 
> package scratch;
> 
> import java.io.IOException;
> import java.util.regex.Pattern;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.StopFilter;
> import org.apache.lucene.index.memory.PatternAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> 
> public class Scratch {
> 
>     private Analyzer analyzer = null;
> 
>     private Analyzer getAnalyzer() {
>         if (analyzer == null) {
>             // Break on whitespace or anything that is NOT hyphen,
> underscore
>             // or letter.
>             Pattern pat = Pattern.compile("(\\s|[^-_a-zA-Z])");
>             analyzer = new PatternAnalyzer(pat, true,
>                     StopFilter.makeStopSet(STOP_WORDS));
>         }
>         return analyzer;
> 
>     }
> 
>     private void makeTestIndex() throws Exception {
>         IndexWriter writer = new IndexWriter("C:/mydir", getAnalyzer(),
> true);
>         String text = ("this is the test text hyphenated-Iamhyphenated
> underscored_IaMunDerScoRed");
>         Document doc = new Document();
>         doc = new Document();
>         doc.add(new Field("test", text, Field.Store.YES,
>                         Field.Index.TOKENIZED));
>         writer.addDocument(doc);
> 
>         writer.close();
>     }
> 
>     private void doSearch(String query, int expectedHits) throws Exception
> {
>         try {
>             QueryParser qp = new QueryParser("test", getAnalyzer());
>             qp.enable_tracing();
>             IndexSearcher srch = new IndexSearcher("C:/mydir");
>             Query tmp = qp.parse(query);
>             // Uncomment to see parsed form of query
>             // System.out.println("Parsed form is '" + tmp.toString() +
> "'");
>             Hits hits = srch.search(tmp);
> 
>             String msg = "";
> 
>             if (hits.length() == expectedHits) {
>                 msg = "Test passed ";
>             } else {
>                 msg = "************TEST FAILED************ ";
>             }
>             System.out.println(msg + "Expected "
>                     + Integer.toString(expectedHits) + " hits, got "
>                     + Integer.toString(hits.length()) + " hits");
> 
>         } catch (IOException e) {
>             System.out.println("Caught IOException");
>             e.printStackTrace();
>         }
>     }
> 
>     private void doSearchPhrase(String phrase, int expectedHits) throws
> Exception {
>         doSearch("\"" + phrase + "\"", expectedHits);
>     }
>     public static void main(String[] args) {
>         try {
>             Scratch scratch = new Scratch();
>             scratch.getAnalyzer();
>             scratch.makeTestIndex();
>             scratch.doSearch("underscored_iamunderscored", 1);
>             scratch.doSearch("underscored iamunderscored", 0);
>             scratch.doSearch("hyphenated-iamhyphenated", 1);
>             scratch.doSearch("iamunderscored", 0);
>             scratch.doSearch("underscored", 0);
> 
>             scratch.doSearchPhrase("this is the test text", 1);
>             scratch.doSearchPhrase("text with hyphenated-iamhyphenated",
> 1);
>             scratch.doSearchPhrase("text with underscored_iamunderscored",
> 0);
> 
> 
>         } catch (Exception e) {
>             System.err.println(e.getMessage());
>         }
>     }
> 
>     protected static final String[] STOP_WORDS = { "a", "an", "and",
> "are",
>             "as", "at", "b", "be", "but", "by", "c", "d", "e", "f", "for",
>             "if", "g", "h", "i", "in", "into", "is", "it", "j", "k", "l",
> "m",
>             "n", "no", "not", "o", "of", "on", "or", "p", "q", "r", "s",
>             "such", "t", "that", "the", "their", "then", "there", "these",
>             "they", "this", "to", "u", "v", "w", "was", "will", "with",
> "x",
>             "y", "z" };
> 
> }
> 
> 
> 
> On 9/2/06, Philip Brown <pmb@us.ibm.com> wrote:
>>
>>
>> I tend to agree with Mark.  I tried a query as so...
>>
>>    TermQuery query = new TermQuery(new Term("keywordField", "phrase
>> test"));
>>    IndexSearcher searcher= new IndexSearcher(activeIdx);
>>    Hits hits = searcher.search(query);
>>
>> And this produced the expected results.  When building the index, I did
>> NOT
>> enclose the keywords in quotes -- just added as UN_TOKENIZED.
>>
>> Philip
>>
>>
>> Mark Miller-5 wrote:
>> >
>> > I think if he wants to use the queryparser to parse his search strings
>> > that he has no choice but to modify it. It will eat any pair of quotes
>> > going through it no matter what analyzer is used.
>> >
>> > - Mark
>> >> Well, you're flying blind. Is the behavior rooted in the indexing or
>> >> querying? Since you can't answer that, you're reduced to trying random
>> >> things hoping that one of them works. A little like voodoo. I've
>> wasted
>> >> faaaaarrrrrr too much time trying to solve what I was *sure* was the
>> >> problem
>> >> only to find it was somewhere else (the last place I look, of course)
>> >> <G>...
>> >>
>> >> Using Luke on a RAMDir. No, I don't know how to, but it should be a
>> >> simple
>> >> thing to write the index to an FSDir at the same time you create your
>> >> RAMDir
>> >> and use Luke then. This is debugging, after all.
>> >>
>> >> I'd be really, really, really reluctant to modify the query parser
>> and/or
>> >> the tokenizer, since whenever I've been tempted it's usually because I
>> >> don't
>> >> understand the tools already provided. Then I have to maintain my
>> custom
>> >> code. Which sucks. Although it sure feels more productive to hack a
>> >> bunch of
>> >> code and get something that works 90% of the time, then spend weeks
>> >> making
>> >> the other 10% work than taking two days to find the 3 lines you
>> *really*
>> >> need <G>.
>> >>
>> >> Have you thought of a PatternAnalyzer? It takes a regular expression
>> >> as the
>> >> tokenizer and  (from the Javadoc)
>> >> <<< Efficient Lucene analyzer/tokenizer that preferably operates
on a
>> >> String
>> >> rather than a
>> >> Reader<http://java.sun.com/j2se/1.4/docs/api/java/io/Reader.html>,
>> >> that can flexibly separate text into terms via a regular expression
>> >> Pattern<
>> http://java.sun.com/j2se/1.4/docs/api/java/util/regex/Pattern.html>(with
>> >>
>> >> behaviour identical to
>> >> String.split(String)<
>> http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html#split%28java.lang.String%29
>> >),
>> >>
>> >> and that combines the functionality of
>> >> LetterTokenizer<
>> file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LetterTokenizer.html
>> >,
>> >>
>> >> LowerCaseTokenizer<
>> file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/LowerCaseTokenizer.html
>> >,
>> >>
>> >> WhitespaceTokenizer<
>> file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/WhitespaceTokenizer.html
>> >,
>> >>
>> >> StopFilter<
>> file:///C:/lucene_1.9.1/docs/api/org/apache/lucene/analysis/StopFilter.html
>> >into
>> >>
>> >> a single efficient multi-purpose class.>>>
>> >>
>> >> One word of caution, the regular expression consists of expressions
>> that
>> >> *break* tokens, not expressions that *form* words, which threw me at
>> >> first.
>> >> Just like the doc says, like splitstring <G>.... This is in 2.0,
>> >> although I
>> >> *believe* it's also in the contrib section of 1.9 (or is in the
>> >> regular API,
>> >> I forget).
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>> >>>
>> >>>
>> >>> No, I've never used Luke.  Is there an easy way to examine my
>> >>> RAMDirectory
>> >>> index?  I can create the index with no quoted keywords, and when I
>> >>> search
>> >>> for a keyword, I get back the expected results (just can't search for
>> a
>> >>> phrase that has whitespace in it).  If I create the index with
>> >>> phrases in
>> >>> quotes, then when I search for anything in double quotes, I get back
>> >>> nothing.  If I create the index with everything in quotes, then when
>> I
>> >>> search for anything by the keyword field, I get nothing, regardless
>> of
>> >>> whether I use quotes in the query string or not.  (I can get results
>> >>> back
>> >>> by
>> >>> searching on other fields.)  What do you think?
>> >>>
>> >>> Philip
>> >>>
>> >>>
>> >>> Erick Erickson wrote:
>> >>> >
>> >>> > OK, I've gotta ask. Have you examined your index with Luke to see
>> if
>> >>> what
>> >>> > you *think* is in the index actually *is*???
>> >>> >
>> >>> > Erick
>> >>> >
>> >>> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>> >>> >>
>> >>> >>
>> >>> >> Interesting...just ran a test where I put double quotes around
>> >>> everything
>> >>> >> (including single keywords) of source text and then ran searches
>> >>> for a
>> >>> >> known
>> >>> >> keyword with and without double quotes -- doesn't find either
>> time.
>> >>> >>
>> >>> >>
>> >>> >> Mark Miller-5 wrote:
>> >>> >> >
>> >>> >> > Sorry to hear you're having trouble. You indeed need the
double
>> >>> quotes
>> >>> >> in
>> >>> >> > the source text. You will also need them in the query
string.
>> Make
>> >>> sure
>> >>> >> > they
>> >>> >> > are in both places. My machine is hosed right now or I
would do
>> it
>> >>> for
>> >>> >> you
>> >>> >> > real quick. My guess is that I forgot to mention...no
only do
>> you
>> >>> need
>> >>> >> to
>> >>> >> > add the <QUOTED> definiton to the TOKEN section,
but below that
>> you
>> >>> >> will
>> >>> >> > find the grammer...you need to add <QUOTED> to the
grammer. If
>> you
>> >>> look
>> >>> >> > how
>> >>> >> > <NUM> and <APOSTROPHE> are done you will prob
see what you
>> >>> should do.
>> >>> >> If
>> >>> >> > not, my machine should be back up tomarrow...
>> >>> >> >
>> >>> >> > - Mark
>> >>> >> >
>> >>> >> > On 9/1/06, Philip Brown <pmb@us.ibm.com> wrote:
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Well, I tried that, and it doesn't seem to work still.
 I would
>> be
>> >>> >> happy
>> >>> >> >> to
>> >>> >> >> zip up the new files, so you can see what I'm using
-- maybe
>> >>> you can
>> >>> >> get
>> >>> >> >> it
>> >>> >> >> to work.  The first time, I tried building the documents
>> without
>> >>> >> quotes
>> >>> >> >> surrounding each phrase.  Then, I retried by enclosing
every
>> >>> phrase
>> >>> >> >> within
>> >>> >> >> double quotes.  Neither seemed to work.  When constructing
the
>> >>> query
>> >>> >> >> string
>> >>> >> >> for the search, I always added the double quotes (otherwise,
>> it'd
>> >>> >> think
>> >>> >> >> it
>> >>> >> >> was multiple terms).  (I didn't even test the underscore
and
>> >>> >> hyphenated
>> >>> >> >> terms.)  I thought Lucene was (sort of by default)
set up to
>> >>> search
>> >>> >> >> quoted
>> >>> >> >> phrases.  From
>> >>> http://lucene.apache.org/java/docs/api/index.html -->
>> >>> A
>> >>> >> >> Phrase is a group of words surrounded by double quotes
such as
>> >>> "hello
>> >>> >> >> dolly".  So, this should be easy, right?  I must be
missing
>> >>> something
>> >>> >> >> stupid.
>> >>> >> >>
>> >>> >> >> Thanks,
>> >>> >> >>
>> >>> >> >> Philip
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Mark Miller-5 wrote:
>> >>> >> >> >
>> >>> >> >> > So this will recognize anything in quotes as
a single token
>> and
>> >>> '_'
>> >>> >> and
>> >>> >> >> > '-' will not break up words. There may be some
repercussions
>> for
>> >>> the
>> >>> >> >> NUM
>> >>> >> >> > token but nothing I'd worry about. maybe you
want to use
>> Unicode
>> >>> for
>> >>> >> >> '-'
>> >>> >> >> > and '_' as well...I wouldn't worry about it myself.
>> >>> >> >> >
>> >>> >> >> > - Mark
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > TOKEN : {                      // token patterns
>> >>> >> >> >
>> >>> >> >> >   // basic word: a sequence of digits & letters
>> >>> >> >> >   <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+
>
>> >>> >> >> >
>> >>> >> >> > | <QUOTED:     "\"" (~["\""])+ "\"">
>> >>> >> >> >
>> >>> >> >> >   // internal apostrophes: O'Reilly, you're,
O'Reilly's
>> >>> >> >> >   // use a post-filter to remove possesives
>> >>> >> >> > | <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+
>
>> >>> >> >> >
>> >>> >> >> >   // acronyms: U.S.A., I.B.M., etc.
>> >>> >> >> >   // use a post-filter to remove dots
>> >>> >> >> > | <ACRONYM: <ALPHA> "." (<ALPHA>
".")+ >
>> >>> >> >> >
>> >>> >> >> >   // company names like AT&T and Excite@Home.
>> >>> >> >> > | <COMPANY: <ALPHA> ("&"|"@") <ALPHA>
>
>> >>> >> >> >
>> >>> >> >> >   // email addresses
>> >>> >> >> > | <EMAIL: <ALPHANUM> (("."|"-"|"_")
<ALPHANUM>)* "@"
>> <ALPHANUM>
>> >>> >> >> > (("."|"-") <ALPHANUM>)+ >
>> >>> >> >> >
>> >>> >> >> >   // hostname
>> >>> >> >> > | <HOST: <ALPHANUM> ("." <ALPHANUM>)+
>
>> >>> >> >> >
>> >>> >> >> >   // floating point, serial, model numbers, ip
addresses,
>> etc.
>> >>> >> >> >   // every other segment must have at least one
digit
>> >>> >> >> > | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
>> >>> >> >> >        | <ALPHANUM> (<P> <HAS_DIGIT>
<P> <ALPHANUM>)+
>> >>> >> >> >        | <HAS_DIGIT> (<P> <ALPHANUM>
<P> <HAS_DIGIT>)+
>> >>> >> >> >        | <ALPHANUM> <P> <HAS_DIGIT>
(<P> <ALPHANUM> <P>
>> >>> >> <HAS_DIGIT>)+
>> >>> >> >> >        | <HAS_DIGIT> <P> <ALPHANUM>
(<P> <HAS_DIGIT> <P>
>> >>> >> <ALPHANUM>)+
>> >>> >> >> >         )
>> >>> >> >> >   >
>> >>> >> >> > | <#P: ("_"|"-"|"/"|"."|",") >
>> >>> >> >> > | <#HAS_DIGIT:                      // at
least one digit
>> >>> >> >> >     (<LETTER>|<DIGIT>)*
>> >>> >> >> >     <DIGIT>
>> >>> >> >> >     (<LETTER>|<DIGIT>)*
>> >>> >> >> >   >
>> >>> >> >> >
>> >>> >> >> > | < #ALPHA: (<LETTER>)+>
>> >>> >> >> > | < #LETTER:                      // unicode
letters
>> >>> >> >> >       [
>> >>> >> >> >        "\u0041"-"\u005a",
>> >>> >> >> >        "\u0061"-"\u007a",
>> >>> >> >> >        "\u00c0"-"\u00d6",
>> >>> >> >> >        "\u00d8"-"\u00f6",
>> >>> >> >> >        "\u00f8"-"\u00ff",
>> >>> >> >> >        "\u0100"-"\u1fff",
>> >>> >> >> >        "-", "_"
>> >>> >> >> >       ]
>> >>> >> >> >   >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >>
>> ---------------------------------------------------------------------
>> >>> >> >> > To unsubscribe, e-mail:
>> java-user-unsubscribe@lucene.apache.org
>> >>> >> >> > For additional commands, e-mail:
>> >>> java-user-help@lucene.apache.org
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> View this message in context:
>> >>> >> >>
>> >>> >>
>> >>>
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6106920
>> >>>
>> >>> >> >> Sent from the Lucene - Java Users forum at Nabble.com.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> ---------------------------------------------------------------------
>> >>> >> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> >> >> For additional commands, e-mail:
>> java-user-help@lucene.apache.org
>> >>> >> >>
>> >>> >> >>
>> >>> >> >
>> >>> >> >
>> >>> >>
>> >>> >> --
>> >>> >> View this message in context:
>> >>> >>
>> >>>
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6107649
>> >>>
>> >>> >> Sent from the Lucene - Java Users forum at Nabble.com.
>> >>> >>
>> >>> >>
>> >>> >>
>> ---------------------------------------------------------------------
>> >>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>>
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6109067
>> >>>
>> >>> Sent from the Lucene - Java Users forum at Nabble.com.
>> >>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> >>>
>> >>>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6115360
>> Sent from the Lucene - Java Users forum at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Phrase-search-using-quotes----special-Tokenizer-tf2200760.html#a6125651
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message