lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Posselt Vestergaard" <ppv_milest...@hotmail.com>
Subject RE: analyzer effecting phrases?
Date Mon, 20 Dec 2004 17:43:50 GMT
Hi again
Thanks for your answer, Otis. My analyzer did not do anything else than
using the WhitespaceAnalyzer/LowerCaseFilter.
However I found out that I got problems with characters such as ",.:" when
searching because of my simple analyzer. (E.g. I would not be able to search
for "world" in the string "Hello world." as . became part of the last word).

Therefore I turned back to the standard analyzer and now do some replacing
of the underscores in my ID string to avoid my original problem. This solved
my phrase problem so that I can now search for phrases. However I still have
the problem with ",.:" described above. As far as I can see the
StandardAnalyzer (the StandardTokenizer that is) should tokenize words
without the ",.:" characters. Am I mistaken? Is there a tokenizer that will
do this?
Thanks for the help!
Regards
Peter

> Date: Mon, 20 Dec 2004 08:19:42 -0800 (PST)
> From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> Subject: analyzer effecting phrases?
> Content-Type: text/plain; charset=us-ascii
> 
> 
> When searching for phrases, what's important is the position of each
> token/word extracted by the Analyzer. 
> WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
> positional information.  There is nothing else in your Analyzer?
> 
> In any case, the following should help you see what your Analyzer is
> doing:
> http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
> augment the code there to provide positional information, too.
> 
> Otis
> 
> -----Original Message-----
> From: Peter Posselt Vestergaard [mailto:ppv_milestone@hotmail.com] 
> Sent: 20. december 2004 15:24
> To: 'lucene-user@jakarta.apache.org'
> Subject: analyzer effecting phrases?
> 
> Hi
> I am building an index of texts, each related to a unique id. 
> The unique ids might contain a number of underscores which 
> will make the standardanalyzer shorten them after it sees the 
> second underscore in a row. Furthermore many of the texts I 
> am indexing is in Italian so the removal of 'trivial' words 
> done by the standard analyzer is not necessarily meaningful 
> for these texts. Therefore I am instead using an analyzer 
> made from the WhitespaceTokenizer and the LowerCaseFilter.
> This works fine for me until I try searching for a phrase. I 
> am searching for a simple phrase containing two words and 
> with double-quotes around it. I have found the phrase in one 
> of the texts so I know it should return at least one result, 
> but none is found. If I remove the double-quotes and searches 
> for the 2 words with AND between them I do find the story.
> Can anyone tell me if this is an obvious (side-)effect of not 
> using the standard analyzer? And is there a better solution 
> to my problem than using the very simple analyzer?
> Best regards
> Peter Vestergaard
> PS: I use the same analyzer for both searching and indexing 
> (of course).

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message