lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Combs, Craig" <Craig.Co...@compuware.com>
Subject RE: Why are tokens not being indexed?
Date Mon, 05 Dec 2005 13:20:12 GMT

This is very mysterious

I have check my parser and I'm returned body:<token>.  My analyzer during
indexing returns <token> in the token stream.  But when I perform my search
no results are found.

Is there a way I can see what tokens are actually written by the index
writer of lucene?

My analyzer returns the tokens and my queryparser returns the tokens so I'm
not sure why "SOME" tokens are not being found in the index.  These are
tokens in the middle of a token stream so it's not like they are at the end
or beginning, and I have not found a pattern to them yet.

Any help would be much appreciated.

-Craig

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 
Sent: Wednesday, November 30, 2005 3:41 PM
To: java-user@lucene.apache.org
Subject: Re: Why are tokens not being indexed?


On 30 Nov 2005, at 10:53, Combs, Craig wrote:
> I wrote my own analyzer:  When I view the tokens returned from the
> StemFilter the words I'm searching for are returned from the "Token  
> Next()"
> function.  My terms are under 10,000. I have used luke before how  
> can I make
> it use a custom analyzer?  Thank you for the information.
>
> .... <general code and default stop words> ....
> 	public final TokenStream tokenStream(String fieldName, Reader
> reader) {
> 		TokenStream result = new StandardTokenizer( reader );
> 		result = new LowerCaseFilter( result );
> 		result = new StopFilter( result, stoptable );
> 		result = new ISOLatin1AccentFilter( result );
> 		result = new ItalianStemFilter( result, excltable );
> 		return result;
> 	}
> }
>

If the tokens you're searching for are returned from your analyzer,  
then they will be searchable.  If you're using QueryParser for  
searching, then you need to be sure your Query is being constructed  
as you'd like.

To get Luke to use a custom Analyzer, it should have a no-arg  
constructor, and be put in the classpath that you launch Luke from  
and it should appear in the list.

Hope this helps.

	Erik



>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> Sent: Wednesday, November 30, 2005 8:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why are tokens not being indexed?
>
> What Analyzer are you using?   Have a look at the Analyzer demo with
> Lucene in Action's code, or from my java.net article so you can
> analyze your analyzer.  Also try out Luke, it really is handy for
> seeing inside your index.
>
> And is your text really long (> 10,000 terms)?   If so, you'll need
> to set the max. field length higher on IndexWriter (I always use
> Integer.MAX_VALUE).
>
> 	Erik
>
>
>
> On 30 Nov 2005, at 08:04, Combs, Craig wrote:
>
>> I have a body of text which is being added to a document as
>> unstored.  All
>> the words in the body text are coming through in the token stream for
>> analyzing.  For some reason I can search on some of tokens and
>> others I can
>> not.
>>
>> Take the following string:
>>
>> "L'amministrazione di Uniface View consente un elevato grado di
>> flessibilità
>> e può essere realizzata in base ai requisiti del proprio sito. Nel
>> manuale
>> Guida all'amministrazione di Uniface View  vengono descritte le
>> procedure di
>> amministrazione del sistema. Utilizzare le informazioni riportate
>> di seguito
>> per creare il portale più adatto alle proprie esigenze."
>>
>> If I search for "consente" or "elevato" no results are found.  If I
>> search
>> for "view" or "grado" results are found.  I find an explanation for
>> this
>> behavior.  These tokens are being returned to the reader but don't
>> seem to
>> be making it into the index.
>>
>> Any ideas on how to debugs or an explanation why this might be
>> happening?
>>
>> Regards,
>> Craig Combs
>>
>>
>>
>>
>> The contents of this e-mail are intended for the named addressee
>> only. It
>> contains information that may be confidential. Unless you are the
>> named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose
>> it to anyone else. If you received it in error please notify us
>> immediately
>> and then destroy it.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> The contents of this e-mail are intended for the named addressee  
> only. It
> contains information that may be confidential. Unless you are the  
> named
> addressee or an authorized designee, you may not copy or use it, or  
> disclose
> it to anyone else. If you received it in error please notify us  
> immediately
> and then destroy it.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



The contents of this e-mail are intended for the named addressee only. It
contains information that may be confidential. Unless you are the named
addressee or an authorized designee, you may not copy or use it, or disclose
it to anyone else. If you received it in error please notify us immediately
and then destroy it. 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message