lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Amin Mohammed-Coleman <ami...@gmail.com>
Subject Re: Search Problem
Date Fri, 02 Jan 2009 23:13:04 GMT
Hi Erick

Thanks for your reply.

I have used luke to inspect the document and I am some what confused.   
For example when I view the index using the overview tab of Luke I get  
the following:

1	body	test
1	id	1234
1	name	rtfDocumentToIndex.rtf
1	path	rtfDocumentToIndex.rtf
1	summary	This is a
1	type	RTF_INDEXER
1	body	rtf


However when I view the document in the Document tab I get the full  
text that was extracted from the rft document (field:body) which is:

This is a test rtf document that will be indexed.
Amin Mohammed-Coleman

I am using the StandardAnaylzer therefore I wouldnt expect the words  
document, indexed, Amin Mohammed-Coleman to be removed.

I have referenced the Lucene In Action book and I can't see what I may  
be doing wrong.  I would be happy to provide a testcase should it be  
required.  When adding the body field to the document I am doing:

	Document document = new Document();
			Field field = new Field(FieldNameEnum.BODY.getDescription(),  
bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
			document.add(field);
		


When I run the search code the string "test" is the only word that  
returns a result (TopDocs), whereas the others do not (e.g. "amin",  
"document", "indexed").

Thanks again for your help and advice.


Cheers
Amin



On 2 Jan 2009, at 21:20, Erick Erickson wrote:

> Casing is usually handled by the analyzer. Since you construct
> the term query programmatically, it doesn't go through
> any analyzers, thus is not converted into lower case for
> searching as was done automatically for you when you
> indexed using StandardAnalyzer.
>
> As for why you aren't getting hits, it's unclear to me. But
> what I'd do is get a copy of Luke and examine your index
> to see what's *really* there. This will often give you clues,
> usually pointing to some kind of analyzer behavior that you
> weren't expecting.
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com 
> >wrote:
>
>> Hi
>>
>> I have tried this and it doesn't work.  I don't understand why  
>> using "amin"
>> instead of "Amin" would work, is it not case insensitive?
>>
>> I tried "test" for field "body" and this works.  Any other terms  
>> don't work
>> for example:
>>
>> "document"
>> "indexed"
>>
>> these are tokens that were extracted when creating the lucene  
>> document.
>>
>>
>> Thanks for your reply.
>>
>> Cheers
>>
>> Amin
>>
>>
>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>
>> Basically Lucene stores analyzed tokens, and looks up for the matches
>>> based
>>> on the tokens.
>>> "Amin" after StandardAnalyzer is "amin", so you need to use new
>>> Term("body",
>>> "amin"), instead of new Term("body", "Amin"), to search.
>>>
>>> --
>>> Chris Lu
>>> -------------------------
>>> Instant Scalable Full-Text Search On Any Database/Application
>>> site: http://www.dbsight.net
>>> demo: http://search.dbsight.com
>>> Lucene Database Search in 3 minutes:
>>>
>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>> DBSight customer, a shopping comparison site, (anonymous per  
>>> request) got
>>> 2.6 Million Euro funding!
>>>
>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>> wrote:
>>>
>>> Hi
>>>>
>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>>
>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>
>>>> You need to let us know the analyzer you are using.
>>>>
>>>>> -- Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>> request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi
>>>>>>
>>>>>>>
>>>>>>> I have created a RTFHandler which takes a RTF file and creates
a
>>>>>>> lucene
>>>>>>> Document which is indexed.  The RTFHandler looks like  
>>>>>>> something like
>>>>>>> this:
>>>>>>>
>>>>>>> if (bodyText != null) {
>>>>>>>                   Document document = new Document();
>>>>>>>                   Field field = new
>>>>>>> Field(MetaDataEnum.BODY.getDescription(), bodyText.trim(),
>>>>>>> Field.Store.YES,
>>>>>>> Field.Index.ANALYZED);
>>>>>>>                   document.add(field);
>>>>>>>
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> I am using Java Built in RTF text extraction.  When I run my
 
>>>>>>> test to
>>>>>>> verify that the document contains text that I expect this  
>>>>>>> works fine.
>>>>>>> I
>>>>>>> get
>>>>>>> the following when I print the document:
>>>>>>>
>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This
is a  
>>>>>>> test rtf
>>>>>>> document that will be indexed.
>>>>>>>
>>>>>>> Amin Mohammed-Coleman>
>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>
>>>>>>>
>>>>>>> The problem is when I use the following to search I get no  
>>>>>>> result:
>>>>>>>
>>>>>>>   MultiSearcher multiSearcher = new MultiSearcher(new  
>>>>>>> Searchable[]
>>>>>>> {rtfIndexSearcher});
>>>>>>>                   Term t = new Term("body", "Amin");
>>>>>>>                   TermQuery termQuery = new TermQuery(t);
>>>>>>>                   TopDocs topDocs =  
>>>>>>> multiSearcher.search(termQuery,
>>>>>>> 1);
>>>>>>>                   System.out.println(topDocs.totalHits);
>>>>>>>                   multiSearcher.close();
>>>>>>>
>>>>>>> RftIndexSearcher is configured with the directory that holds
rtf
>>>>>>> documents.  I have used Luke to look at the document and what
 
>>>>>>> I am
>>>>>>> finding
>>>>>>> in the overview tab is the following for the document:
>>>>>>>
>>>>>>> 1       body    test
>>>>>>> 1       id      1234
>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>> 1       summary This is a
>>>>>>> 1       type    RTF_INDEXER
>>>>>>> 1       body    rtf
>>>>>>>
>>>>>>>
>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>
>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>
>>>>>>> Amin Mohammed-Coleman
>>>>>>>
>>>>>>>
>>>>>>> I would expect to get a hit using "Amin" or even "document".
  
>>>>>>> I am not
>>>>>>> sure whether the
>>>>>>> line:
>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>
>>>>>>> is incorrect as I am not too sure of the meaning of "Finds the
 
>>>>>>> top n
>>>>>>> hits
>>>>>>> for query." for search (Query query, int n) according to java
 
>>>>>>> docs.
>>>>>>>
>>>>>>> I would be grateful if someone may be able to advise on what
I  
>>>>>>> may be
>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>> Amin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message