lucene-java-user mailing list archives

From Amin Mohammed-Coleman <ami...@gmail.com>
Subject Re: Search Problem
Date Sat, 03 Jan 2009 09:02:52 GMT

Hi again!

I think I may have found the problem but I was wondering if you could  
verify:

I have the following for my indexer:

public void add(Document document) {
	IndexWriter indexWriter =
			IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
	try {
		indexWriter.addDocument(document);
		LOGGER.debug("Added Document:" + document + " to index");
		commitAndOptimise(indexWriter);
	} catch (CorruptIndexException e) {
		throw new IllegalStateException(e);
	} catch (IOException e) {
		throw new IllegalStateException(e);
	}
}

the commitAndOptimise(indexWriter) looks like this:

private void commitAndOptimise(IndexWriter indexWriter)
		throws CorruptIndexException, IOException {
	LOGGER.debug("Committing document and closing index writer");
	indexWriter.optimize();
	indexWriter.commit();
	indexWriter.close();
}

It seems that if I comment out optimize(), the Overview tab in
Luke for the rtf document looks like this:

5	id	1234
3	body	document
3	body	body
1	body	test
1	body	rtf
1	name	rtfDocumentToIndex.rtf
1	body	new
1	path	rtfDocumentToIndex.rtf
1	summary	This is a
1	type	RTF_INDEXER
1	body	content


This is more what I expected although "Amin Mohammed-Coleman" hasn't  
been stored in the index.  Should I not be using  
indexWriter.optimize() ?

I tried using the search function in Luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ---> returns result


Thanks again... sorry to be sending so many emails about this. I am in
the process of designing and developing a prototype of a document and
domain indexing/searching component and I would like to demo it to the
rest of my team.


Cheers
Amin



On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
>
> For your test document, I'd just print out what you're indexing
> and the field it's going into, *for each field*. That is, every time you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
>
> MetaDataEnum.BODY.getDescription()
>
> may be returning some nonsense. Or
> bodyText.trim()
>
> isn't doing what you expect.
>
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty quickly.
>
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
>
> Best
> Erick
>
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>
>> Hi Erick
>>
>> Thanks for your reply.
>>
>> I have used Luke to inspect the document and I am somewhat confused.  For
>> example, when I view the index using the Overview tab of Luke I get the
>> following:
>>
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>>
>>
>> However, when I view the document in the Document tab I get the full text
>> that was extracted from the rtf document (field:body), which is:
>>
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>>
>> I am using the StandardAnalyzer, therefore I wouldn't expect the words
>> document, indexed, Amin Mohammed-Coleman to be removed.
>>
>> I have referenced the Lucene in Action book and I can't see what I may be
>> doing wrong.  I would be happy to provide a test case should it be required.
>> When adding the body field to the document I am doing:
>>
>>     Document document = new Document();
>>     Field field = new Field(FieldNameEnum.BODY.getDescription(),
>>             bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>     document.add(field);
>>
>>
>>
>> When I run the search code the string "test" is the only word that returns
>> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
>> "indexed").
>>
>> Thanks again for your help and advice.
>>
>>
>> Cheers
>> Amin
>>
>>
>>
>>
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>
>>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>>
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>>
>>> Best
>>> Erick
>>>
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I have tried this and it doesn't work.  I don't understand why using "amin"
>>>> instead of "Amin" would work; is it not case insensitive?
>>>>
>>>> I tried "test" for field "body" and this works.  Any other terms don't
>>>> work, for example:
>>>>
>>>> "document"
>>>> "indexed"
>>>>
>>>> These are tokens that were extracted when creating the Lucene document.
>>>>
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Cheers
>>>>
>>>> Amin
>>>>
>>>>
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>
>>>>> Basically Lucene stores analyzed tokens and looks up matches based on
>>>>> those tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use
>>>>> new Term("body", "amin") instead of new Term("body", "Amin") to search.
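The casing point above can be modelled without Lucene. The following is a minimal sketch, not Lucene code: `ToyIndex`, `analyze`, and `termLookup` are hypothetical names invented for illustration, and the stand-in analyzer only splits on non-alphanumerics and lower-cases (it does not model stop-word removal or anything else StandardAnalyzer may do). It shows why an exact term lookup for "Amin" misses when the indexed tokens were lower-cased.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy inverted index: NOT Lucene, only a model of analyze-on-index vs. raw term lookup. */
class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Hypothetical stand-in for an analyzer: split on non-alphanumerics, lower-case each token. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{Alnum}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Index time: text passes through analyze(), so only lower-cased tokens are stored. */
    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(docId);
        }
    }

    /** Models a programmatically built term query: exact lookup, the term is NOT analyzed. */
    Set<Integer> termLookup(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }
}
```

With the thread's sample text indexed as document 1, `termLookup("Amin")` returns an empty set while `termLookup("amin")` contains 1, because only the lower-cased token was ever stored.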
>>>>>
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>>
>>>>>
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per  
>>>>> request)
>>>>> got
>>>>> 2.6 Million Euro funding!
>>>>>
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <
>>>>> aminmc@gmail.com
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>
>>>>> Hi
>>>>>
>>>>>>
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>
>>>>>> You need to let us know the analyzer you are using.
>>>>>>
>>>>>> -- Chris Lu
>>>>>>>
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <
>>>>>>> aminmc@gmail.com
>>>>>>>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I have created a RTFHandler which takes a RTF file and creates a
>>>>>>>>> Lucene Document which is indexed.  The RTFHandler looks something
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>> if (bodyText != null) {
>>>>>>>>>     Document document = new Document();
>>>>>>>>>     Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>>             bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>>     document.add(field);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I am using Java's built-in RTF text extraction.  When I run my test to
>>>>>>>>> verify that the document contains the text that I expect, this works
>>>>>>>>> fine.  I get the following when I print the document:
>>>>>>>>>
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem is when I use the following to search I get no result:
>>>>>>>>>
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(
>>>>>>>>>         new Searchable[] {rtfIndexSearcher});
>>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>>> multiSearcher.close();
>>>>>>>>>
>>>>>>>>> rtfIndexSearcher is configured with the directory that holds rtf
>>>>>>>>> documents.  I have used Luke to look at the document, and what I am
>>>>>>>>> finding in the Overview tab is the following for the document:
>>>>>>>>>
>>>>>>>>> 1       body    test
>>>>>>>>> 1       id      1234
>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>> 1       summary This is a
>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>> 1       body    rtf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would expect to get a hit using "Amin" or even "document".  I am not
>>>>>>>>> sure whether the line:
>>>>>>>>>
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>
>>>>>>>>> is incorrect, as I am not too sure of the meaning of "Finds the top n
>>>>>>>>> hits for query." for search(Query query, int n) according to the
>>>>>>>>> Javadocs.
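On the meaning of "top n hits": as far as I understand the API, n only caps how many of the best-scoring matches are returned, while totalHits still counts every match, so passing 1 would not by itself produce a total of zero. The sketch below is a toy model of that behaviour, not Lucene code: `ToyTopDocs` and its `search` method are hypothetical names, and scores are supplied directly as an array rather than computed from a query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of a "top n hits" result; names echo Lucene's TopDocs but this is NOT Lucene code. */
class ToyTopDocs {
    final int totalHits;         // number of ALL matching documents, independent of n
    final List<Integer> topIds;  // ids of at most n best-scoring matches, best first

    ToyTopDocs(int totalHits, List<Integer> topIds) {
        this.totalHits = totalHits;
        this.topIds = topIds;
    }

    /** scores[i] is the (assumed) score of document i; a score of 0 means "no match". */
    static ToyTopDocs search(double[] scores, int n) {
        List<Integer> matches = new ArrayList<>();
        for (int id = 0; id < scores.length; id++) {
            if (scores[id] > 0) {
                matches.add(id);
            }
        }
        // Order matches by descending score.
        matches.sort(Comparator.comparingDouble((Integer id) -> scores[id]).reversed());
        // Keep only the first n; the total count is unaffected by this cut.
        List<Integer> top = new ArrayList<>(matches.subList(0, Math.min(n, matches.size())));
        return new ToyTopDocs(matches.size(), top);
    }
}
```

For example, searching scores `{0.0, 0.4, 0.9, 0.0, 0.7}` with n = 1 gives a totalHits of 3 and a single returned id, 2.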
>>>>>>>>>
>>>>>>>>> I would be grateful if someone may be able to advise on what I may be
>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>

