lucene-java-user mailing list archives

From Shashi Kant <shashi_k...@yahoo.com>
Subject Re: Search Problem
Date Sat, 03 Jan 2009 13:48:22 GMT
Amin,

Are you calling Close & Optimize after every addDocument?

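I would suggest something like this:

try
{
      // placeholder condition: e.g. looping through a data reader
      while (moreDocumentsToIndex)
       {
            indexWriter.addDocument(document);
       }
}
finally
{
      // commit, optimise and close once, after all documents are added
      commitAndOptimise(indexWriter);
}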


HTH

Shashi
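
A minimal, self-contained sketch of the batching idea above, for anyone who wants to see the shape of it run. CountingWriter is a hypothetical stand-in written for this illustration, not a Lucene class; it only counts how often the expensive finalise step runs. The document names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchIndexingSketch {

    /** Hypothetical stand-in for an index writer; NOT a Lucene class. */
    static class CountingWriter {
        final List<String> pending = new ArrayList<>();
        final List<String> committed = new ArrayList<>();
        int finaliseCalls = 0;
        boolean closed = false;

        void addDocument(String doc) {
            if (closed) {
                throw new IllegalStateException("writer is closed");
            }
            pending.add(doc);
        }

        /** Models the expensive step: flush everything, then close. */
        void commitAndOptimise() {
            committed.addAll(pending);
            pending.clear();
            finaliseCalls++;
            closed = true;
        }
    }

    /** Adds every document first, then finalises exactly once. */
    static CountingWriter indexAll(Iterable<String> docs) {
        CountingWriter writer = new CountingWriter();
        try {
            for (String doc : docs) {
                writer.addDocument(doc);
            }
        } finally {
            // Runs once, even if adding a document throws.
            writer.commitAndOptimise();
        }
        return writer;
    }

    public static void main(String[] args) {
        CountingWriter w = indexAll(Arrays.asList("a.rtf", "b.rtf", "c.rtf"));
        System.out.println(w.committed.size() + " documents, "
                + w.finaliseCalls + " finalise call(s)");
    }
}
```

With a real IndexWriter the same structure applies: one writer for the whole batch, and the finalise step in the finally block.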


----- Original Message ----
From: Amin Mohammed-Coleman <aminmc@gmail.com>
To: java-user@lucene.apache.org
Sent: Saturday, January 3, 2009 4:02:52 AM
Subject: Re: Search Problem


Hi again!

I think I may have found the problem but I was wondering if you could verify:

I have the following for my indexer:

public void add(Document document) {
    IndexWriter indexWriter = IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
    try {
        indexWriter.addDocument(document);
        LOGGER.debug("Added Document:" + document + " to index");
        commitAndOptimise(indexWriter);
    } catch (CorruptIndexException e) {
        throw new IllegalStateException(e);
    } catch (IOException e) {
        throw new IllegalStateException(e);
    }
}

the commitAndOptimise(indexWriter) looks like this:

private void commitAndOptimise(IndexWriter indexWriter) throws CorruptIndexException, IOException {
    LOGGER.debug("Committing document and closing index writer");
    indexWriter.optimize();
    indexWriter.commit();
    indexWriter.close();
}

It seems that if I comment out optimize, the overview tab in Luke for the rtf document
looks like:

5    id    1234
3    body    document
3    body    body
1    body    test
1    body    rtf
1    name    rtfDocumentToIndex.rtf
1    body    new
1    path    rtfDocumentToIndex.rtf
1    summary    This is a
1    type    RTF_INDEXER
1    body    content


This is more what I expected, although "Amin Mohammed-Coleman" hasn't been stored in the index.
Should I not be using indexWriter.optimize()?

I tried using the search function in luke and got the following results:
body:test ---> returns result
body:document ---> no result
body:content ---> no result
body:rtf ----> returns result


Thanks again...sorry to be sending so many emails about this. I am in the process of designing
and developing a prototype of a document and domain indexing/searching component and I would
like to demo to the rest of my team.


Cheers
Amin



On 3 Jan 2009, at 01:23, Erick Erickson wrote:

> Well, your query results are consistent with what Luke is
> reporting. So I'd go back and test your assumptions. I
> suspect that you're not indexing what you think you are.
> 
> For your test document, I'd just print out what you're indexing
> and the field it's going into. *for each field*. that is, every time you
> do a document.add(<field of some kind>), print out that data. I'm
> pretty sure you'll find that you're not getting what you expect. For
> instance, the call to:
> 
> MetaDataEnum.BODY.getDescription()
> 
> may be returning some nonsense. Or
> bodyText.trim()
> 
> isn't doing what you expect.
> 
> Lucene is used by many folks, and errors of the magnitude you're
> experiencing would be seen by many people and the user list would
> be flooded with complaints if it were a Lucene issue at root. That
> leaves the code you wrote as the most likely culprit. So try a very simple
> test case with lots of debugging println's. I'm pretty sure you'll
> find the underlying issue with some of your assumptions pretty quickly.
> 
> Sorry I can't be more specific, but we'd have to see all of your code
> and the test cases to do that....
> 
> Best
> Erick
> 
> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
> 
>> Hi Erick
>> 
>> Thanks for your reply.
>> 
>> I have used Luke to inspect the document and I am somewhat confused.  For
>> example when I view the index using the overview tab of Luke I get the
>> following:
>> 
>> 1       body    test
>> 1       id      1234
>> 1       name    rtfDocumentToIndex.rtf
>> 1       path    rtfDocumentToIndex.rtf
>> 1       summary This is a
>> 1       type    RTF_INDEXER
>> 1       body    rtf
>> 
>> 
>> However when I view the document in the Document tab I get the full text
>> that was extracted from the rtf document (field:body) which is:
>> 
>> This is a test rtf document that will be indexed.
>> Amin Mohammed-Coleman
>> 
>> I am using the StandardAnalyzer, therefore I wouldn't expect the words
>> document, indexed, Amin Mohammed-Coleman to be removed.
>> 
>> I have referenced the Lucene In Action book and I can't see what I may be
>> doing wrong.  I would be happy to provide a testcase should it be required.
>> When adding the body field to the document I am doing:
>> 
>>       Document document = new Document();
>>       Field field = new Field(FieldNameEnum.BODY.getDescription(),
>>               bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>       document.add(field);
>> 
>> 
>> 
>> When I run the search code the string "test" is the only word that returns
>> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
>> "indexed").
>> 
>> Thanks again for your help and advice.
>> 
>> 
>> Cheers
>> Amin
>> 
>> 
>> 
>> 
>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>> 
>>> Casing is usually handled by the analyzer. Since you construct
>>> the term query programmatically, it doesn't go through
>>> any analyzers, thus is not converted into lower case for
>>> searching as was done automatically for you when you
>>> indexed using StandardAnalyzer.
>>> 
>>> As for why you aren't getting hits, it's unclear to me. But
>>> what I'd do is get a copy of Luke and examine your index
>>> to see what's *really* there. This will often give you clues,
>>> usually pointing to some kind of analyzer behavior that you
>>> weren't expecting.
>>> 
>>> Best
>>> Erick
>>> 
>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>> 
>>>> Hi
>>>> 
>>>> I have tried this and it doesn't work.  I don't understand why using "amin"
>>>> instead of "Amin" would work; is it not case insensitive?
>>>> 
>>>> I tried "test" for field "body" and this works.  Any other terms don't work,
>>>> for example:
>>>> 
>>>> "document"
>>>> "indexed"
>>>> 
>>>> these are tokens that were extracted when creating the lucene document.
>>>> 
>>>> 
>>>> Thanks for your reply.
>>>> 
>>>> Cheers
>>>> 
>>>> Amin
>>>> 
>>>> 
>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>> 
>>>>> Basically Lucene stores analyzed tokens, and looks up matches based
>>>>> on the tokens.
>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use
>>>>> new Term("body", "amin"), instead of new Term("body", "Amin"), to search.
>>>>> 
>>>>> --
>>>>> Chris Lu
>>>>> -------------------------
>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>> site: http://www.dbsight.net
>>>>> demo: http://search.dbsight.com
>>>>> Lucene Database Search in 3 minutes:
>>>>> 
>>>>> 
>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>>> got 2.6 Million Euro funding!
>>>>> 
>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>> 
>>>>>> Hi
>>>>>> 
>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>> 
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>> 
>>>>>>> -- Chris Lu
>>>>>>> -------------------------
>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>> site: http://www.dbsight.net
>>>>>>> demo: http://search.dbsight.com
>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>> DBSight customer, a shopping comparison site, (anonymous per request)
>>>>>>> got 2.6 Million Euro funding!
>>>>>>> 
>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>>>> 
>>>>>>>>> Hi
>>>>>>>>> 
>>>>>>>>> I have created an RTFHandler which takes an RTF file and creates a
>>>>>>>>> lucene Document which is indexed.  The RTFHandler looks something like
>>>>>>>>> this:
>>>>>>>>> 
>>>>>>>>> if (bodyText != null) {
>>>>>>>>>     Document document = new Document();
>>>>>>>>>     Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>>             bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>>     document.add(field);
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> I am using Java's built-in RTF text extraction.  When I run my test to
>>>>>>>>> verify that the document contains the text that I expect, this works
>>>>>>>>> fine.  I get the following when I print the document:
>>>>>>>>> 
>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>>>>> document that will be indexed.
>>>>>>>>> 
>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The problem is that when I use the following to search, I get no result:
>>>>>>>>> 
>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(new Searchable[] {rtfIndexSearcher});
>>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>>> multiSearcher.close();
>>>>>>>>> 
>>>>>>>>> rtfIndexSearcher is configured with the directory that holds rtf
>>>>>>>>> documents.  I have used Luke to look at the document, and what I am
>>>>>>>>> finding in the overview tab is the following for the document:
>>>>>>>>> 
>>>>>>>>> 1       body    test
>>>>>>>>> 1       id      1234
>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>> 1       summary This is a
>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>> 1       body    rtf
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> However, on the Document tab I am getting (in the body field):
>>>>>>>>> 
>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>> 
>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I would expect to get a hit using "Amin" or even "document".  I am
>>>>>>>>> not sure whether the line:
>>>>>>>>> 
>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>> 
>>>>>>>>> is incorrect, as I am not too sure of the meaning of "Finds the top n
>>>>>>>>> hits for query." for search(Query query, int n) according to the javadocs.
>>>>>>>>> 
>>>>>>>>> I would be grateful if someone could advise on what I may be
>>>>>>>>> doing wrong.  I am using Lucene 2.4.0
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Amin
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>> 
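
The casing point made earlier in the thread (tokens are lowercased at index time, while a hand-built term is looked up exactly as given) can be sketched with a toy inverted index. This is an illustration only: the analyze method below is a simplified stand-in that splits on non-alphanumerics and lowercases, not StandardAnalyzer (which, among other things, also handles stop words), and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TermLookupSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Toy analyzer: lowercase, then split on non-alphanumerics. */
    static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Index time: the body is analyzed, so only lowercased tokens are stored. */
    void addDocument(int docId, String body) {
        for (String token : analyze(body)) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }
    }

    /** Exact lookup, like a hand-built term query: the term is NOT analyzed. */
    int hitsForRawTerm(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet()).size();
    }

    /** Lookup that first runs the word through the same toy analyzer. */
    int hitsForAnalyzedWord(String word) {
        for (String token : analyze(word)) {
            if (!token.isEmpty()) {
                return hitsForRawTerm(token);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.addDocument(1,
                "This is a test rtf document that will be indexed.\nAmin Mohammed-Coleman");
        System.out.println(index.hitsForRawTerm("Amin"));      // 0
        System.out.println(index.hitsForRawTerm("amin"));      // 1
        System.out.println(index.hitsForAnalyzedWord("Amin")); // 1
    }
}
```

The sketch covers only the casing mismatch; it does not model why terms such as "document" were missing from the index in the thread.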


