lucene-java-user mailing list archives

From Amin Mohammed-Coleman <ami...@gmail.com>
Subject Re: Search Problem
Date Sat, 03 Jan 2009 14:01:36 GMT
Hi

I am currently doing this because the indexer is called from an upload
action. There is no bulk file-processing functionality at the moment.


Cheers

Sent from my iPhone

On 3 Jan 2009, at 13:48, Shashi Kant <shashi_kant@yahoo.com> wrote:

> Amin,
>
> Are you calling Close & Optimize after every addDocument?
>
> I would suggest something like this:
>
> try
> {
>      while (reader.hasNext())  // e.g. looping through a data reader
>      {
>           indexWriter.addDocument(document);
>      }
> }
> finally
> {
>      commitAndOptimise(indexWriter);
> }
>
>
> HTH
>
> Shashi
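[Editor's note: the open-once, add-many, commit-once pattern Shashi describes can be sketched with a stub. `DummyWriter` is a hypothetical stand-in for Lucene's IndexWriter, used only to count how often each operation runs; it is not the real API.]

```java
// Sketch: add many documents, then commit/optimize exactly once in finally.
public class CommitOnce {
    static class DummyWriter {
        int adds = 0, commits = 0;
        void addDocument(String doc) { adds++; }      // cheap, in-memory
        void commitAndOptimise()     { commits++; }   // expensive, do once
    }

    public static void main(String[] args) {
        DummyWriter writer = new DummyWriter();
        try {
            for (String doc : new String[] {"doc1", "doc2", "doc3"}) {
                writer.addDocument(doc);
            }
        } finally {
            writer.commitAndOptimise();
        }
        System.out.println("adds=" + writer.adds + " commits=" + writer.commits);
    }
}
```

The point of the pattern is the ratio: N adds, one commit, rather than one commit per add.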
>
>
> ----- Original Message ----
> From: Amin Mohammed-Coleman <aminmc@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Saturday, January 3, 2009 4:02:52 AM
> Subject: Re: Search Problem
>
>
> Hi again!
>
> I think I may have found the problem but I was wondering if you  
> could verify:
>
> I have the following for my indexer:
>
> public void add(Document document) {
>     IndexWriter indexWriter = IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
>     try {
>         indexWriter.addDocument(document);
>         LOGGER.debug("Added Document: " + document + " to index");
>         commitAndOptimise(indexWriter);
>     } catch (CorruptIndexException e) {
>         throw new IllegalStateException(e);
>     } catch (IOException e) {
>         throw new IllegalStateException(e);
>     }
> }
>
> the commitAndOptimise(indexWriter) looks like this:
>
> private void commitAndOptimise(IndexWriter indexWriter) throws CorruptIndexException, IOException {
>     LOGGER.debug("Committing document and closing index writer");
>     indexWriter.optimize();
>     indexWriter.commit();
>     indexWriter.close();
> }
>
> It seems that if I comment out optimize() then the overview tab in Luke
> for the rtf document looks like:
>
> 5    id    1234
> 3    body    document
> 3    body    body
> 1    body    test
> 1    body    rtf
> 1    name    rtfDocumentToIndex.rtf
> 1    body    new
> 1    path    rtfDocumentToIndex.rtf
> 1    summary    This is a
> 1    type    RTF_INDEXER
> 1    body    content
>
>
> This is more like what I expected, although "Amin Mohammed-Coleman" hasn't
> been stored in the index. Should I not be using indexWriter.optimize()?
>
> I tried using the search function in Luke and got the following results:
> body:test ---> returns result
> body:document ---> no result
> body:content ---> no result
> body:rtf ----> returns result
>
>
> Thanks again... sorry to be sending so many emails about this. I am in
> the process of designing and developing a prototype of a document and
> domain indexing/searching component, and I would like to demo it to the
> rest of my team.
>
>
> Cheers
> Amin
>
>
>
> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>
>> Well, your query results are consistent with what Luke is
>> reporting. So I'd go back and test your assumptions. I
>> suspect that you're not indexing what you think you are.
>>
>> For your test document, I'd just print out what you're indexing
>> and the field it's going into, *for each field*. That is, every time you
>> do a document.add(<field of some kind>), print out that data. I'm
>> pretty sure you'll find that you're not getting what you expect. For
>> instance, the call to:
>>
>> MetaDataEnum.BODY.getDescription()
>>
>> may be returning some nonsense. Or
>> bodyText.trim()
>>
>> isn't doing what you expect.
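[Editor's note: Erick's print-every-field suggestion can be sketched as below. A plain LinkedHashMap stands in for the Lucene Document; the field names mirror the thread and the values are illustrative, not real data.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Debugging sketch: before each document.add(field), print the field name
// and the exact value going in, so analyzer/extraction surprises show up.
public class FieldDump {
    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("body", "This is a test rtf document");
        fields.put("path", "rtfDocumentToIndex.rtf");

        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Brackets make stray whitespace or empty values visible.
            System.out.println("indexing field=" + e.getKey()
                    + " value=[" + e.getValue().trim() + "]");
        }
    }
}
```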
>>
>> Lucene is used by many folks, and errors of the magnitude you're
>> experiencing would be seen by many people and the user list would
>> be flooded with complaints if it were a Lucene issue at root. That
>> leaves the code you wrote as the most likely culprit. So try a very  
>> simple
>> test case with lots of debugging println's. I'm pretty sure you'll
>> find the underlying issue with some of your assumptions pretty  
>> quickly.
>>
>> Sorry I can't be more specific, but we'd have to see all of your code
>> and the test cases to do that....
>>
>> Best
>> Erick
>>
>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>
>>> Hi Erick
>>>
>>> Thanks for your reply.
>>>
>>> I have used Luke to inspect the document and I am somewhat confused. For
>>> example, when I view the index using the overview tab of Luke I get the
>>> following:
>>>
>>> 1       body    test
>>> 1       id      1234
>>> 1       name    rtfDocumentToIndex.rtf
>>> 1       path    rtfDocumentToIndex.rtf
>>> 1       summary This is a
>>> 1       type    RTF_INDEXER
>>> 1       body    rtf
>>>
>>>
>>> However, when I view the document in the Document tab I get the full text
>>> that was extracted from the rtf document (field: body), which is:
>>>
>>> This is a test rtf document that will be indexed.
>>> Amin Mohammed-Coleman
>>>
>>> I am using the StandardAnalyzer, therefore I wouldn't expect the words
>>> "document", "indexed", or "Amin Mohammed-Coleman" to be removed.
>>>
>>> I have referenced the Lucene in Action book and I can't see what I may be
>>> doing wrong. I would be happy to provide a test case should it be required.
>>> When adding the body field to the document I am doing:
>>>
>>> Document document = new Document();
>>> Field field = new Field(FieldNameEnum.BODY.getDescription(),
>>>     bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>> document.add(field);
>>>
>>>
>>>
>>> When I run the search code, the string "test" is the only word that
>>> returns a result (TopDocs), whereas the others (e.g. "amin", "document",
>>> "indexed") do not.
>>>
>>> Thanks again for your help and advice.
>>>
>>>
>>> Cheers
>>> Amin
>>>
>>>
>>>
>>>
>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>
>>>> Casing is usually handled by the analyzer. Since you construct
>>>> the term query programmatically, it doesn't go through
>>>> any analyzers, and thus is not converted into lower case for
>>>> searching, as was done automatically for you when you
>>>> indexed using StandardAnalyzer.
>>>>
>>>> As for why you aren't getting hits, it's unclear to me. But
>>>> what I'd do is get a copy of Luke and examine your index
>>>> to see what's *really* there. This will often give you clues,
>>>> usually pointing to some kind of analyzer behavior that you
>>>> weren't expecting.
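[Editor's note: the mismatch Erick describes can be illustrated without Lucene. The tokenizer below is a crude approximation of what StandardAnalyzer does at index time (lowercase, split on non-alphanumerics); a hand-built Term is looked up verbatim, with no analysis.]

```java
import java.util.HashSet;
import java.util.Set;

// Why TermQuery("body", "Amin") misses: the index holds lowercased tokens,
// but a programmatically built Term is matched exactly as written.
public class CaseDemo {
    // Rough stand-in for StandardAnalyzer, not the real thing.
    static Set<String> analyze(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> index = analyze("This is a test rtf document. Amin Mohammed-Coleman");
        // TermQuery performs no analysis: "Amin" is looked up as given.
        System.out.println("Amin -> " + index.contains("Amin"));
        System.out.println("amin -> " + index.contains("amin"));
    }
}
```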
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>
>>>> Hi
>>>>>
>>>>> I have tried this and it doesn't work. I don't understand why using
>>>>> "amin" instead of "Amin" would work; is the search not case-insensitive?
>>>>>
>>>>> I tried "test" for the field "body" and this works. Other terms don't
>>>>> work, for example:
>>>>>
>>>>> "document"
>>>>> "indexed"
>>>>>
>>>>> These are tokens that were extracted when creating the Lucene document.
>>>>>
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Amin
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>
>>>>>> Basically, Lucene stores analyzed tokens and looks up matches based on
>>>>>> those tokens. "Amin" after StandardAnalyzer is "amin", so you need to
>>>>>> use new Term("body", "amin"), instead of new Term("body", "Amin"), to
>>>>>> search.
>>>>>>
>>>>>> --
>>>>>> Chris Lu
>>>>>> -------------------------
>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>> site: http://www.dbsight.net
>>>>>> demo: http://search.dbsight.com
>>>>>> Lucene Database Search in 3 minutes:
>>>>>>
>>>>>>
>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>> DBSight customer, a shopping comparison site (anonymous per request),
>>>>>> got 2.6 Million Euro funding!
>>>>>>
>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>>>
>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>
>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>
>>>>>>> -- Chris Lu
>>>>>>>>
>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I have created an RTFHandler which takes an RTF file and creates a
>>>>>>>>>> Lucene Document which is indexed. The RTFHandler looks something
>>>>>>>>>> like this:
>>>>>>>>>>
>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>     Document document = new Document();
>>>>>>>>>>     Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>>>         bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>>>     document.add(field);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I am using the Java built-in RTF text extraction. When I run my
>>>>>>>>>> test to verify that the document contains the text I expect, this
>>>>>>>>>> works fine. I get the following when I print the document:
>>>>>>>>>>
>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test
>>>>>>>>>> rtf document that will be indexed.
>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>> stored/uncompressed,indexed<summary:This is a
>>>>>>>>>>
>>>>>>>>>> The problem is that when I use the following to search, I get no
>>>>>>>>>> result:
>>>>>>>>>>
>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(
>>>>>>>>>>     new Searchable[] {rtfIndexSearcher});
>>>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>>>> multiSearcher.close();
>>>>>>>>>> The rtfIndexSearcher is configured with the directory that holds
>>>>>>>>>> the rtf documents. I have used Luke to look at the document, and
>>>>>>>>>> what I am finding in the overview tab is the following:
>>>>>>>>>>
>>>>>>>>>> 1       body    test
>>>>>>>>>> 1       id      1234
>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>> 1       summary This is a
>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>> 1       body    rtf
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> However, on the Document tab I am getting (in the body field):
>>>>>>>>>>
>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>
>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I would expect to get a hit using "Amin" or even "document". I am
>>>>>>>>>> not sure whether the line:
>>>>>>>>>>
>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>
>>>>>>>>>> is incorrect, as I am not too sure of the meaning of "Finds the top
>>>>>>>>>> n hits for query." for search(Query query, int n) in the javadocs.
>>>>>>>>>>
>>>>>>>>>> I would be grateful if someone could advise on what I may be doing
>>>>>>>>>> wrong. I am using Lucene 2.4.0.
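[Editor's note: "finds the top n hits" means n only caps how many hits are returned; the total hit count still reflects every match. A toy simulation of that contract, with made-up scores and no Lucene involved:]

```java
import java.util.Arrays;

// search(query, n): n limits the returned hits, not the count of matches.
// Suppose a query matches 4 documents with the scores below; asking for
// the top 1 still reports totalHits = 4.
public class TopNDemo {
    public static void main(String[] args) {
        double[] scores = {0.3, 0.9, 0.1, 0.7}; // illustrative match scores
        int n = 1;
        double[] sorted = scores.clone();
        Arrays.sort(sorted);                     // ascending order
        int totalHits = scores.length;
        System.out.println("totalHits=" + totalHits);
        System.out.println("returned=" + Math.min(n, totalHits)
                + " topScore=" + sorted[sorted.length - 1]);
    }
}
```

So topDocs.totalHits being nonzero is the signal that a document matched, even when only one ScoreDoc comes back.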
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Amin
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

