lucene-java-user mailing list archives

From Amin Mohammed-Coleman <ami...@gmail.com>
Subject Re: Search Problem
Date Sat, 03 Jan 2009 20:25:31 GMT
Hi

I have uploaded to google docs:

url: http://docs.google.com/Doc?id=d77xf5q_0n6hb38fx

Hope this works.


Cheers
Amin
On 3 Jan 2009, at 19:53, Grant Ingersoll wrote:

> The mailing list often strips attachments (in fact, I'm surprised  
> your earlier ones made it through).  Perhaps you can put them up  
> somewhere for download.
>
>
> On Jan 3, 2009, at 1:07 PM, Amin Mohammed-Coleman wrote:
>
>> Hi again
>>
>> Sorry I didn't include the WorkItem class!  Here is the final test  
>> case.  Apologies!
>> On 3 Jan 2009, at 14:02, Grant Ingersoll wrote:
>>
>>> You shouldn't need to call close and optimize after each document.
>>>
>>> You also don't need the commit if you are going to immediately  
>>> close.
>>>
>>> Also, can you send a standalone test that shows the RTF  
>>> extraction, the document creation and the indexing code that  
>>> demonstrates your issue.
>>>
>>> FWIW, and as a complete aside to save you some time after you get  
>>> this figured out, instead of re-inventing RTF extraction and PDF  
>>> extraction (as you appear to be doing), have a look at Tika
>>> (http://lucene.apache.org/tika)
>>>
>>> On Jan 3, 2009, at 8:48 AM, Shashi Kant wrote:
>>>
>>>> Amin,
>>>>
>>>> Are you calling Close & Optimize after every addDocument?
>>>>
>>>> I would suggest something like this
>>>> try
>>>> {
>>>>    // this could be your loop through a data reader, for example
>>>>    while (...)
>>>>    {
>>>>         indexWriter.addDocument(document);
>>>>    }
>>>> }
>>>>
>>>> finally
>>>> {
>>>>    commitAndOptimise(indexWriter);
>>>> }
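
As a rough illustration of the pattern above (add every document inside the loop, then optimise and close once), here is a minimal sketch in plain Java. CountingWriter is a hypothetical stand-in that only counts calls; it is not Lucene's IndexWriter, and all names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an index writer: it only counts the calls it receives.
class CountingWriter {
    int added = 0;
    int optimizes = 0;
    int closes = 0;

    void addDocument(String doc) { added++; }
    void optimize() { optimizes++; }
    void close() { closes++; }
}

class BatchIndexSketch {
    // Add every document first, then optimize once and close once.
    static void indexAll(CountingWriter writer, List<String> docs) {
        try {
            for (String doc : docs) {
                writer.addDocument(doc);
            }
            // Only reached when every add succeeded.
            writer.optimize();
        } finally {
            // Always release the writer, even if an add failed.
            writer.close();
        }
    }

    public static void main(String[] args) {
        CountingWriter writer = new CountingWriter();
        indexAll(writer, Arrays.asList("doc one", "doc two", "doc three"));
        System.out.println(writer.added + " added, " + writer.optimizes
                + " optimize, " + writer.closes + " close");
    }
}
```

The expensive calls sit outside the loop, so their cost is paid once per batch rather than once per document. With a real writer the same shape should apply, but that is untested here.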
>>>>
>>>>
>>>> HTH
>>>>
>>>> Shashi
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: Amin Mohammed-Coleman <aminmc@gmail.com>
>>>> To: java-user@lucene.apache.org
>>>> Sent: Saturday, January 3, 2009 4:02:52 AM
>>>> Subject: Re: Search Problem
>>>>
>>>>
>>>> Hi again!
>>>>
>>>> I think I may have found the problem but I was wondering if you  
>>>> could verify:
>>>>
>>>> I have the following for my indexer:
>>>>
>>>> public void add(Document document) {
>>>>     IndexWriter indexWriter =
>>>>         IndexWriterFactory.createIndexWriter(getDirectory(), getAnalyzer());
>>>>     try {
>>>>         indexWriter.addDocument(document);
>>>>         LOGGER.debug("Added Document:" + document + " to index");
>>>>         commitAndOptimise(indexWriter);
>>>>     } catch (CorruptIndexException e) {
>>>>         throw new IllegalStateException(e);
>>>>     } catch (IOException e) {
>>>>         throw new IllegalStateException(e);
>>>>     }
>>>> }
>>>>
>>>> the commitAndOptimise(indexWriter) looks like this:
>>>>
>>>> private void commitAndOptimise(IndexWriter indexWriter)
>>>>         throws CorruptIndexException, IOException {
>>>>     LOGGER.debug("Committing document and closing index writer");
>>>>     indexWriter.optimize();
>>>>     indexWriter.commit();
>>>>     indexWriter.close();
>>>> }
>>>>
>>>> It seems as though if I comment out optimize then the overview  
>>>> tab in Luke  for the rtf document looks like:
>>>>
>>>> 5    id    1234
>>>> 3    body    document
>>>> 3    body    body
>>>> 1    body    test
>>>> 1    body    rtf
>>>> 1    name    rtfDocumentToIndex.rtf
>>>> 1    body    new
>>>> 1    path    rtfDocumentToIndex.rtf
>>>> 1    summary    This is a
>>>> 1    type    RTF_INDEXER
>>>> 1    body    content
>>>>
>>>>
>>>> This is more what I expected although "Amin Mohammed-Coleman"  
>>>> hasn't been stored in the index.  Should I not be using  
>>>> indexWriter.optimize() ?
>>>>
>>>> I tried using the search function in luke and got the following  
>>>> results:
>>>> body:test ---> returns result
>>>> body:document ---> no result
>>>> body:content ---> no result
>>>> body:rtf ----> returns result
>>>>
>>>>
>>>> Thanks again...sorry to be sending so many emails about this. I  
>>>> am in the process of designing and developing a prototype of a  
>>>> document and domain indexing/searching component and I would like  
>>>> to demo to the rest of my team.
>>>>
>>>>
>>>> Cheers
>>>> Amin
>>>>
>>>>
>>>>
>>>> On 3 Jan 2009, at 01:23, Erick Erickson wrote:
>>>>
>>>>> Well, your query results are consistent with what Luke is
>>>>> reporting. So I'd go back and test your assumptions. I
>>>>> suspect that you're not indexing what you think you are.
>>>>>
>>>>> For your test document, I'd just print out what you're indexing
>>>>> and the field it's going into, *for each field*. That is, every
>>>>> time you do a document.add(<field of some kind>), print out that
>>>>> data. I'm pretty sure you'll find that you're not getting what you
>>>>> expect. For instance, the call to:
>>>>>
>>>>> MetaDataEnum.BODY.getDescription()
>>>>>
>>>>> may be returning some nonsense. Or
>>>>> bodyText.trim()
>>>>>
>>>>> isn't doing what you expect.
>>>>>
>>>>> Lucene is used by many folks, and errors of the magnitude you're
>>>>> experiencing would be seen by many people and the user list would
>>>>> be flooded with complaints if it were a Lucene issue at root. That
>>>>> leaves the code you wrote as the most likely culprit. So try a
>>>>> very simple test case with lots of debugging println's. I'm pretty
>>>>> sure you'll find the underlying issue with some of your assumptions
>>>>> pretty quickly.
>>>>>
>>>>> Sorry I can't be more specific, but we'd have to see all of your
>>>>> code and the test cases to do that....
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman
>>>>> <aminmc@gmail.com> wrote:
>>>>>
>>>>>> Hi Erick
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I have used Luke to inspect the document and I am somewhat
>>>>>> confused.  For example, when I view the index using the overview
>>>>>> tab of Luke I get the following:
>>>>>>
>>>>>> 1       body    test
>>>>>> 1       id      1234
>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>> 1       summary This is a
>>>>>> 1       type    RTF_INDEXER
>>>>>> 1       body    rtf
>>>>>>
>>>>>>
>>>>>> However, when I view the document in the Document tab I get the
>>>>>> full text that was extracted from the rtf document (field:body),
>>>>>> which is:
>>>>>>
>>>>>> This is a test rtf document that will be indexed.
>>>>>> Amin Mohammed-Coleman
>>>>>>
>>>>>> I am using the StandardAnalyzer, therefore I wouldn't expect the
>>>>>> words document, indexed, Amin Mohammed-Coleman to be removed.
>>>>>>
>>>>>> I have referenced the Lucene In Action book and I can't see what I
>>>>>> may be doing wrong.  I would be happy to provide a testcase should
>>>>>> it be required.  When adding the body field to the document I am
>>>>>> doing:
>>>>>>
>>>>>>   Document document = new Document();
>>>>>>   Field field = new Field(FieldNameEnum.BODY.getDescription(),
>>>>>>       bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>   document.add(field);
>>>>>>
>>>>>>
>>>>>>
>>>>>> When I run the search code the string "test" is the only word that
>>>>>> returns a result (TopDocs), whereas the others do not (e.g. "amin",
>>>>>> "document", "indexed").
>>>>>>
>>>>>> Thanks again for your help and advice.
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Amin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>>>>>>
>>>>>>> Casing is usually handled by the analyzer. Since you construct
>>>>>>> the term query programmatically, it doesn't go through
>>>>>>> any analyzers, thus is not converted into lower case for
>>>>>>> searching as was done automatically for you when you
>>>>>>> indexed using StandardAnalyzer.
>>>>>>>
>>>>>>> As for why you aren't getting hits, it's unclear to me. But
>>>>>>> what I'd do is get a copy of Luke and examine your index
>>>>>>> to see what's *really* there. This will often give you clues,
>>>>>>> usually pointing to some kind of analyzer behavior that you
>>>>>>> weren't expecting.
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman
>>>>>>> <aminmc@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I have tried this and it doesn't work.  I don't understand why
>>>>>>>> using "amin" instead of "Amin" would work, is it not case
>>>>>>>> insensitive?
>>>>>>>>
>>>>>>>> I tried "test" for field "body" and this works.  Any other terms
>>>>>>>> don't work, for example:
>>>>>>>>
>>>>>>>> "document"
>>>>>>>> "indexed"
>>>>>>>>
>>>>>>>> these are tokens that were extracted when creating the lucene
>>>>>>>> document.
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> Amin
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>>>>>>
>>>>>>>>> Basically Lucene stores analyzed tokens, and looks up the
>>>>>>>>> matches based on the tokens.
>>>>>>>>> "Amin" after StandardAnalyzer is "amin", so you need to use
>>>>>>>>> new Term("body", "amin"), instead of new Term("body", "Amin"),
>>>>>>>>> to search.
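
To illustrate the point above, here is a minimal sketch in plain Java. It is not Lucene code: analyze() is a hypothetical, much simplified stand-in for StandardAnalyzer (it only splits on non-letters and lowercases, and does nothing about stop words), and termMatches() mimics a hand-built TermQuery by comparing the term exactly, with no analysis applied to it.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of an analyzed field; an assumption for illustration, not Lucene's API.
class TermLookupSketch {

    // Hypothetical analyzer: split on anything that is not a letter, lowercase each token.
    static Set<String> analyze(String text) {
        Set<String> tokens = new LinkedHashSet<String>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // The lookup is an exact match against stored tokens; the term is used as given.
    static boolean termMatches(Set<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }

    public static void main(String[] args) {
        Set<String> body = analyze(
                "This is a test rtf document that will be indexed.\nAmin Mohammed-Coleman");
        System.out.println("Amin -> " + termMatches(body, "Amin"));
        System.out.println("amin -> " + termMatches(body, "amin"));
    }
}
```

In this model the capitalised term misses because only the lowercased token was stored. Lowercasing the term by hand, or running the query text through the same analysis used at index time, keeps the two sides consistent.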
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chris Lu
>>>>>>>>> -------------------------
>>>>>>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>>>>>>> site: http://www.dbsight.net
>>>>>>>>> demo: http://search.dbsight.com
>>>>>>>>> Lucene Database Search in 3 minutes:
>>>>>>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>>>>>>> DBSight customer, a shopping comparison site, (anonymous per
>>>>>>>>> request) got 2.6 Million Euro funding!
>>>>>>>>>
>>>>>>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman
>>>>>>>>> <aminmc@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>>>>>>
>>>>>>>>>>> You need to let us know the analyzer you are using.
>>>>>>>>>>>
>>>>>>>>>>> -- Chris Lu
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman
>>>>>>>>>>> <aminmc@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have created a RTFHandler which takes a RTF file and
>>>>>>>>>>>>> creates a lucene Document which is indexed.  The
>>>>>>>>>>>>> RTFHandler looks something like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (bodyText != null) {
>>>>>>>>>>>>>     Document document = new Document();
>>>>>>>>>>>>>     Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>>>>>>         bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>>>>>>     document.add(field);
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am using Java's built-in RTF text extraction.  When I run
>>>>>>>>>>>>> my test to verify that the document contains the text that
>>>>>>>>>>>>> I expect, this works fine.  I get the following when I
>>>>>>>>>>>>> print the document:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>>>>>>>>> document that will be indexed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is that when I use the following to search I
>>>>>>>>>>>>> get no result:
>>>>>>>>>>>>>
>>>>>>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(
>>>>>>>>>>>>>         new Searchable[] {rtfIndexSearcher});
>>>>>>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>>>>>>> multiSearcher.close();
>>>>>>>>>>>>>
>>>>>>>>>>>>> rtfIndexSearcher is configured with the directory that
>>>>>>>>>>>>> holds rtf documents.  I have used Luke to look at the
>>>>>>>>>>>>> document and what I am finding in the overview tab is the
>>>>>>>>>>>>> following for the document:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1       body    test
>>>>>>>>>>>>> 1       id      1234
>>>>>>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>>>>>>> 1       summary This is a
>>>>>>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>>>>>>> 1       body    rtf
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would expect to get a hit using "Amin" or even
>>>>>>>>>>>>> "document".  I am not sure whether the line:
>>>>>>>>>>>>>
>>>>>>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>>>>>>
>>>>>>>>>>>>> is incorrect as I am not too sure of the meaning of "Finds
>>>>>>>>>>>>> the top n hits for query." for search (Query query, int n)
>>>>>>>>>>>>> according to java docs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would be grateful if someone may be able to advise on
>>>>>>>>>>>>> what I may be doing wrong.  I am using Lucene 2.4.0
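
On the meaning of n in search(Query query, int n): n caps how many of the best-scoring hits are returned, while TopDocs.totalHits still counts every matching document. The sketch below models that in plain Java under stated assumptions; TopNSketch and its fields are hypothetical names for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a "top n" result; an assumption for illustration only.
class TopNSketch {
    final int totalHits;          // every match, however many are returned
    final List<Float> topScores;  // at most n scores, highest first

    TopNSketch(int totalHits, List<Float> topScores) {
        this.totalHits = totalHits;
        this.topScores = topScores;
    }

    // Keep the n best scores but report the full match count.
    static TopNSketch search(List<Float> matchingScores, int n) {
        List<Float> sorted = new ArrayList<Float>(matchingScores);
        Collections.sort(sorted, Collections.<Float>reverseOrder());
        int keep = Math.max(0, Math.min(n, sorted.size()));
        List<Float> top = new ArrayList<Float>(sorted.subList(0, keep));
        return new TopNSketch(sorted.size(), top);
    }

    public static void main(String[] args) {
        TopNSketch result = search(Arrays.asList(0.2f, 0.9f, 0.5f), 1);
        System.out.println("totalHits=" + result.totalHits
                + ", returned=" + result.topScores.size());
    }
}
```

Read this way, asking for n = 1 would not by itself explain a totalHits of 0; a count of 0 suggests the term matched nothing at all, which points back at the term rather than at n.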
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Amin
>>>>>>>>>>>>>
>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>> --------------------------
>>> Grant Ingersoll
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ