lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Search Problem
Date Sat, 03 Jan 2009 01:23:42 GMT
Well, your query results are consistent with what Luke is
reporting. So I'd go back and test your assumptions. I
suspect that you're not indexing what you think you are.

For your test document, I'd just print out what you're indexing
and the field it's going into, *for each field*. That is, every time you
do a document.add(<field of some kind>), print out that data. I'm
pretty sure you'll find that you're not getting what you expect. For
instance, the call to:

MetaDataEnum.BODY.getDescription()

may be returning some nonsense. Or

bodyText.trim()

isn't doing what you expect.
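
A minimal sketch of that kind of print-out, in plain Java with no Lucene dependency. FieldTrace, describe() and trace() are hypothetical names made up for illustration and are not part of Lucene; the sample strings are taken from the test document quoted below. The idea is only to wrap each value on its way into a field so that empty, null, or oddly padded values become visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: prints each field name/value
// pair just before it would be added to the Document.
class FieldTrace {
    private final List<String> lines = new ArrayList<String>();

    // Formats one field so that padding and empty values are easy to spot.
    static String describe(String name, String value) {
        if (value == null) {
            return name + " = <null>";
        }
        return name + " = [" + value + "] (" + value.length() + " chars)";
    }

    // Prints the field and returns the value unchanged, so the call can wrap
    // the argument that is handed to the Field constructor.
    String trace(String name, String value) {
        String line = describe(name, value);
        lines.add(line);
        System.out.println("indexing " + line);
        return value;
    }

    // Everything traced so far, in order.
    List<String> lines() {
        return lines;
    }

    public static void main(String[] args) {
        FieldTrace t = new FieldTrace();
        String bodyText = "  This is a test rtf document that will be indexed.\n\nAmin Mohammed-Coleman  ";
        t.trace("body", bodyText.trim());
        t.trace("name", "rtfDocumentToIndex.rtf");
    }
}
```

In the real handler the same wrapping would go around both the field name and the trimmed body text, so that each document.add(...) is preceded by one line of output.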

Lucene is used by many folks, and errors of the magnitude you're
experiencing would be seen by many people and the user list would
be flooded with complaints if it were a Lucene issue at root. That
leaves the code you wrote as the most likely culprit. So try a very simple
test case with lots of debugging println's. I'm pretty sure you'll
find the underlying issue with some of your assumptions pretty quickly.
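
As a rough illustration of how small such a test case can be, here is a toy model in plain Java. It is NOT Lucene and does not reproduce StandardAnalyzer: ToyIndex is a hypothetical class that simply splits on non-alphanumeric characters, lowercases each token, prints it, and then does exact-match lookups. It only demonstrates the general point made earlier in the thread, that a term lookup is an exact match against whatever tokens were actually stored.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: not Lucene, and not a reimplementation of StandardAnalyzer.
class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Splits the text on non-alphanumeric characters, lowercases each token,
    // prints it, and records it for the given document.
    void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // leading separators produce an empty first token
            }
            String term = token.toLowerCase(Locale.ROOT);
            System.out.println("doc " + docId + " term: " + term);
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new HashSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Exact-match lookup: the query term is NOT lowercased or analyzed.
    int hits(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "This is a test rtf document that will be indexed.\n\nAmin Mohammed-Coleman");
        System.out.println("hits for Amin: " + index.hits("Amin"));
        System.out.println("hits for amin: " + index.hits("amin"));
    }
}
```

In this toy, the lookup for "Amin" finds nothing while "amin" finds the document, because only lowercased terms were stored. If a real test case prints fewer terms than you expect at indexing time, the problem is upstream of the search.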

Sorry I can't be more specific, but we'd have to see all of your code
and the test cases to do that....

Best
Erick

On Fri, Jan 2, 2009 at 6:13 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:

> Hi Erick
>
> Thanks for your reply.
>
> I have used Luke to inspect the document and I am somewhat confused.  For
> example when I view the index using the overview tab of Luke I get the
> following:
>
> 1       body    test
> 1       id      1234
> 1       name    rtfDocumentToIndex.rtf
> 1       path    rtfDocumentToIndex.rtf
> 1       summary This is a
> 1       type    RTF_INDEXER
> 1       body    rtf
>
>
> However when I view the document in the Document tab I get the full text
> that was extracted from the rtf document (field:body) which is:
>
> This is a test rtf document that will be indexed.
> Amin Mohammed-Coleman
>
> I am using the StandardAnalyzer, therefore I wouldn't expect the words
> document, indexed, Amin Mohammed-Coleman to be removed.
>
> I have referenced the Lucene In Action book and I can't see what I may be
> doing wrong.  I would be happy to provide a testcase should it be required.
>  When adding the body field to the document I am doing:
>
>        Document document = new Document();
>        Field field = new Field(FieldNameEnum.BODY.getDescription(),
>                bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>        document.add(field);
>
>
>
> When I run the search code the string "test" is the only word that returns
> a result (TopDocs), whereas the others do not (e.g. "amin", "document",
> "indexed").
>
> Thanks again for your help and advice.
>
>
> Cheers
> Amin
>
>
>
>
> On 2 Jan 2009, at 21:20, Erick Erickson wrote:
>
>> Casing is usually handled by the analyzer. Since you construct
>> the term query programmatically, it doesn't go through
>> any analyzers, thus is not converted into lower case for
>> searching as was done automatically for you when you
>> indexed using StandardAnalyzer.
>>
>> As for why you aren't getting hits, it's unclear to me. But
>> what I'd do is get a copy of Luke and examine your index
>> to see what's *really* there. This will often give you clues,
>> usually pointing to some kind of analyzer behavior that you
>> weren't expecting.
>>
>> Best
>> Erick
>>
>> On Fri, Jan 2, 2009 at 6:39 AM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I have tried this and it doesn't work.  I don't understand why using "amin"
>>> instead of "Amin" would work. Is it not case insensitive?
>>>
>>> I tried "test" for field "body" and this works.  Any other terms don't work,
>>> for example:
>>>
>>> "document"
>>> "indexed"
>>>
>>> These are tokens that were extracted when creating the Lucene document.
>>>
>>>
>>> Thanks for your reply.
>>>
>>> Cheers
>>>
>>> Amin
>>>
>>>
>>> On 2 Jan 2009, at 10:36, Chris Lu wrote:
>>>
>>>> Basically Lucene stores analyzed tokens, and looks up the matches based
>>>> on the tokens.
>>>> "Amin" after StandardAnalyzer is "amin", so you need to use
>>>> new Term("body", "amin"), instead of new Term("body", "Amin"), to search.
>>>>
>>>> --
>>>> Chris Lu
>>>> -------------------------
>>>> Instant Scalable Full-Text Search On Any Database/Application
>>>> site: http://www.dbsight.net
>>>> demo: http://search.dbsight.com
>>>> Lucene Database Search in 3 minutes:
>>>> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
>>>> DBSight customer, a shopping comparison site, (anonymous per request) got
>>>> 2.6 Million Euro funding!
>>>>
>>>> On Thu, Jan 1, 2009 at 11:30 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Sorry I was using the StandardAnalyzer in this instance.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2 Jan 2009, at 00:55, Chris Lu wrote:
>>>>>
>>>>>> You need to let us know the analyzer you are using.
>>>>>>
>>>>>> -- Chris Lu
>>>>>>
>>>>>> On Thu, Jan 1, 2009 at 1:11 PM, Amin Mohammed-Coleman <aminmc@gmail.com> wrote:
>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I have created an RTFHandler which takes an RTF file and creates a
>>>>>>>> Lucene Document which is indexed.  The RTFHandler looks something like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> if (bodyText != null) {
>>>>>>>>     Document document = new Document();
>>>>>>>>     Field field = new Field(MetaDataEnum.BODY.getDescription(),
>>>>>>>>             bodyText.trim(), Field.Store.YES, Field.Index.ANALYZED);
>>>>>>>>     document.add(field);
>>>>>>>> }
>>>>>>>>
>>>>>>>> I am using Java's built-in RTF text extraction.  When I run my test to
>>>>>>>> verify that the document contains the text that I expect, this works fine.
>>>>>>>> I get the following when I print the document:
>>>>>>>>
>>>>>>>> Document<stored/uncompressed,indexed,tokenized<body:This is a test rtf
>>>>>>>> document that will be indexed.
>>>>>>>>
>>>>>>>> Amin Mohammed-Coleman>
>>>>>>>> stored/uncompressed,indexed<path:rtfDocumentToIndex.rtf>
>>>>>>>> stored/uncompressed,indexed<name:rtfDocumentToIndex.rtf>
>>>>>>>> stored/uncompressed,indexed<type:RTF_INDEXER>
>>>>>>>> stored/uncompressed,indexed<summary:This is a >>
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem is that when I use the following to search, I get no result:
>>>>>>>>
>>>>>>>> MultiSearcher multiSearcher = new MultiSearcher(
>>>>>>>>         new Searchable[] {rtfIndexSearcher});
>>>>>>>> Term t = new Term("body", "Amin");
>>>>>>>> TermQuery termQuery = new TermQuery(t);
>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>> System.out.println(topDocs.totalHits);
>>>>>>>> multiSearcher.close();
>>>>>>>>
>>>>>>>> rtfIndexSearcher is configured with the directory that holds the rtf
>>>>>>>> documents.  I have used Luke to look at the document, and what I am finding
>>>>>>>> in the overview tab is the following for the document:
>>>>>>>>
>>>>>>>> 1       body    test
>>>>>>>> 1       id      1234
>>>>>>>> 1       name    rtfDocumentToIndex.rtf
>>>>>>>> 1       path    rtfDocumentToIndex.rtf
>>>>>>>> 1       summary This is a
>>>>>>>> 1       type    RTF_INDEXER
>>>>>>>> 1       body    rtf
>>>>>>>>
>>>>>>>>
>>>>>>>> However on the Document tab I am getting (in the body field):
>>>>>>>>
>>>>>>>> This is a test rtf document that will be indexed.
>>>>>>>>
>>>>>>>> Amin Mohammed-Coleman
>>>>>>>>
>>>>>>>>
>>>>>>>> I would expect to get a hit using "Amin" or even "document".  I am not
>>>>>>>> sure whether the line:
>>>>>>>> TopDocs topDocs = multiSearcher.search(termQuery, 1);
>>>>>>>>
>>>>>>>> is incorrect, as I am not too sure of the meaning of "Finds the top n
>>>>>>>> hits for query." for search(Query query, int n) according to the javadocs.
>>>>>>>>
>>>>>>>> I would be grateful if someone could advise on what I may be
>>>>>>>> doing wrong.  I am using Lucene 2.4.0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Amin
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
