jackrabbit-users mailing list archives

From Patrick Wider <pat_wi...@yahoo.fr>
Subject Re : Binary Content Search Problem...
Date Tue, 23 Oct 2007 09:27:59 GMT
Hi Ard,

Thanks for your answer, especially the part about the logs: it made me realize that
they were disabled... Shame on me! ;-)
Anyway, the logs showed me that some jars were missing from the classpath.
After fixing that, I re-created my repository with one node to which I attached 3 files
(that is, I created an nt:file node with an nt:resource node for each attached file).
My files are:
1. jcr:data is set from a String, as you asked me to do. I used text/plain
as the mime type (since the field is mandatory)
2. jcr:data is set from a stream on a simple text file (mime type: text/plain)
3. jcr:data is set from a stream on a Word document file (mime type: application/msword)

I created these nodes, and here are the extracts of the logs I got that relate to indexing (note
that there are no error entries in the whole log file, only debug):
file 1: 
DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0}
took 172ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, text/plain, )
DEBUG - onEvent: indexing finished in 31 ms.

file 2:
DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0}
took 79ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, text/plain, )
DEBUG - onEvent: indexing finished in 0 ms.
DEBUG - got EventStateCollection

file 3:
DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0}
took 125ms
DEBUG - notifying 3 synchronous listeners.
DEBUG - onEvent: indexing started
DEBUG - extractText(stream, application/msword, )
DEBUG - onEvent: indexing finished in 78 ms.
DEBUG - got EventStateCollection
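
To compare the three extractor calls side by side, the extractText lines can be pulled out of the log programmatically. The following is only a sketch: the class name ExtractTextLogSummary is made up for illustration, and it assumes that the third argument in the log line is the encoding (it is empty in all three extracts above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Jackrabbit): summarizes extractText(...) debug lines. */
final class ExtractTextLogSummary {

    // Matches lines such as: DEBUG - extractText(stream, text/plain, )
    // Group 1 is the mime type; group 2 is the third argument (assumed to be the encoding).
    private static final Pattern EXTRACT_CALL =
            Pattern.compile("extractText\\(stream,\\s*([^,\\s]*)\\s*,\\s*([^)]*?)\\s*\\)");

    /** Returns one "mimeType|encoding" entry per extractText call found in the log lines. */
    static List<String> summarize(List<String> logLines) {
        List<String> calls = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = EXTRACT_CALL.matcher(line);
            if (m.find()) {
                // An empty third argument is reported as "<none>" so it stands out.
                String encoding = m.group(2).isEmpty() ? "<none>" : m.group(2);
                calls.add(m.group(1) + "|" + encoding);
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "DEBUG - onEvent: indexing started",
                "DEBUG - extractText(stream, text/plain, )",
                "DEBUG - extractText(stream, application/msword, )");
        for (String call : summarize(log)) {
            System.out.println(call);
        }
    }
}
```

This only reports what was passed to the extractor; it says nothing about whether the extractor actually produced text.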


Checking the state of the index with Luke, I could see that file 3 (Word) was tokenized,
but the content of files 1 and 2 doesn't appear anywhere, even though the respective properties
and nodes do appear!
Consequently, when I run the following XPath query:
/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]

The only result is the Word Document...
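
For reference, here is a small sketch of how the statement above could be built for an arbitrary keyword. The class and method names are hypothetical, and the only escaping it does is doubling single quotes so that the keyword fits into the XPath string literal; the full-text syntax inside jcr:contains has its own special characters, which this sketch does not handle.

```java
/** Hypothetical helper (not part of Jackrabbit): builds the full-text XPath statement. */
final class ContainsQueryBuilder {

    /** Doubles single quotes so the keyword can sit inside an XPath string literal. */
    static String escapeLiteral(String keyword) {
        return keyword.replace("'", "''");
    }

    /** Returns the same statement as in the message, with the keyword substituted. */
    static String resourceContains(String keyword) {
        return "/jcr:root//element(*, nt:resource)[(jcr:contains(., '"
                + escapeLiteral(keyword) + "'))]";
    }

    public static void main(String[] args) {
        System.out.println(resourceContains("MyKeyWord"));
    }
}
```

The resulting string would then be handed to QueryManager.createQuery(statement, Query.XPATH) from the JCR API.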

What happened to the 2 other files?
Maybe the mime type (text/plain) is wrong?
Or did I forget to define something?
Maybe I did something wrong in my filter definition, which is:
   <param name="textFilterClasses" 
    value="org.apache.jackrabbit.extractor.PlainTextExtractor,
      org.apache.jackrabbit.extractor.MsWordTextExtractor,
      org.apache.jackrabbit.extractor.MsExcelTextExtractor,
      org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
      org.apache.jackrabbit.extractor.PdfTextExtractor,
      org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
      org.apache.jackrabbit.extractor.RTFTextExtractor,
      org.apache.jackrabbit.extractor.HTMLTextExtractor,
      org.apache.jackrabbit.extractor.XMLTextExtractor"/>


I thought that org.apache.jackrabbit.extractor.PlainTextExtractor could handle simple text
files... 
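
Since missing jars were the first problem, one quick sanity check is to verify that every class named in textFilterClasses can actually be loaded. The sketch below is only an illustration: the class name ExtractorClasspathCheck is made up, and it assumes the parameter value is a comma-separated list with optional whitespace and line breaks. It checks the extractor classes themselves; a missing third-party library that an extractor only needs later, at extraction time, may not show up here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Jackrabbit): finds configured classes that cannot be loaded. */
final class ExtractorClasspathCheck {

    /** Splits a comma-separated class list, ignoring surrounding whitespace and line breaks. */
    static List<String> parseClassNames(String paramValue) {
        List<String> names = new ArrayList<>();
        for (String part : paramValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Returns the names that cannot be loaded or linked on the current classpath. */
    static List<String> findMissing(List<String> classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class.forName(name);
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers NoClassDefFoundError raised while loading the class.
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String value = "org.apache.jackrabbit.extractor.PlainTextExtractor,\n"
                + "      org.apache.jackrabbit.extractor.MsWordTextExtractor";
        for (String name : findMissing(parseClassNames(value))) {
            System.out.println("cannot load: " + name);
        }
    }
}
```

Run with the same classpath as the repository, an empty result would at least rule out a typo in the class names or a missing extractor jar.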
As you can see, it is getting better, but I still need a little help ;-) so if you have any
ideas, don't hesitate.

Thank you in advance,
BR
Patrick



----- Original Message ----
From: Ard Schrijvers <a.schrijvers@hippo.nl>
To: users@jackrabbit.apache.org; Patrick Wider <pat_wider@yahoo.fr>
Sent: Monday, 22 October 2007, 14:59:53
Subject: RE: Binary Content Search Problem...

Hello Patrick,


> Patrick Wider wrote:
> 
> Of course the files do contain 'myKeyWord'... the text 
> file contains it for sure, but in the Document, 'myKeyWord' 
> is wrapped in bold and italic styles. But I don't think the 
> styles cause any problems... on the other hand, I have no 
> idea how the extractors work ;-) it's just a guess....

Just to pinpoint the problem, what happens if:

1) you search for a word that is not in bold or italic style?
2) you replace inputstr with "a string to test myKeyWord", and then
do the search again?

You might want to turn on logging for the indexing and the extractors;
perhaps the logs will reveal some problems. Furthermore, you might want to take a
look with Luke [1] at the most recently created index folder after adding a binary doc,
and see whether the binary data is present as tokens in the index.

Regards Ard

[1] http://www.getopt.org/luke/
