jackrabbit-users mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Re : Binary Content Search Problem...
Date Tue, 23 Oct 2007 09:55:29 GMT

Hello Patrick,

Could file 3 have replaced file 1 and file 2? Did you call session.save() after adding each
file?

Do I understand correctly that you now at least get a hit for  

/jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]

where you did not have this one before?
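
In case it helps with testing other keywords, here is a minimal sketch of building that statement in Java. The class and method names (ContainsQuery, build, escapeLiteral) are hypothetical, and the escaping only covers the XPath string literal (doubling single quotes); special characters of the full-text search expression itself are not handled here.

```java
// Hypothetical helper, not part of Jackrabbit or the JCR API.
class ContainsQuery {

    // Doubles single quotes so the keyword can sit inside an XPath string literal.
    // Assumption: this is the only escaping needed for the keywords being tested.
    static String escapeLiteral(String keyword) {
        return keyword.replace("'", "''");
    }

    // Builds a statement with the same shape as the query quoted above.
    static String build(String nodeType, String keyword) {
        return "/jcr:root//element(*, " + nodeType + ")[(jcr:contains(., '"
                + escapeLiteral(keyword) + "'))]";
    }

    public static void main(String[] args) {
        System.out.println(build("nt:resource", "MyKeyWord"));
    }
}
```

The resulting string would then be passed to QueryManager.createQuery(statement, Query.XPATH) on the workspace's query manager and executed as usual.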

Ard

> 
> Hi Ard,
> 
> Thanks for your answer, especially the part about the 
> logs. It made me realize that they were disabled. Shame 
> on me! ;-) Anyway, the logs showed me that some jars were 
> missing from the classpath.
> After fixing that, I re-created my repository with one 
> node to which I attached 3 files (that is, I created an 
> nt:file node with an nt:resource node for each attached file). 
> My files are:
> 1. jcr:data is set with a String, as you asked me to do. I put 
>    text/plain as the mime type (since the field is mandatory).
> 2. jcr:data is set with a stream on a simple text file 
>    (mime type: text/plain).
> 3. jcr:data is set with a stream on a Word document file 
>    (mime type: application/msword).
> 
> I created these nodes, and here are extracts of the logs that I 
> got related to indexing (note that there is no error in 
> the whole log file, only debug output).
> 
> file 1:
> DEBUG - persisting change log {#addedStates=15, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 172ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 31 ms.
> 
> file 2:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 79ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, text/plain, )
> DEBUG - onEvent: indexing finished in 0 ms.
> DEBUG - got EventStateCollection
> 
> file 3:
> DEBUG - persisting change log {#addedStates=11, #modifiedStates=1, #deletedStates=0, #modifiedRefs=0} took 125ms
> DEBUG - notifying 3 synchronous listeners.
> DEBUG - onEvent: indexing started
> DEBUG - extractText(stream, application/msword, )
> DEBUG - onEvent: indexing finished in 78 ms.
> DEBUG - got EventStateCollection
> 
> 
> Checking the state of the index with Luke, I could see 
> that file 3 (Word) was tokenized, but the content of 
> files 1 and 2 doesn't appear anywhere, even though the 
> respective properties and nodes do appear!
> Consequently, when I run the following XPath query:
> /jcr:root//element(*, nt:resource)[(jcr:contains(., 'MyKeyWord'))]
> 
> The only result is the Word Document...
> 
> What happened to the 2 other files?
> Maybe the mime type (text/plain) is wrong?
> Or did I forget to define something?
> Maybe I did something wrong in my filter definition, which is:
>    <param name="textFilterClasses" 
>     value="org.apache.jackrabbit.extractor.PlainTextExtractor,
>       org.apache.jackrabbit.extractor.MsWordTextExtractor,
>       org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>       org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>       org.apache.jackrabbit.extractor.PdfTextExtractor,
>       org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>       org.apache.jackrabbit.extractor.RTFTextExtractor,
>       org.apache.jackrabbit.extractor.HTMLTextExtractor,
>       org.apache.jackrabbit.extractor.XMLTextExtractor"/>
> 
> 
> I thought that 
> org.apache.jackrabbit.extractor.PlainTextExtractor could 
> handle simple text files.
> As you can see, it is getting better, but I still need a 
> little help ;-) so if you have any idea, don't hesitate.
> 
> Thank you in advance,
> BR
> Patrick
> 
> 
> 
> ----- Original Message ----
> From: Ard Schrijvers <a.schrijvers@hippo.nl>
> To: users@jackrabbit.apache.org; Patrick Wider <pat_wider@yahoo.fr>
> Sent: Monday, 22 October 2007, 14:59:53
> Subject: RE: Binary Content Search Problem...
> 
> Hello Patrick,
> 
> 
> > Patrick Wider wrote:
> > 
> > Of course the files do contain 'myKeyWord'. The text file 
> > contains it for sure, but in the Word document, 'myKeyWord' 
> > is wrapped in bold and italic styles. I don't think the styles 
> > cause any problems, but on the other hand I have no idea how the 
> > extractors work ;-) so it's just a guess.
> 
> To pinpoint the problem, what happens if:
> 
> 1) you search for a word that is not styled bold or italic?
> 2) you replace inputstr with "a string to test myKeyWord", 
> and then do the search again?
> 
> You might want to turn on logging for the indexing and 
> extractors; perhaps it reveals some problems. Furthermore, 
> you might want to open the most recently created index 
> folder with Luke [1] after adding a binary doc, and see whether 
> the binary data is present as tokens in the index.
> 
> Regards Ard
> 
> [1] http://www.getopt.org/luke/
> 
> >
> 
> 
>       
