jackrabbit-users mailing list archives

From "Danilo Barboza" <danilo.barb...@gmail.com>
Subject Problems searching HTML binary UTF-8 encoded
Date Mon, 25 Aug 2008 15:58:59 GMT
Hail!!

I am having a problem searching HTML content stored in a
jcr:content node with these properties:

jcr:mimeType = "text/html"
jcr:encoding = "UTF-8"
jcr:data = "<html><head></head><body> Some content with acute á á á </body></html>"

When I try to search using

//element(*, nt:resource)[jcr:contains(., "á")]

I get no results. All my strings are UTF-8 encoded, which is the JVM
default.

When I try to search using

//element(*, nt:resource)[jcr:contains(., "Ã¡")]

I get the expected result, but only with this Latin-1-converted string in
place of my UTF-8 "á" string.
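To illustrate what "latin-converted" means here, this is a minimal sketch (not the attached sample) of how the UTF-8 bytes of "á", when decoded as ISO-8859-1, turn into the two characters "Ã¡". The class and method names are made up for illustration, and `StandardCharsets` assumes Java 7 or later; on older JVMs `Charset.forName("UTF-8")` would be the equivalent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces the UTF-8 -> Latin-1 misdecoding.
class MojibakeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    static String misdecodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e1" is a-acute; the escape avoids depending on source file encoding.
        String garbled = misdecodeUtf8AsLatin1("\u00e1");
        // Print code points rather than glyphs, so console encoding does not matter.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.printf("U+%04X ", (int) garbled.charAt(i));
        }
        System.out.println();
    }
}
```

If the indexed text matches the garbled form rather than the original, that would suggest the HTML bytes were decoded with the wrong charset somewhere before indexing.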

I've written a simple sample demonstrating the problem (see attachment).

When you run the sample, you must set the default JVM encoding to UTF-8
by passing the -Dfile.encoding=UTF-8 argument to the JVM.
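As a rough sketch of why the default encoding matters (again, not the attached sample): `String.getBytes()` without arguments uses the JVM default charset, while passing the charset explicitly always yields the same bytes. The class and method names below are hypothetical, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows the JVM default charset and an explicit UTF-8 encoding.
class EncodingCheck {

    // Encodes text as UTF-8 regardless of the -Dfile.encoding setting.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Reports what the JVM is actually using as its default.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "\u00e1" is a-acute; print its UTF-8 bytes in hex.
        for (byte b : toUtf8("\u00e1")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Encoding the HTML explicitly like this before storing it in jcr:data would remove the dependency on the JVM argument when writing, although it says nothing about how the extractor decodes the bytes when reading.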

I have also tested with other binary content (such as MS Word DOC files),
and everything works fine there.

The sample code says more than I can explain.

Does anyone know why this occurs only with HTML binary content? Could it be
the HTMLTextExtractor?

Thanks,

Danilo Barboza
