jackrabbit-users mailing list archives

From "Danilo Barboza" <danilo.barb...@gmail.com>
Subject Re: Problems searching HTML binary UTF-8 encoded
Date Thu, 28 Aug 2008 17:58:01 GMT
Thanks for the reply, Marcel. But my problem is a little more specific...

I have posted a JIRA issue
<http://issues.apache.org/jira/browse/JCR-1727> that illustrates my
problem more accurately.

So let's wait for a response...

On Tue, Aug 26, 2008 at 4:30 AM, Marcel Reutegger
<marcel.reutegger@gmx.net> wrote:

> Hi Danilo,
>
> this indicates that the default encoding of your platform is ISO-8859-1.
> See [1]. You should use [2] instead and specify "UTF-8".
>
> regards
>  marcel
>
>
> [1] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
> [2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes(java.lang.String)
>
> Danilo Barboza wrote:
> > Hail!!
> >
> > I am having some problems when trying to search over HTML content in a
> > jcr:content node with properties:
> >
> > jcr:mimeType = "text/html"
> > jcr:encoding = "UTF-8"
> > jcr:data = "<html><head></head><body> Some content with acute á á á
> > </body></html>"
> >
> > When I try to search using
> >
> > //element(*, nt:resource)[jcr:contains(., "á")]
> >
> > I receive no results... All my Strings are UTF-8 encoded, which is the
> > JVM default.
> >
> > When I try to search using
> >
> > //element(*, nt:resource)[jcr:contains(., "Ã¡")]
> >
> > I receive the expected result, but with this latin-converted string in
> > place of my "á" UTF-8 string.
> >
> > I've written a simple sample demonstrating the problem (see attachment).
> >
> > When you run the sample you must set the default JVM encoding to UTF-8
> > by passing the -Dfile.encoding=UTF-8 argument to the JVM.
> >
> > I also have tested with other binary content (like MSWord DOC) and
> > everything is going ok...
> >
> > The sample code says more than I can explain.
> >
> > Does anyone know why this occurs only with HTML binary content? Maybe the
> > HTMLTextExtractor?
> >
> > Thanks,
> >
> > Danilo Barboza
> >
>
>
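
For reference, the difference between [1] and [2] that Marcel points to can be sketched roughly as follows. This is only an illustration, not the attached sample: the class name EncodingSketch and its helper methods are hypothetical, and it only shows how UTF-8 bytes read back as ISO-8859-1 would turn "á" into a two-character string. Whether this is what actually happens inside the HTML text extraction is an assumption that the JIRA issue would have to confirm.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical illustration class; not part of Jackrabbit or the attached sample.
public class EncodingSketch {

    // Encode with an explicit charset name, as in [2], so the result does
    // not depend on the platform default (-Dfile.encoding).
    static byte[] encode(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simulate the suspected failure: UTF-8 bytes decoded as ISO-8859-1.
    static String misreadAsLatin1(String text) {
        try {
            return new String(encode(text, "UTF-8"), "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // a-acute written as an escape, so the source file encoding does not matter
        String acute = "\u00e1";
        System.out.println(encode(acute, "UTF-8").length);      // 2 bytes: 0xC3 0xA1
        System.out.println(encode(acute, "ISO-8859-1").length); // 1 byte: 0xE1
        // The misread string is U+00C3 followed by U+00A1
        System.out.println(misreadAsLatin1(acute).equals("\u00c3\u00a1"));
    }
}
```

If the stored jcr:data bytes are UTF-8 but get decoded as ISO-8859-1 somewhere on the way to the index, the indexed text would contain the two-character form, which would match the behaviour described in the original post.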
