manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with directories in hebrew and jcifs
Date Wed, 01 May 2013 12:52:07 GMT
IE 6 is extremely old and I believe we developed for IE 7 at a minimum
(there were two different versions with different functionality we had to
support there), and made further changes for IE 8 when it came out.  I have
no idea what IE 9 or IE 10 do.

The only way to change the encoding of the IRI is to modify the JCIFS
connector code.  But please bear in mind that unless you can show your
modifications will work across a wide variety of browsers, we are unlikely
to accept these changes back into the code base.

The alternative is, since the encoding IS deterministic and reversible, you
could readily write a Tika plugin that would modify at least the URL field
in the manner you desire.  But you could not modify the ID field since
ManifoldCF uses this to delete documents that have disappeared.
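
As a rough illustration of that reversibility, here is a minimal Java sketch
of the decoding step only. It assumes the %XX escapes are percent-encoded
UTF-8 bytes, which is what the "%D7%91%D7%..." sample in this thread suggests.
The class name, method name, and sample path are hypothetical and are not part
of ManifoldCF, JCIFS, or Tika; wiring this into a Tika plugin is not shown.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not ManifoldCF, JCIFS, or Tika code.
class IriDisplaySketch {

    // Decodes %XX escapes into readable text, ASSUMING the escaped bytes are UTF-8.
    static String decodeForDisplay(String encoded) {
        // URLDecoder follows form-encoding rules and would turn '+' into a
        // space, so escape literal plus signs first to keep them intact.
        String safe = encoded.replace("+", "%2B");
        try {
            return URLDecoder.decode(safe, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample path; the escaped segment is five Hebrew letters.
        String url = "file://///server.domain/%D7%91%D7%93%D7%99%D7%A7%D7%94/file.txt";
        System.out.println(decodeForDisplay(url));
    }
}
```

A decoded value like this is only suitable for a display copy of the URL
field; the ID field has to stay exactly as the connector produced it. Note
that URLDecoder throws IllegalArgumentException on malformed escapes, so a
real plugin would probably want to fall back to the original string in that
case.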

Karl



On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <nachum234@gmail.com> wrote:

> The IRI does not work in my IE. I am using an old version of IE, V6 SP3.
> But what I really want is to display the correct name of the path with
> Hebrew characters.
> If I understand you correctly, I need to change the representation of the
> IRI. How can I do that?
> On May 1, 2013 3:14 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>
>> Right, that is exactly what I would expect.
>>
>> ManifoldCF uses a URL (which is constructed by the connector) as the
>> primary key for every document as indexed in the search engine.  The URL
>> has two purposes: first, it is supposed to be unique, and second, it is
>> supposed to allow someone who browses to that result to locate the
>> document.  In the case of JCIFS, the environment is presumed to be the
>> local active directory domain(s), and the "URL" generated is really a file
>> IRI, usually of the form "file://///server.domain/path/filename".  You thus
>> should be able to paste the "URL" of the document from Solr into a browser
>> on a machine in the domain, and see the document load.
>>
>> As I said before, however, there are already certain problems with this
>> because each version of IE differs somewhat in how it deals with non-ASCII
>> characters.  IRI legal character rules are somewhat different from URL
>> rules, but IRIs are nevertheless still escaped in various ways.  There are
>> also multiple equivalent ways of representing the same file path with
>> different IRIs.
>>
>> It is not typical that the ID and URL fields of a document are presented
>> to the user in any meaningful way, so your question is academic in most
>> settings.  If you have a problem with the IRIs not actually working
>> in a browser, that's of more immediate interest.  Please let us know if
>> that's the case.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>
>>> Thanks for your response.
>>> I am seeing these characters in Solr when I search for these files.
>>> I am using the Solr example site, and these characters show up in the ID
>>> field and the URL field.
>>> BTW, I am running Solr and ManifoldCF on a Linux server.
>>>  On May 1, 2013 1:11 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>
>>>> Where are you seeing these characters?  Are you talking about the file
>>>> IRIs that the JCIFS connector generates?  Those IRIs are supposed to be
>>>> constructed so that your browser would find them if you paste them into the
>>>> browser URL window.  Unfortunately, there is no good standard, and people
>>>> follow IE's behavior, and IE has changed multiple times in how it deals
>>>> with non-Latin-1 characters.
>>>>
>>>> Please provide a bit more information so that we can provide a better
>>>> answer.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> I installed a search server with Solr and ManifoldCF.
>>>>> I want to index my NetApp files over CIFS, and I have a problem with
>>>>> Hebrew files and directories.
>>>>> When I search for these files in Solr I see "%D7%91%D7%..." instead of
>>>>> the directory path that contains Hebrew characters.
>>>>> I tried running the Java process with "-Djcifs.encoding=cp1255" but it
>>>>> didn't help.
>>>>> Can anyone help and tell me how I can index directories/files in
>>>>> Hebrew?
>>>>>
>>>>> Thanks
>>>>> Yossi
>>>>>
>>>>
>>>>
>>
