manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Problem with directories in hebrew and jcifs
Date Wed, 01 May 2013 12:53:45 GMT
There is also a different way to do this entirely - there is a path
attribute you can send as metadata to Solr.  Just include the entire path,
and put it into a different field that you declare in your schema.  See
"path attribute" in the end-user documentation for the JCIFS connector.

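As a rough illustration of why a separate, human-readable field helps: the "%D7%91%D7%..." strings reported further down the thread look like the UTF-8 bytes of the Hebrew names, percent-encoded. The Python sketch below assumes exactly that (UTF-8 percent-encoding). The function name, server name, and folder name are made up for illustration and are not part of ManifoldCF or JCIFS. It decodes such a value for display only; the ID field itself should stay untouched, as noted later in the thread.

```python
from urllib.parse import unquote


def decode_iri(iri: str) -> str:
    """Percent-decode an IRI for display, assuming the escapes are UTF-8 bytes."""
    # errors="strict" raises UnicodeDecodeError if the bytes are not valid
    # UTF-8, instead of silently substituting replacement characters.
    return unquote(iri, encoding="utf-8", errors="strict")


# Hypothetical value: "server.domain", "share" and the folder name are
# illustrative only, not taken from the original report.
encoded = "file://///server.domain/share/%D7%91%D7%93%D7%99%D7%A7%D7%94/report.doc"
print(decode_iri(encoded))

# The first Hebrew letter of the folder name is U+05D1, whose UTF-8 bytes
# are D7 91, which matches the "%D7%91" prefix quoted in the thread.
print("\u05d1".encode("utf-8"))
```

A decoded value like this would only suit a display field; anything used as a key has to keep the encoded form.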


On Wed, May 1, 2013 at 8:52 AM, Karl Wright <daddywri@gmail.com> wrote:

> IE 6 is extremely old and I believe we developed for IE 7 at a minimum
> (there were two different versions with different functionality we had to
> support there), and made further changes for IE 8 when it came out.  I have
> no idea what IE 9 or IE 10 do.
>
> The only way to change the encoding of the IRI is to modify the JCIFS
> connector code.  But please bear in mind that unless you can show your
> modifications will work across a wide variety of browsers, we are unlikely
> to accept these changes back into the code base.
>
> The alternative is, since the encoding IS deterministic and reversible,
> you could readily write a Tika plugin that would modify at least the URL
> field in the manner you desire.  But you could not modify the ID field
> since ManifoldCF uses this to delete documents that have disappeared.
>
> Karl
>
>
>
> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>
>> The IRI is not working in my IE. I am using an old version of IE, V6 SP3.
>> But what I really want is to display the correct name of the path with
>> Hebrew characters.
>> If I understand you correctly, I need to change the representation of
>> the IRI. How can I do that?
>> On May 1, 2013 3:14 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>
>>> Right, that is exactly what I would expect.
>>>
>>> ManifoldCF uses a URL (which is constructed by the connector) as the
>>> primary key for every document as indexed in the search engine.  The URL
>>> has two purposes: first, it is supposed to be unique, and second, it is
>>> supposed to allow someone who browses to that result to locate the
>>> document.  In the case of JCIFS, the environment is presumed to be the
>>> local active directory domain(s), and the "URL" generated is really a file
>>> IRI, usually of the form "file://///server.domain/path/filename".  You thus
>>> should be able to paste the "URL" of the document from Solr into a browser
>>> on a machine in the domain, and see the document load.
>>>
>>> As I said before, however, there are already certain problems with this
>>> because each version of IE differs somewhat in how it deals with non-ASCII
>>> characters.  IRI legal character rules are somewhat different than URL
>>> rules, but IRI's are still nevertheless escaped in various ways.  There are
>>> also multiple equivalent ways of representing the same file path with
>>> different IRI's.
>>>
>>> It is not typical that the ID and URL fields of a document are presented
>>> to the user in any meaningful way, so your question is academic in most
>>> settings.  If you have a problem with the IRI's not actually working
>>> in a browser, that's of more immediate interest.  Please let us know if
>>> that's the case.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>> I am seeing these characters in Solr when I search for these files.
>>>> I am using the Solr example site and these characters show up in the ID
>>>> field and URL field.
>>>> BTW, I am running Solr and MCF on a Linux server.
>>>> On May 1, 2013 1:11 PM, "Karl Wright" <daddywri@gmail.com> wrote:
>>>>
>>>>> Where are you seeing these characters?  Are you talking about the file
>>>>> IRI's that the JCIFS connector generates?  Those IRI's are supposed to be
>>>>> constructed so that your browser would find them if you paste them into the
>>>>> browser URL window.  Unfortunately, there is no good standard, and people
>>>>> follow IE's behavior, and IE has changed multiple times in how it deals
>>>>> with non-latin-1 characters.
>>>>>
>>>>> Please provide a bit more information so that we can provide a better
>>>>> answer.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <nachum234@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>> I want to index my NetApp files over CIFS and I have a problem with
>>>>>> Hebrew files and directories.
>>>>>> When I search for these files in Solr I see "%D7%91%D7%..." instead
>>>>>> of the directory path that contains Hebrew characters.
>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255" but it
>>>>>> didn't help.
>>>>>> Can anyone help and tell me how I can index directories/files in
>>>>>> Hebrew?
>>>>>>
>>>>>> Thanks
>>>>>> Yossi
>>>>>>
>>>>>
>>>>>
>>>
>
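
The file IRI form described in the thread ("file://///server.domain/path/filename") and the remark that the encoding is deterministic and reversible can be illustrated with a small round trip. This is only a sketch in Python, not the JCIFS connector's actual code: the function names, the server name, and the choice of UTF-8 percent-encoding with "/" left unescaped are all assumptions made for the example.

```python
from typing import Tuple
from urllib.parse import quote, unquote

# Prefix taken from the form quoted in the thread.
PREFIX = "file://///"


def to_file_iri(server: str, path: str) -> str:
    """Build a file IRI, percent-encoding the path as UTF-8 (assumed scheme)."""
    # safe="/" keeps path separators readable; non-ASCII characters and
    # reserved characters are escaped as %XX sequences.
    return PREFIX + server + "/" + quote(path, safe="/", encoding="utf-8")


def from_file_iri(iri: str) -> Tuple[str, str]:
    """Reverse to_file_iri: return (server, decoded path)."""
    if not iri.startswith(PREFIX):
        raise ValueError("not a file IRI of the assumed form")
    server, _, encoded_path = iri[len(PREFIX):].partition("/")
    return server, unquote(encoded_path, encoding="utf-8", errors="strict")


# Hypothetical server and path, for illustration only.
iri = to_file_iri("server.domain", "share/\u05d1\u05d3/a.txt")
print(iri)
print(from_file_iri(iri))
```

Because the same input always yields the same escaped string and the escapes decode back to the original path, a downstream step could derive a readable display value from the URL field without touching the ID that ManifoldCF relies on for deletions.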
