manifoldcf-user mailing list archives

From Yossi Nachum <>
Subject Re: Problem with directories in hebrew and jcifs
Date Fri, 03 May 2013 11:49:50 GMT
That is working. I created a path field in my schema and use the "path
I have one problem: I don't see the name of the CIFS server, just the path
inside it.
I tried to use "Match Regexp" in the metadata tab with the following values:
Match regexp: "(.*)"
Replace string: "file:////server_name/$(1)"

but it did not work. Still seeing the path only.

What am I doing wrong? How can I add my server name to the path?
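
For reference, here is a minimal Java sketch of the result the mapping above is meant to produce: the whole path captured in group 1 and the server prefix put in front of it. It uses plain java.util.regex, where the group reference is written `$1`; it is not ManifoldCF's own mapping code, and it does not diagnose why the connector setting had no effect. The class name, method name, and sample path are hypothetical, and `server_name` is the placeholder from the message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathPrefixSketch {
    // Capture the whole path. The anchors make the pattern match exactly once.
    private static final Pattern WHOLE_PATH = Pattern.compile("^(.*)$");

    // Hypothetical helper: prepend "file:////<serverName>/" to the path.
    static String addServer(String path, String serverName) {
        // quoteReplacement guards against '$' or '\' in the server name;
        // "$1" is the java.util.regex reference to capture group 1.
        String replacement =
            Matcher.quoteReplacement("file:////" + serverName + "/") + "$1";
        return WHOLE_PATH.matcher(path).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        // Sample path is an assumption, not taken from the thread.
        System.out.println(addServer("share/dir/file.txt", "server_name"));
        // prints file:////server_name/share/dir/file.txt
    }
}
```

The pattern is anchored and applied with replaceFirst so that the prefix is added exactly once per path.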


On Wed, May 1, 2013 at 4:10 PM, Yossi Nachum <> wrote:

> Thanks I will try that
> On May 1, 2013 3:54 PM, "Karl Wright" <> wrote:
>> There is also a different way to do this entirely - there is a path
>> attribute you can send as metadata to Solr.  Just include the entire path,
>> and put it into a different field that you declare in your schema.  See
>> "path attribute" in the end-user documentation for the JCIFS connector.
>> On Wed, May 1, 2013 at 8:52 AM, Karl Wright <> wrote:
>>> IE 6 is extremely old and I believe we developed for IE 7 at a minimum
>>> (there were two different versions with different functionality we had to
>>> support there), and made further changes for IE 8 when it came out.  I have
>>> no idea what IE 9 or IE 10 do.
>>> The only way to change the encoding of the IRI is to modify the JCIFS
>>> connector code.  But please bear in mind that unless you can show your
>>> modifications will work across a wide variety of browsers, we are unlikely
>>> to accept these changes back into the code base.
>>> The alternative is, since the encoding IS deterministic and reversible,
>>> you could readily write a Tika plugin that would modify at least the URL
>>> field in the manner you desire.  But you could not modify the ID field
>>> since ManifoldCF uses this to delete documents that have disappeared.
>>> Karl
>>> On Wed, May 1, 2013 at 8:45 AM, Yossi Nachum <>wrote:
>>>> The IRI is not working in my IE. I am using an old version of IE, V6 SP3.
>>>> But what I really want is to display the correct name of the path with
>>>> Hebrew characters.
>>>> If I understand you right, then I need to change the representation of
>>>> the IRI. How can I do that?
>>>> On May 1, 2013 3:14 PM, "Karl Wright" <> wrote:
>>>>> Right, that is exactly what I would expect.
>>>>> ManifoldCF uses a URL (which is constructed by the connector) as the
>>>>> primary key for every document as indexed in the search engine.  The URL
>>>>> has two purposes: first, it is supposed to be unique, and second, it is
>>>>> supposed to allow someone who browses to that result to locate the
>>>>> document.  In the case of JCIFS, the environment is presumed to be the
>>>>> local active directory domain(s), and the "URL" generated is really an
>>>>> IRI, usually of the form "file://///server.domain/path/filename".  You
>>>>> should be able to paste the "URL" of the document from Solr into a browser
>>>>> on a machine in the domain, and see the document load.
>>>>> As I said before, however, there are already certain problems with
>>>>> this because each version of IE differs somewhat in how it deals with
>>>>> non-ASCII characters.  IRI legal character rules are somewhat different
>>>>> than URL rules, but IRI's are still nevertheless escaped in various ways.
>>>>> There are also multiple equivalent ways of representing the same file
>>>>> with different IRI's.
>>>>> It is not typical that the ID and URL fields of a document are
>>>>> presented to the user in any meaningful way, so your question is usually
>>>>> academic in most settings.  If you have a problem with the IRI's not
>>>>> actually working in a browser, that's of more immediate interest.  Please
>>>>> let us know if that's the case.
>>>>> Thanks,
>>>>> Karl
>>>>> On Wed, May 1, 2013 at 8:04 AM, Yossi Nachum <>wrote:
>>>>>> Thanks for your response
>>>>>> I am seeing these characters in solr when I search these files.
>>>>>> I am using the solr example site and these characters show up in
>>>>>> ID field and URL field.
>>>>>> BTW I am running solr and mcf on a linux server
>>>>>> On May 1, 2013 1:11 PM, "Karl Wright" <> wrote:
>>>>>>> Where are you seeing these characters?  Are you talking about the
>>>>>>> file IRI's that the JCIFS connector generates?  Those IRI's are
>>>>>>> supposed to be constructed so that your browser would find them if
>>>>>>> you paste them into the browser URL window.  Unfortunately, there is
>>>>>>> no good standard, so people follow IE's behavior, and IE has changed
>>>>>>> multiple times in how it deals with non-latin-1 characters.
>>>>>>> Please provide a bit more information so that we can provide a
>>>>>>> better answer.
>>>>>>> Karl
>>>>>>> On Wed, May 1, 2013 at 3:11 AM, Yossi Nachum <>wrote:
>>>>>>>> Hello,
>>>>>>>> I installed a search server with Solr and ManifoldCF.
>>>>>>>> I want to index my NetApp files over CIFS and I have a problem with
>>>>>>>> Hebrew files and directories.
>>>>>>>> When I search for these files in Solr I see "%D7%91%D7%..." instead
>>>>>>>> of the directory path that contains Hebrew characters.
>>>>>>>> I tried to run the Java process with "-Djcifs.encoding=cp1255" but it
>>>>>>>> didn't help.
>>>>>>>> Can anyone help and tell me how I can index directories/files in
>>>>>>>> Hebrew?
>>>>>>>> Thanks
>>>>>>>> Yossi
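
The "%D7%91%D7%..." sequences quoted in this thread are consistent with percent-encoded UTF-8 bytes (Hebrew letters encode as two bytes beginning with 0xD7), and Karl notes that the encoding is deterministic and reversible. A minimal Java sketch of reversing it for display follows, assuming the escapes really are UTF-8. The class and method names are hypothetical, and this is not ManifoldCF or Tika plugin code. It requires Java 10 or later for the Charset overload of URLDecoder.decode.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PercentDecodeSketch {
    // Hypothetical helper: turn %XX escapes back into characters,
    // assuming the escaped bytes are UTF-8.
    static String decode(String encoded) {
        // URLDecoder treats '+' as a space (form encoding), so protect any
        // literal '+' in the path before decoding.
        String protectedPlus = encoded.replace("+", "%2B");
        return URLDecoder.decode(protectedPlus, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "%D7%91" is the two-byte UTF-8 encoding of U+05D1 (Hebrew letter bet).
        String decoded = decode("%D7%91");
        // Print the code point rather than the glyph to avoid console
        // encoding issues.
        System.out.println(Integer.toHexString(decoded.codePointAt(0)));
    }
}
```

As stated in the thread, only a display field such as the URL or a separate path field should be rewritten this way; the ID field has to stay as ManifoldCF generated it, since ManifoldCF uses it to delete documents that have disappeared.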
