manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Crawling and indexing very slow
Date Fri, 01 Aug 2014 22:44:22 GMT
I am surprised this works at all, since it uses Java file I/O to do this.

Karl

------------------------------
From: Ameya Aware
Sent: 8/1/2014 3:48 PM
To: user@manifoldcf.apache.org
Subject: Re: Crawling and indexing very slow

Hi Karl,

The changes above give great throughput when crawling the local system.

Next I tried running the job against a shared drive. I know it won't be as
fast as a local drive, but I wanted to see how things change with a shared
drive.

When I ran the job against the shared drive, it errored out with the message
"Error: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The
specified network name is no longer available.".

This is not a Solr error, because I could not find it in the Solr log.

I found the error in the MCF log. Please find the stack trace below:

ERROR 2014-08-01 14:41:17,531 (Worker thread '88') - Exception tossed: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO Error: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:417)
    at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:433)
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:565)
Caused by: java.nio.file.FileSystemException: \\devshare\devl\jneiper\Testcases for Galaxy.doc: The specified network name is no longer available.
    at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
    at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
    at sun.nio.fs.WindowsAclFileAttributeView.getFileSecurity(Unknown Source)
    at sun.nio.fs.WindowsAclFileAttributeView.getOwner(Unknown Source)
    at sun.nio.fs.FileOwnerAttributeViewImpl.getOwner(Unknown Source)
    at java.nio.file.Files.getOwner(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector.processDocuments(FileConnector.java:391)
    ... 2 more


Can you give me some suggestions?
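
The trace shows the exception coming out of Files.getOwner, called from the modified FileConnector at line 391. One possible way to keep a single failed owner lookup from failing the whole document is sketched below. This is only a sketch, assuming that a missing owner value is acceptable in the index; the class and method names (OwnerLookupSketch, ownerOrNull) are hypothetical and not part of ManifoldCF, and it does nothing about the underlying network dropout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.UserPrincipal;

// Hypothetical helper, not part of ManifoldCF.
class OwnerLookupSketch {

    // Returns the owner's name, or null when the lookup fails.
    // java.nio.file.FileSystemException (seen in the trace) is an IOException,
    // so it is covered by the catch clause below.
    static String ownerOrNull(Path path) {
        try {
            UserPrincipal owner = Files.getOwner(path);
            return owner == null ? null : owner.getName();
        } catch (IOException | UnsupportedOperationException e) {
            // Assumption: the caller would rather index without an owner
            // than abort; a real connector may prefer to log and retry.
            return null;
        }
    }

    public static void main(String[] args) {
        // A path that does not exist makes Files.getOwner throw an IOException.
        Path missing = Paths.get("no-such-file-" + System.nanoTime());
        System.out.println(ownerOrNull(missing)); // prints "null"
    }
}
```

The caller would then add the owner field only when the returned value is non-null.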

Thanks,
Ameya



On Thu, Jul 31, 2014 at 4:24 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Ameya,
>
> You cannot just comment out that line; instead you must supply an input
> stream.  But you can create an empty input stream, for example:
>
> data.setBinary(new ByteArrayInputStream(new byte[0]),0);
>
> Karl
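
A stream built this way reports end-of-stream on the very first read, which is why it can stand in for the file content when paired with a length of 0. A minimal, self-contained sketch follows; the class name EmptyStreamSketch and the helper emptyContent are hypothetical, and only the ByteArrayInputStream expression comes from the message above.

```java
import java.io.ByteArrayInputStream;

// Hypothetical demo class, not part of ManifoldCF.
class EmptyStreamSketch {

    // Builds a stream backed by a zero-length array: it holds no bytes.
    static ByteArrayInputStream emptyContent() {
        return new ByteArrayInputStream(new byte[0]);
    }

    public static void main(String[] args) {
        ByteArrayInputStream is = emptyContent();
        System.out.println(is.available()); // 0: nothing to read
        System.out.println(is.read());      // -1: end of stream right away

        // In the connector, the same stream is passed with a length of 0:
        //   data.setBinary(emptyContent(), 0);
    }
}
```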
>
>
> On Thu, Jul 31, 2014 at 4:22 PM, Ameya Aware <ameya.aware@gmail.com>
> wrote:
>
>> >>>>>>>>>>>>>>>>>>>>>>>>>>
>>                     long fileBytes = file.length();
>>                     RepositoryDocument data = new RepositoryDocument();
>>                     data.setBinary(is,fileBytes);
>>                     String fileName = file.getName();
>>                     data.setFileName(fileName);
>>                     data.setMimeType(mapExtensionToMimeType(fileName));
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>>
>> Do I just need to comment out the third line, i.e.
>> data.setBinary(is,fileBytes); ?
>>
>>
>> Thanks,
>> Ameya
>>
>>
>> On Thu, Jul 31, 2014 at 4:17 PM, Ameya Aware <ameya.aware@gmail.com>
>> wrote:
>>
>>> I could not exactly locate the position where this is happening.
>>>
>>> Can you please help me out with the changes?
>>>
>>> Thanks,
>>> Ameya
>>>
>>>
>>>
>>> On Thu, Jul 31, 2014 at 4:10 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Ameya,
>>>>
>>>> Since you are already modifying the connector for your purposes,
>>>> nothing is stopping you from modifying it further to not fetch the document
>>>> and instead substitute an empty input stream.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 3:03 PM, Ameya Aware <ameya.aware@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have modified the code a little to add metadata fields, as shown
>>>>> below (FileConnector.java):
>>>>>
>>>>>                     data.addField("created", new Date(attr.creationTime().toMillis()));
>>>>>                     data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
>>>>>                     data.addField("last_modified", new Date(file.lastModified()));
>>>>>                     data.addField("size", file.length());
>>>>>
>>>>>
>>>>> which are being passed to Solr.
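
The same four values can be collected without opening the file for reading. The sketch below assumes that attr in the snippet above is a java.nio.file.attribute.BasicFileAttributes instance (its declaration is not shown in the thread) and, as a deliberate variation, takes the modification time and size from that same object instead of from java.io.File. The class FileMetadataSketch and the method metadataFor are hypothetical names, and a plain Map stands in for the connector's document object.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ManifoldCF.
class FileMetadataSketch {

    // Reads the attributes in one call; the file content is never opened.
    static Map<String, Object> metadataFor(Path path) throws IOException {
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("created", new Date(attr.creationTime().toMillis()));
        fields.put("last_accessed", new Date(attr.lastAccessTime().toMillis()));
        fields.put("last_modified", new Date(attr.lastModifiedTime().toMillis()));
        fields.put("size", attr.size());
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a throwaway three-byte file.
        Path tmp = Files.createTempFile("mcf-metadata", ".bin");
        try {
            Files.write(tmp, new byte[] {1, 2, 3});
            System.out.println(metadataFor(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```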
>>>>>
>>>>> Now, can I stop MCF from reading the file and sending its content,
>>>>> and just pass the above information to Solr?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Ameya
>>>>>
>>>>>
>>>>> On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ameya,
>>>>>>
>>>>>> The file system connector does not retrieve any metadata for a
>>>>>> document at all.  So I'm not sure what metadata you are talking about.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The thing here is that I am not looking for the data or content of
>>>>>>> any of the files. I am just interested in the file metadata.
>>>>>>>
>>>>>>> So I thought it should be possible to not read any file, and just
>>>>>>> get the metadata of each file and give it to Solr.
>>>>>>>
>>>>>>> This should save lots of time.
>>>>>>>
>>>>>>> Is it possible to do this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ameya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Ameya,
>>>>>>>>
>>>>>>>> (1) Please look at the Simple History report.  Note what kinds of
>>>>>>>> documents are being fetched, what kinds are being indexed, and how
>>>>>>>> long it is taking.  I have noted from your previous posts that you
>>>>>>>> seem to be indexing a lot of very large EXE files.  This is useless
>>>>>>>> and you should be excluding them.
>>>>>>>>
>>>>>>>> (2) Please look in the manifoldcf.log file for evidence that
>>>>>>>> fetches and/or Solr indexing requests are being retried due to
>>>>>>>> errors.  It doesn't take many documents being chronically retried
>>>>>>>> before forward progress drops to near zero.
>>>>>>>>
>>>>>>>> (3) If you look into (1) & (2) and everything seems fine, it may be
>>>>>>>> a misalignment between availability of several kinds of resources
>>>>>>>> that is the problem.  Please get a thread dump of the agents process
>>>>>>>> while it is crawling, using jstack.  Post that thread dump and we can
>>>>>>>> tell you what to look at next.
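
jstack is a command-line tool that is pointed at the process ID of a running JVM. For readers who want to see what kind of information such a dump contains, the sketch below produces a comparable listing from inside a JVM using the standard java.lang.management API. It is not jstack and not what was asked for in step (3); it is an in-process illustration only, and the class name ThreadDumpSketch and method dumpAllThreads are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical demo class; an in-process analogue of a thread dump.
class ThreadDumpSketch {

    // Lists every live thread with its state and current stack frames.
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details to keep the output short.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

In a crawl that has stalled, the interesting part of such a listing is where the worker threads are waiting, for example whether most of them sit in the same fetch or indexing call.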
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am using the filesystem connector to index my entire C drive,
>>>>>>>>> with Solr as the output connector.
>>>>>>>>>
>>>>>>>>> The initial 100000 documents were crawled and indexed successfully
>>>>>>>>> in a couple of hours, but after that indexing slowed down badly
>>>>>>>>> (around 15-20 documents per minute).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not able to figure out whether the issue is with MCF or
>>>>>>>>> Solr.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can you advise me how to proceed with this?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ameya
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
