manifoldcf-dev mailing list archives

From Alessandro Benedetti <benedetti.ale...@gmail.com>
Subject Re: [Windows Shares Connector] Un-expected removal of all documents
Date Thu, 02 Apr 2015 14:42:41 GMT
OK, I am currently working on that, and I will continue next Tuesday as
well.
But what about point 2:
"(2) the check itself is
specific to the ROOT of the tree, which the user may not have access to."

I think I understand the problem: the repository connector can be
configured with a user who cannot access the root but can access the
directories we want to crawl.
In that case the repository connector will appear unable to connect, while
crawling is still possible if the accessible directories are configured in
the job.
If this is correct, the situation is more complicated ...
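
To make the scenario concrete, here is a minimal sketch of a check that probes the job's configured start paths in addition to the root. Everything in it is an assumption for illustration: the StartPathCheck class, its method names, and the example paths are hypothetical, and the probe is simulated with a predicate. In the connector, the probe would presumably wrap something like the exists() call from my snippet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class StartPathCheck {
    // Returns the configured start paths that the probe reports as reachable.
    static List<String> reachable(List<String> startPaths, Predicate<String> probe) {
        List<String> ok = new ArrayList<>();
        for (String path : startPaths) {
            if (probe.test(path)) {
                ok.add(path);
            }
        }
        return ok;
    }

    // The connection counts as usable when the root OR at least one start path is reachable.
    static boolean usableForCrawl(String root, List<String> startPaths, Predicate<String> probe) {
        return probe.test(root) || !reachable(startPaths, probe).isEmpty();
    }

    public static void main(String[] args) {
        // Simulated share: the user can read the docs directory but not the root.
        Predicate<String> probe = path -> path.startsWith("smb://server/share/docs");
        List<String> startPaths = Arrays.asList("smb://server/share/docs/", "smb://server/share/hr/");

        System.out.println("root reachable:   " + probe.test("smb://server/"));
        System.out.println("reachable paths:  " + reachable(startPaths, probe));
        System.out.println("usable for crawl: " + usableForCrawl("smb://server/", startPaths, probe));
    }
}
```

A root-only check would report this connection as broken, while the per-path check still finds one crawlable directory. Each extra probe costs a round trip to the server, so this only makes sense if it runs once per job rather than per document.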

Cheers


2015-03-31 16:44 GMT+01:00 Karl Wright <daddywri@gmail.com>:

> Hi Alessandro,
>
> Your code snippet has two problems: (1) it doesn't distinguish between
> service interruptions and bad credentials,


Shouldn't that be the difference between the IOException and the
SmbException?
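
A rough sketch of what I have in mind follows. In jCIFS, SmbException is itself a subclass of IOException, SmbAuthException is a subclass of SmbException, and SmbException exposes getNtStatus(), so the exception type alone may not be enough and the status code may have to be inspected too. To keep the sketch runnable without jCIFS, it uses stand-in exception classes, not the real jcifs.smb types. The CrawlAction enum, the classify method, and the mapping from status to action are my assumptions; whether a changed password really comes back as a logon failure is exactly what the experiment has to confirm.

```java
import java.io.IOException;

class SmbFailureClassifier {
    // Stand-in for jcifs.smb.SmbException (hypothetical, for illustration only).
    static class StubSmbException extends IOException {
        private final int ntStatus;
        StubSmbException(int ntStatus, String message) {
            super(message);
            this.ntStatus = ntStatus;
        }
        int getNtStatus() {
            return ntStatus;
        }
    }

    // Stand-in for jcifs.smb.SmbAuthException (hypothetical, for illustration only).
    static class StubSmbAuthException extends StubSmbException {
        StubSmbAuthException(int ntStatus, String message) {
            super(ntStatus, message);
        }
    }

    // Hypothetical outcomes for the crawl loop.
    enum CrawlAction { ABORT_JOB, SKIP_DOCUMENT, RETRY_LATER }

    // NTSTATUS values from the SMB protocol.
    static final int NT_STATUS_ACCESS_DENIED = 0xC0000022;
    static final int NT_STATUS_LOGON_FAILURE = 0xC000006D;

    static CrawlAction classify(IOException e) {
        if (e instanceof StubSmbException) {
            int status = ((StubSmbException) e).getNtStatus();
            if (status == NT_STATUS_LOGON_FAILURE) {
                return CrawlAction.ABORT_JOB;      // credentials rejected: stop, delete nothing
            }
            if (status == NT_STATUS_ACCESS_DENIED) {
                return CrawlAction.SKIP_DOCUMENT;  // valid user, no permission on this path
            }
        }
        // Any other failure is treated as a possible service interruption.
        return CrawlAction.RETRY_LATER;
    }

    public static void main(String[] args) {
        System.out.println(classify(new StubSmbAuthException(NT_STATUS_LOGON_FAILURE, "logon failure")));
        System.out.println(classify(new StubSmbAuthException(NT_STATUS_ACCESS_DENIED, "access denied")));
        System.out.println(classify(new IOException("connection reset")));
    }
}
```

The point of the three-way split is that only the skip case should ever lead to a document being removed from the index.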


> and (2) the check itself is
> specific to the ROOT of the tree, which the user may not have access to.
>



> In check() we can get away with this but if you wire up the check() logic
> into the crawl processing it will break some people.
>
> The first problem, (1), is exactly what we need to figure out anyway.
>
> Karl
>
>
> On Tue, Mar 31, 2015 at 11:30 AM, Alessandro Benedetti <
> benedetti.alex85@gmail.com> wrote:
>
> > Hi Karl, comments follow:
> >
> > 2015-03-31 16:18 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >
> > > Hi Alessandro,
> > >
> > > There are situations where the check() method does not succeed but you
> > > can still crawl.  So I would not do it that way, since it fundamentally
> > > changes the contract.
> > >
> >
> > Am I wrong, or should we assume the check() method works as it was built
> > to?
> > I mean, if this method is wrongly implemented in some cases, that should
> > not break another assumption.
> >
> > >
> > > My proposal is to have processDocuments ABORT the job when it finds bad
> > > credentials.  That's very fast and will not permit a job to run for a
> > > long time.
> > >
> > > The trick is to determine if there are bad credentials WITHOUT doing any
> > > more work in the processDocuments pathway than we currently are.  An
> > > exception will be thrown either way, but we need to figure out whether
> > > there is any information in the exception that we can use to decide
> > > between bad credentials and no access permissions.
> > >
> > > You can help provide that by doing a simple experiment on your client's
> > > hardware (or yours, if you have such hardware in house).  Change the
> > > credential to an invalid one and see what the exception details are.
> > > Then change to valid credentials and try to crawl a directory that is not
> > > visible to the credentialed user you supplied, and make a note of the
> > > exception details in that case too.
> > >
> >
> > I was thinking of slightly modifying the getSession() method by adding a
> > file-existence check, something like this:
> >
> > ...
> >
> > try
> > {
> >     // use NtlmPasswordAuthentication so that we can reuse credential for DFS support
> >     pa = new NtlmPasswordAuthentication( domain, username, password );
> >     SmbFile smbconnection = new SmbFile( "smb://" + server + "/", pa );
> >     smbconnectionPath = getFileCanonicalPath( smbconnection );
> >     smbconnection.exists();
> > }
> > catch ( MalformedURLException e )
> > {
> >     Logging.connectors.error(
> >         "Unable to access SMB/CIFS share: " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
> >             + username + ":<password>@" + server + "/\n" + e );
> >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: " + server, e,
> >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> > } catch (SmbException e) {
> >     Logging.connectors.error(
> >         "Unable to access SMB/CIFS share: Credential not valid - "
> >             + "smb://" + ((domain == null) ? "" : domain) + ";"
> >             + username + ":<password>@" + server + "/\n" + e);
> >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: Credential not valid - " + server, e,
> >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> > }
> >
> > Catching the SmbException should do the trick.
> > Anyway, I will look into it in more detail.
> >
> > Cheers
> >
> >
> > > Karl
> > >
> > >
> > > On Tue, Mar 31, 2015 at 10:50 AM, Alessandro Benedetti <
> > > benedetti.alex85@gmail.com> wrote:
> > >
> > > > Currently we are checking each of the String[] oldVersions, trying to
> > > > access it ...
> > > > So in the scenario I described, the current performance is quite bad...
> > > > We would need to avoid scanning the oldDocs at all if we know the
> > > > provided credentials are no longer valid.
> > > >
> > > > Let me be extreme, but what about not allowing the job to start at all
> > > > if the Repository Connector is currently broken (i.e. the connection is
> > > > not working, and we know that because of the check method)?
> > > > This way we avoid destroying already existing indexes, and we simply
> > > > show a message in the job advising that it cannot start because the
> > > > Repository Connector is currently offline (with the explanation).
> > > >
> > > > Does this make sense?
> > > >
> > > > 2015-03-31 15:30 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > >
> > > > > Hi Alessandro,
> > > > >
> > > > > If you put a check in the processDocuments method, it will be called
> > > > > for every group of documents.  That's fine, but if you structure it as
> > > > > a separate call it would impact performance.  That is why I suggest
> > > > > just doing a better job of interpreting the existing exceptions.
> > > > >
> > > > > Karl
> > > > >
> > > > >
> > > > > On Tue, Mar 31, 2015 at 10:27 AM, Alessandro Benedetti <
> > > > > benedetti.alex85@gmail.com> wrote:
> > > > >
> > > > > > As an addition, this should be quite simple: do not proceed with the
> > > > > > processDocuments method if the RepositoryConnector is not able to
> > > > > > connect (the check method does not return a proper message).
> > > > > >
> > > > > > Right?
> > > > > > I am wondering where the proper point to hook in this action is :)
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > 2015-03-31 14:59 GMT+01:00 Alessandro Benedetti <
> > > > > > benedetti.alex85@gmail.com>:
> > > > > >
> > > > > > > Yes Karl,
> > > > > > > I was thinking exactly that: first check whether the credentials
> > > > > > > are valid, before scanning all the documents.
> > > > > > > This is because per-file permissions depend on users/groups, but
> > > > > > > the current scenario does not invalidate the user; it invalidates
> > > > > > > that user's access.
> > > > > > >
> > > > > > > An error must be thrown, but the docs must not be deleted (not
> > > > > > > even scanned).
> > > > > > >
> > > > > > > Furthermore, what will happen if the server is down?
> > > > > > > Are we safe in that scenario?
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > 2015-03-31 14:42 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> > > > > > >
> > > > > > >> This is actually pretty standard behavior across our connector
> > > > > > >> family, and has been true since Day One.  The behavior comes from
> > > > > > >> the basic broad requirement that the crawler should keep going
> > > > > > >> and skip the document when the permissions do not allow it to be
> > > > > > >> fetched.  With the Windows Share connector, it's sometimes the
> > > > > > >> case (when DFS is used a lot) that whole subtrees of documents
> > > > > > >> are not fetchable using the credentials supplied.  So it is not
> > > > > > >> so easy to just check for valid credentials at the beginning.
> > > > > > >>
> > > > > > >> For a solution, I'd be inclined to look for a way to figure out
> > > > > > >> if the credentials are actually *invalid*, and abort the job if
> > > > > > >> so.  This is distinct from the case where the credentials are
> > > > > > >> valid but the connector doesn't have permissions to read the
> > > > > > >> document.  It will take some experimentation to see if we get
> > > > > > >> back different exception text in the two situations.
> > > > > > >>
> > > > > > >> Karl
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Mar 31, 2015 at 9:30 AM, Alessandro Benedetti <
> > > > > > >> abenedetti@apache.org> wrote:
> > > > > > >>
> > > > > > >> > Hi guys,
> > > > > > >> > playing with the Windows Shares Connector in ManifoldCF 1.8, I
> > > > > > >> > encountered this problem:
> > > > > > >> >
> > > > > > >> > *Scenario*
> > > > > > >> > *1)* Indexing a Windows Shares server
> > > > > > >> > *2)* Indexing successfully finished with N docs indexed
> > > > > > >> > *3)* Offline, while no indexing is happening, the Administrator
> > > > > > >> > password changes on the Shares server side
> > > > > > >> > *4)* The Repository Connector is not able to connect anymore (of
> > > > > > >> > course, because the password has changed)
> > > > > > >> > *5)* On the next indexing cycle, ALL docs are removed from the index.
> > > > > > >> >
> > > > > > >> > *Expected Behaviour*
> > > > > > >> > As a user I would like to see an error message that lets me
> > > > > > >> > understand the issue, without losing all my N indexed docs.
> > > > > > >> >
> > > > > > >> > *Reason*
> > > > > > >> > Taking a look into the code, the problem seems to be in:
> > > > > > >> >
> > > > > > >> > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
> > > > > > >> >
> > > > > > >> > where it tries to access each document individually through Samba,
> > > > > > >> > removing them one by one if they are no longer reachable.
> > > > > > >> >
> > > > > > >> > *Solution*
> > > > > > >> > Before scanning each document, we have to be sure the connection is
> > > > > > >> > working.
> > > > > > >> > If not, this is only harmful.
> > > > > > >> >
> > > > > > >> > I will continue investigating, but I would like your opinion as well.
> > > > > > >> >
> > > > > > >> > Cheers
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > --------------------------
> > > > > > >> >
> > > > > > >> > Benedetti Alessandro
> > > > > > >> > Visiting card : http://about.me/alessandro_benedetti
> > > > > > >> >
> > > > > > >> > "Tyger, tyger burning bright
> > > > > > >> > In the forests of the night,
> > > > > > >> > What immortal hand or eye
> > > > > > >> > Could frame thy fearful symmetry?"
> > > > > > >> >
> > > > > > >> > William Blake - Songs of Experience -1794 England
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
