manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [Windows Shares Connector] Un-expected removal of all documents
Date Tue, 07 Apr 2015 13:52:00 GMT
Yes, this is exactly what I was thinking of.  You can go ahead and commit
this to trunk, and pull up the change to the dev_1x branch also.

Thanks!
Karl


On Tue, Apr 7, 2015 at 8:42 AM, Alessandro Benedetti <
benedetti.alex85@gmail.com> wrote:

> Hi Karl,
> getting back to the issue, I think I solved it in a quick and not very
> intrusive way:
>
>
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
>
> org/apache/manifoldcf/crawler/connectors/sharedrive/SharedDriveConnector.java:706
>
> ...
>
> catch ( jcifs.smb.SmbAuthException e )
> {
>     Logging.connectors.warn(
>         "JCIFS: Authorization exception reading version information for " + documentIdentifier
>             + " - skipping" );
>     // Constant-first equals() so a null exception message cannot cause an NPE
>     if ( "Logon failure: unknown user name or bad password.".equals( e.getMessage() ) )
>         throw new ManifoldCFException( "SmbAuthException thrown: " + e.getMessage(), e );
>     else
>         rval[i] = null;
> }
>
> ...
>
> With this change the message is checked: if it is a logon failure, we
> throw the ManifoldCFException, which breaks the iteration (a logon
> failure means no documents will be accessible, but we must not
> erase them).
>
> If it is any other authorization exception (for example, the permissions
> on the folder/file changed), the behaviour is the same as before.
>
> I think this should be enough to be safe. What do you think?
>
> Is any other method affected by this problem?
>
> I think it should be limited to the version check.
>
>
> Cheers
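
A minimal, self-contained sketch of the approach described above, for anyone who wants to try the logic without JCIFS or ManifoldCF on the classpath. `AuthFailure`, `AbortJob`, `VersionReader` and `getVersions` are hypothetical stand-ins (for `SmbAuthException`, `ManifoldCFException` and the connector's per-document loop); only the message text is taken from the patch. The comparison puts the constant first so that a null exception message cannot cause a NullPointerException.

```java
import java.util.List;

/** Hypothetical stand-in for jcifs.smb.SmbAuthException, so the sketch compiles without JCIFS. */
class AuthFailure extends Exception {
    AuthFailure(String message) { super(message); }
}

/** Hypothetical stand-in for ManifoldCFException: thrown to stop the whole iteration. */
class AbortJob extends RuntimeException {
    AbortJob(String message, Throwable cause) { super(message, cause); }
}

class VersionCheckSketch {
    // Message text as quoted in the patch.
    static final String LOGON_FAILURE = "Logon failure: unknown user name or bad password.";

    /** Hypothetical accessor: returns a version string or throws AuthFailure. */
    interface VersionReader {
        String read(String documentIdentifier) throws AuthFailure;
    }

    static String[] getVersions(List<String> ids, VersionReader reader) {
        String[] rval = new String[ids.size()];
        for (int i = 0; i < rval.length; i++) {
            try {
                rval[i] = reader.read(ids.get(i));
            } catch (AuthFailure e) {
                // Constant-first equals() stays safe when getMessage() returns null.
                if (LOGON_FAILURE.equals(e.getMessage())) {
                    // Bad credentials: stop here, so no document is marked as gone.
                    throw new AbortJob("Bad credentials: " + e.getMessage(), e);
                }
                // Any other authorization problem: skip this document only.
                rval[i] = null;
            }
        }
        return rval;
    }
}
```

With a reader that fails on one document for a reason other than a logon failure, only that slot is null; with a reader that reports the logon failure, the loop stops on the first document and nothing is marked for removal.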
>
>
> 2015-04-02 16:32 GMT+01:00 Alessandro Benedetti <
> benedetti.alex85@gmail.com>:
>
> >
> >
> > 2015-04-02 15:58 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >
> >> Hi Alessandro,
> >>
> >> Yes, you interpreted my reply correctly.
> >>
> >> I think we therefore have to perform any checking operations on the actual
> >> file being accessed.  This is actually pretty easy to do without
> >> sacrificing performance.  All you need to do is the following:
> >>
> >> try {
> >>   ... do the file access operation ...
> >> } catch (SmbException e) {
> >>   ... figure out from the exception whether to throw a ManifoldCFException
> >>   or a ServiceInterruption ...
> >>   ... If the exception does not include enough to distinguish between bad
> >>   credentials and insufficient privs, then do a check RIGHT HERE for bad
> >>   credentials ...
> >> }
> >>
> >> What do you think?  The new code would only ever be called if the document
> >> cannot be read.
> >>
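
The decision outlined above can be sketched as a three-way classification. This is an illustration under stated assumptions, not connector code: the `Outcome` enum and the `classify` method are hypothetical, the two constants are the standard Windows NTSTATUS values for access denied and logon failure, and the sketch assumes the status code can be read from the exception (JCIFS's `SmbException.getNtStatus()` should provide it, but verify that against the JCIFS version in use). Treating every unrecognized status as transient is a conservative choice made for this sketch, so that nothing is deleted because of an error the code does not understand.

```java
/** Hypothetical outcomes the connector could choose between. */
enum Outcome {
    ABORT_JOB,      // bad credentials: would map to a ManifoldCFException
    SKIP_DOCUMENT,  // valid credentials, no permission on this document
    RETRY_LATER     // anything else: would map to a ServiceInterruption
}

class SmbFailureClassifier {
    // Standard Windows NTSTATUS values (negative when stored in a Java int).
    static final int NT_STATUS_ACCESS_DENIED = 0xC0000022;
    static final int NT_STATUS_LOGON_FAILURE = 0xC000006D;

    /** Maps an NTSTATUS code to the action to take for the current document. */
    static Outcome classify(int ntStatus) {
        if (ntStatus == NT_STATUS_LOGON_FAILURE) {
            return Outcome.ABORT_JOB;
        }
        if (ntStatus == NT_STATUS_ACCESS_DENIED) {
            return Outcome.SKIP_DOCUMENT;
        }
        // Assumption of this sketch: unknown failures are treated as transient.
        return Outcome.RETRY_LATER;
    }
}
```

Because the classification runs only inside the catch block, it adds no work to the normal path where the document can be read.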
> >
> > I think we can proceed as you said. I am currently investigating the
> > details returned with the exception (to understand whether there is any
> > difference between wrong credentials and access denied).
> > If we detect wrong credentials, we have to throw the exception
> > and stop the iteration (this will happen on the very first document,
> > assuming nobody is changing things on the server side).
> > This way we save the time of checking all the files (with
> > wrong credentials none of them will be accessible).
> >
> > Another way would be to do this credential check at the beginning and stop
> > only if the credentials are wrong (leaving the permission check to happen
> > file by file).
> >
> > Quite a confusing scenario, but we can sort this out with small changes :)
> >
> >
> >
> >>
> >> Karl
> >>
> >>
> >> On Thu, Apr 2, 2015 at 10:42 AM, Alessandro Benedetti <
> >> benedetti.alex85@gmail.com> wrote:
> >>
> >> > OK, I am currently working on that, and I will work on it next Tuesday
> >> > as well.
> >> > But what about point 2:
> >> > "(2) the check itself is
> >> > specific to the ROOT of the tree, which the user may not have access to."
> >> >
> >> > I think I understand the problem: you mean the scenario where the
> >> > repository connector is configured with a user that is not able
> >> > to access the root but is able to access the directories we want to crawl.
> >> > In such a case the repository connector will appear unable to
> >> > connect, while crawling will still be possible if you configure the
> >> > accessible directories in the job.
> >> > If this is correct, the situation is more complicated ...
> >> >
> >> > Cheers
> >> >
> >> >
> >> > 2015-03-31 16:44 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >> >
> >> > > Hi Alessandro,
> >> > >
> >> > > Your code snippet has two problems: (1) it doesn't distinguish between
> >> > > service interruptions and bad credentials,
> >> >
> >> >
> >> > Shouldn't that be the difference between the IOException and the
> >> > SmbException?
> >> >
> >> >
> >> > > and (2) the check itself is
> >> > > specific to the ROOT of the tree, which the user may not have access to.
> >> > >
> >> >
> >> >
> >> >
> >> > > In check() we can get away with this but if you wire up the check() logic
> >> > > into the crawl processing it will break some people.
> >> > >
> >> > > The first problem, (1), is exactly what we need to figure out anyway.
> >> > >
> >> > > Karl
> >> > >
> >> > >
> >> > > On Tue, Mar 31, 2015 at 11:30 AM, Alessandro Benedetti <
> >> > > benedetti.alex85@gmail.com> wrote:
> >> > >
> >> > > > Hi Karl, comments follow:
> >> > > >
> >> > > > 2015-03-31 16:18 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >> > > >
> >> > > > > Hi Alessandro,
> >> > > > >
> >> > > > > There are situations where the check() method does not succeed but you can
> >> > > > > still crawl.  So I would not do it that way, since it fundamentally changes
> >> > > > > the contract.
> >> > > > >
> >> > > >
> >> > > > Am I wrong, or should we assume that the check() method works the way
> >> > > > it was built to?
> >> > > > I mean, if in some cases this method is wrongly implemented, that
> >> > > > should not break another assumption.
> >> > > >
> >> > > > >
> >> > > > > My proposal is to have processDocuments ABORT the job when it finds bad
> >> > > > > credentials.  That's very fast and will not permit a job to run for a long
> >> > > > > time.
> >> > > > >
> >> > > > > The trick is to determine if there are bad credentials WITHOUT doing any
> >> > > > > more work in the processDocuments pathway than we currently are.  An
> >> > > > > exception will be thrown either way, but we need to figure out whether
> >> > > > > there is any information in the exception that we can use to decide between
> >> > > > > bad credentials and no access permissions.
> >> > > > >
> >> > > > > You can help provide that by doing a simple experiment on your client's
> >> > > > > hardware (or yours, if you have such hardware in house).  Change the
> >> > > > > credential to an invalid one and see what the exception details are.  Then
> >> > > > > change to valid credentials and try to crawl a directory that is not
> >> > > > > visible to the credentialed user you supplied, and make a note of the
> >> > > > > exception details in that case too.
> >> > > > >
> >> > > >
> >> > > > I was thinking of slightly modifying the getSession() method by adding
> >> > > > a file-exists check, something like this:
> >> > > >
> >> > > > ...
> >> > > >
> >> > > > try
> >> > > > {
> >> > > >     // use NtlmPasswordAuthentication so that we can reuse credential for DFS support
> >> > > >     pa = new NtlmPasswordAuthentication( domain, username, password );
> >> > > >     SmbFile smbconnection = new SmbFile( "smb://" + server + "/", pa );
> >> > > >     smbconnectionPath = getFileCanonicalPath( smbconnection );
> >> > > >     smbconnection.exists();
> >> > > > }
> >> > > > catch ( MalformedURLException e )
> >> > > > {
> >> > > >     Logging.connectors.error(
> >> > > >         "Unable to access SMB/CIFS share: " + "smb://" + ( ( domain == null ) ? "" : domain ) + ";"
> >> > > >             + username + ":<password>@" + server + "/\n" + e );
> >> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: " + server, e,
> >> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> >> > > > } catch (SmbException e) {
> >> > > >     Logging.connectors.error(
> >> > > >         "Unable to access SMB/CIFS share: Credential not valid - " + "smb://" + ((domain == null) ? "" : domain) + ";"
> >> > > >             + username + ":<password>@" + server + "/\n" + e);
> >> > > >     throw new ManifoldCFException( "Unable to access SMB/CIFS share: Credential not valid - " + server, e,
> >> > > >         ManifoldCFException.REPOSITORY_CONNECTION_ERROR );
> >> > > > }
> >> > > >
> >> > > > Catching the SmbException should do the trick.
> >> > > > Anyway, I will look into it in more detail.
> >> > > >
> >> > > > Cheers
> >> > > >
> >> > > >
> >> > > > > Karl
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Mar 31, 2015 at 10:50 AM, Alessandro Benedetti <
> >> > > > > benedetti.alex85@gmail.com> wrote:
> >> > > > >
> >> > > > > > Currently we are checking each of the String[] oldVersions, trying to
> >> > > > > > access it ...
> >> > > > > > So in the scenario I described the current performance is quite bad...
> >> > > > > > We need to avoid scanning the old docs at all if we know the
> >> > > > > > provided credentials are no longer valid.
> >> > > > > >
> >> > > > > > Let me be extreme: what about not allowing the job to start at all if
> >> > > > > > the Repository Connector is currently broken (i.e. the connection is not
> >> > > > > > working, and we know that because of the check method)?
> >> > > > > > This way we avoid destroying already existing indexes and we simply
> >> > > > > > show a message in the job advising that the job cannot start
> >> > > > > > because the Repository Connector is currently offline (and showing the
> >> > > > > > explanation).
> >> > > > > >
> >> > > > > > Does this make sense?
> >> > > > > >
> >> > > > > > 2015-03-31 15:30 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >> > > > > >
> >> > > > > > > Hi Alessandro,
> >> > > > > > >
> >> > > > > > > If you put a check in the processDocuments method, it will be called for
> >> > > > > > > every group of documents.  That's fine, but if you structure it as a
> >> > > > > > > separate call it would impact performance.  That is why I suggest just
> >> > > > > > > doing a better job of interpreting the existing exceptions.
> >> > > > > > >
> >> > > > > > > Karl
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Tue, Mar 31, 2015 at 10:27 AM, Alessandro Benedetti <
> >> > > > > > > benedetti.alex85@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > In addition, it should be quite simple not to proceed with the
> >> > > > > > > > processDocuments method if the RepositoryConnector is not able to
> >> > > > > > > > connect (i.e. the check method does not return the expected message).
> >> > > > > > > >
> >> > > > > > > > Right?
> >> > > > > > > > I am wondering where the proper point to add this check is :)
> >> > > > > > > >
> >> > > > > > > > Cheers
> >> > > > > > > >
> >> > > > > > > > 2015-03-31 14:59 GMT+01:00 Alessandro Benedetti <
> >> > > > > > > > benedetti.alex85@gmail.com>:
> >> > > > > > > >
> >> > > > > > > > > Yes Karl,
> >> > > > > > > > > I was thinking exactly that: first check whether the credentials are
> >> > > > > > > > > valid, before scanning all the documents.
> >> > > > > > > > > This is because per-file permissions depend on users/groups, but the
> >> > > > > > > > > current scenario is not invalidating the user, it is invalidating
> >> > > > > > > > > that user's access.
> >> > > > > > > > >
> >> > > > > > > > > An error must be thrown, but the docs must not be deleted (not even
> >> > > > > > > > > scanned).
> >> > > > > > > > >
> >> > > > > > > > > Furthermore, what will happen if the server is down?
> >> > > > > > > > > Are we safe in that scenario?
> >> > > > > > > > >
> >> > > > > > > > > Cheers
> >> > > > > > > > >
> >> > > > > > > > > 2015-03-31 14:42 GMT+01:00 Karl Wright <daddywri@gmail.com>:
> >> > > > > > > > >
> >> > > > > > > > >> This is actually pretty standard behavior across our connector family, and
> >> > > > > > > > >> has been true since Day One.  The behavior comes from the basic broad
> >> > > > > > > > >> requirement that the crawler should keep going and skip the document when
> >> > > > > > > > >> the permissions do not allow it to be fetched.  With the Windows Share
> >> > > > > > > > >> connector, it's sometimes the case (when DFS is used a lot) that whole
> >> > > > > > > > >> subtrees of documents are not fetchable using the credentials supplied.  So
> >> > > > > > > > >> it is not so easy to just check for valid credentials at the beginning.
> >> > > > > > > > >>
> >> > > > > > > > >> For a solution, I'd be inclined to look for a way to figure out if the
> >> > > > > > > > >> credentials are actually *invalid*, and abort the job if so.  This is
> >> > > > > > > > >> distinct from the case where the credentials are valid but the connector
> >> > > > > > > > >> doesn't have permissions to read the document.  It will take some
> >> > > > > > > > >> experimentation to see if we get back different exception text in the two
> >> > > > > > > > >> situations.
> >> > > > > > > > >>
> >> > > > > > > > >> Karl
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> On Tue, Mar 31, 2015 at 9:30 AM, Alessandro Benedetti <
> >> > > > > > > > >> abenedetti@apache.org> wrote:
> >> > > > > > > > >>
> >> > > > > > > > >> > Hi guys,
> >> > > > > > > > >> > playing with the Windows Shares Connector in ManifoldCF 1.8 I
> >> > > > > > > > >> > encountered this problem:
> >> > > > > > > > >> >
> >> > > > > > > > >> > *Scenario*
> >> > > > > > > > >> > *1)* Indexing a Windows Shares server
> >> > > > > > > > >> > *2)* Indexing finishes successfully with N docs indexed
> >> > > > > > > > >> > *3)* Offline, while no indexing is happening, the Administrator
> >> > > > > > > > >> > password changes on the Shares server side
> >> > > > > > > > >> > *4)* The Repository Connector is no longer able to connect (of
> >> > > > > > > > >> > course, because the password has changed)
> >> > > > > > > > >> > *5)* On the next indexing cycle, ALL docs are removed from the index.
> >> > > > > > > > >> >
> >> > > > > > > > >> > *Expected Behaviour*
> >> > > > > > > > >> > As a user I would like to see an error message that lets me
> >> > > > > > > > >> > understand the issue, without losing all my N indexed docs.
> >> > > > > > > > >> >
> >> > > > > > > > >> > *Reason*
> >> > > > > > > > >> > Taking a look at the code, the problem seems to be in:
> >> > > > > > > > >> >
> >> > > > > > > > >> > org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector#getDocumentVersions
> >> > > > > > > > >> > where it tries to access each document individually through Samba,
> >> > > > > > > > >> > removing them one by one if they are no longer reachable.
> >> > > > > > > > >> >
> >> > > > > > > > >> > *Solution*
> >> > > > > > > > >> > Before scanning each document, we have to be sure the connection is
> >> > > > > > > > >> > working.
> >> > > > > > > > >> > If it is not, the scan is only harmful.
> >> > > > > > > > >> >
> >> > > > > > > > >> > I will continue investigating, but I would like your opinion as well.
> >> > > > > > > > >> > Cheers
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > --
> >> > > > > > > > >> > --------------------------
> >> > > > > > > > >> >
> >> > > > > > > > >> > Benedetti Alessandro
> >> > > > > > > > >> > Visiting card : http://about.me/alessandro_benedetti
> >> > > > > > > > >> >
> >> > > > > > > > >> > "Tyger, tyger burning bright
> >> > > > > > > > >> > In the forests of the night,
> >> > > > > > > > >> > What immortal hand or eye
> >> > > > > > > > >> > Could frame thy fearful symmetry?"
> >> > > > > > > > >> >
> >> > > > > > > > >> > William Blake - Songs of Experience -1794 England
> >> > > > > > > > >> >
