manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: web crawler not sharing cookies
Date Wed, 25 Jul 2018 22:06:18 GMT
The web connector, though, does not filter any cookies.  It takes them all
-- whatever cookies HttpClient is storing at that point.  So you should see
all the cookies in the database table, regardless of their site affinity,
unless HttpClient is refusing to accept a cookie for security reasons.

It's also possible that HttpClient is selective about which cookies to
transmit on a page fetch.

Can you look in the database and tell me whether your cookie gets stored,
or not?  If not, then HttpClient's cookie acceptance policy is not lenient
enough.  If it is in the database, then it's the transmission policy that
is too strict.
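
For reference, here is a minimal sketch of the domain-match rule from RFC 6265 section 5.1.3, which is the kind of check a cookie policy applies before transmitting a cookie. It is illustration only: the class and method names are hypothetical, this is not code from ManifoldCF or HttpClient, and I am assuming the configured cookie spec behaves roughly like the RFC. Note that under the RFC a cookie set without a Domain attribute is "host-only" and goes back only to the exact host that set it, which would look just like what you describe.

```java
import java.util.Locale;

// Hypothetical illustration of RFC 6265 section 5.1.3 domain matching.
// Not taken from ManifoldCF or HttpClient.
final class CookieDomainSketch {

    // True when a cookie whose Domain attribute is cookieDomain may be
    // sent on a request to requestHost.
    static boolean domainMatches(String requestHost, String cookieDomain) {
        String host = requestHost.toLowerCase(Locale.ROOT);
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        // A leading dot in the Domain attribute is ignored.
        if (domain.startsWith(".")) {
            domain = domain.substring(1);
        }
        if (domain.isEmpty()) {
            return false;
        }
        if (host.equals(domain)) {
            return true;
        }
        // Suffix matching applies to host names only, never to IP addresses.
        if (looksLikeIpAddress(host)) {
            return false;
        }
        // The host must end with "." + domain, so "notz.com" does not match "z.com".
        return host.endsWith("." + domain);
    }

    // Simplified check (an assumption made for this sketch): IPv4 dotted quad
    // or anything containing a colon (IPv6).
    static boolean looksLikeIpAddress(String host) {
        return host.indexOf(':') >= 0 || host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    public static void main(String[] args) {
        System.out.println(domainMatches("x.y.z.com", "z.com"));  // true
        System.out.println(domainMatches("y.z.com", ".z.com"));   // true
        System.out.println(domainMatches("z.com", "y.z.com"));    // false
        System.out.println(domainMatches("notz.com", "z.com"));   // false
    }
}
```

If the stored cookie's domain is "z.com" and a check like this passes for "x.y.z.com", yet the cookie is still not sent, that again points at the transmission side rather than at how the cookie was stored.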

Thanks,
Karl


On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <gustavo.beneitez@gmail.com>
wrote:

> I agree, but the fact is that if my "login sequence" defines a login
> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
> "X.Y.Z.com", neither sub-site receives that cookie. I had to write the
> same cookie for every subdomain, which solves the problem (thankfully
> it is a language cookie and not a dynamic one).
>
> Regards.
>
> On Wed, Jul 25, 2018 at 19:17, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
>> You should not need to fill the database by hand.  Your login sequence
>> should include whatever redirection etc is used to set the cookies though.
>>
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>> gustavo.beneitez@gmail.com> wrote:
>>
>>> Hi again,
>>>
>>> Thanks Karl, I was able to do that after defining a "login sequence",
>>> but I also had to fill the database (cookiedata table) with certain
>>> values because of domain restrictions.
>>> I suspect that before every web call Manifold only takes cookies for the
>>> URL's exact subdomain (e.g. x.y.z.com), so a cookie defined for "z.com"
>>> won't be sent. I added every subdomain by hand and it started to work.
>>>
>>> Regards.
>>>
>>>
>>> On Fri, Jul 20, 2018 at 8:12, Gustavo Beneitez (<
>>> gustavo.beneitez@gmail.com>) wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks a lot. Let me check the documentation for an example of that.
>>>>
>>>> Regards!
>>>>
>>>> On Thu, Jul 19, 2018 at 21:54, Karl Wright (<daddywri@gmail.com>)
>>>> wrote:
>>>>
>>>>> You are correct that cookies are not shared among threads.  That is by
>>>>> design.
>>>>>
>>>>> The only way to set cookies for the WebConnector is to have there be a
>>>>> "login sequence".  The login sequence sets cookies that are then used by
>>>>> all subsequent fetches.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>>>>> gustavo.beneitez@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I have tried to look for an answer before writing this email, no
>>>>>> luck. Sorry for the inconvenience if it is already answered.
>>>>>>
>>>>>> I need to set a cookie at the beginning of the web crawl. The cookie
>>>>>> determines the language in which content is served. There are several
>>>>>> choices, and if no cookie is found a "default language" is used.
>>>>>>
>>>>>> I made a JSP which sets the cookie and contains several links (href),
>>>>>> and pointed ManifoldCF to this page as the repository seed. I expected
>>>>>> the crawling engine to follow the links and fetch them in the language
>>>>>> indicated by the cookie, but what I actually got was a lot of content
>>>>>> in the default language.
>>>>>>
>>>>>> My guess is that cookies are not shared between crawler threads, so
>>>>>> cookies do not persist from one link to the next. The cookie domain
>>>>>> is correct, and so is the cookie expiration.
>>>>>>
>>>>>> I would really appreciate any help with this.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>>
>>>>>>
