manifoldcf-user mailing list archives

From Gustavo Beneitez <gustavo.benei...@gmail.com>
Subject Re: web crawler not sharing cookies
Date Wed, 25 Jul 2018 17:06:38 GMT
Hi again,

Thanks Karl, I was able to do that after defining a "login sequence",
but only after also filling the database (the cookiedata table) with certain
values because of "domain restrictions".
I suspect that, before every web call, ManifoldCF only takes cookies whose
domain exactly matches the URL's full host name (e.g. x.y.z.com), so a cookie
defined for "z.com" won't be sent. I added every subdomain by hand and it
started to work.
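
As an illustration of the suspected difference, here is a minimal Java sketch.
It is not ManifoldCF code: the class and method names are hypothetical, and
the exact-host behaviour is only my assumption about what the connector does.
It compares exact-host matching with the suffix-style domain matching
described in RFC 6265 (simplified: IP-address hosts are not handled).

```java
import java.util.Locale;

// Hypothetical illustration only; not taken from ManifoldCF.
final class CookieDomainMatch {

    // Cookie domains are sometimes written with a leading dot (".z.com").
    static String stripLeadingDot(String domain) {
        return domain.startsWith(".") ? domain.substring(1) : domain;
    }

    // Assumed behaviour: the cookie is sent only if its domain equals the host.
    static boolean exactMatch(String requestHost, String cookieDomain) {
        return requestHost.equalsIgnoreCase(stripLeadingDot(cookieDomain));
    }

    // Simplified RFC 6265 domain matching: the host equals the cookie domain,
    // or ends with "." + cookie domain.
    static boolean domainMatch(String requestHost, String cookieDomain) {
        String host = requestHost.toLowerCase(Locale.ROOT);
        String domain = stripLeadingDot(cookieDomain).toLowerCase(Locale.ROOT);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        String host = "x.y.z.com";
        System.out.println(exactMatch(host, "z.com"));      // cookie not sent
        System.out.println(domainMatch(host, "z.com"));     // cookie sent
        System.out.println(exactMatch(host, "x.y.z.com"));  // cookie sent
    }
}
```

Under exact matching, a cookie stored for "z.com" never reaches x.y.z.com,
which would explain why one row per subdomain was needed.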

Regards.


On Fri, Jul 20, 2018 at 8:12, Gustavo Beneitez (<
gustavo.beneitez@gmail.com>) wrote:

> Hi,
>
> Thanks a lot. Let me check the documentation for an example of
> that.
>
> Regards!
>
> On Thu, Jul 19, 2018 at 21:54, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
>> You are correct that cookies are not shared among threads.  That is by
>> design.
>>
>> The only way to set cookies for the WebConnector is to define a
>> "login sequence".  The login sequence sets cookies that are then used by
>> all subsequent fetches.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>> gustavo.beneitez@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I tried to find an answer before writing this email, with no luck.
>>> Sorry for the inconvenience if this has already been answered.
>>>
>>> I need to set a cookie at the beginning of the web crawl. The cookie
>>> determines the language in which content is served. There are several
>>> choices, and if no cookie is found a "default language" is used.
>>>
>>> I made a JSP that sets the cookie and contains several links (href),
>>> and pointed ManifoldCF to this page as the repository seed. I expected
>>> the crawling engine to fetch the links in the language indicated by the
>>> cookie, but what I actually got was a lot of content in the default
>>> language.
>>>
>>> My guess is that cookies are not shared between spider threads, so
>>> they do not persist from one link to the next. The cookie domain is
>>> correct, and so is the cookie expiration.
>>>
>>> I would really appreciate it if you could help me with this.
>>>
>>> Thanks in advance!
>>>
>>>
>>>
