any23-dev mailing list archives

From "McBennett, Pat" <McBenne...@DNB.com.INVALID>
Subject How to configure Any23 programmatically in Java?
Date Wed, 29 Apr 2015 16:11:31 GMT
Hi,

I've just started trying to use Any23 programmatically from Java, and it looks great.
The documentation has sample code [1], but that code seems out of date: the webpage it attempts
to extract from (http://www.rentalinrome.com/semanticloft/semanticloft.htm) appears to have changed.
The sample also has a syntax error: the word 'Apache' appears twice on line 1, which doesn't
make any sense.

My questions are simply:

1. How do I configure the 'Any23' instance in this code? I know the constructor takes
a Properties instance, but where are the currently supported properties documented? For instance,
how do I set the timeout for the connection attempt?

2. This code sample doesn't seem to crawl from the webpage I provide; it just scans
that one page. So is there a code sample for crawling a website (with code to show how to
configure the MaxPages and MaxDepth)?
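
To make question 2 concrete, here is a rough sketch of the bounded crawl behaviour I mean by
MaxPages and MaxDepth. It uses only the plain JDK and makes no Any23 calls at all; the class
name BoundedCrawl, the crawl method, and the linksOf function (which stands in for "fetch a page
and return its outgoing links") are all hypothetical names I made up for illustration, not part
of any library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical illustration only: a breadth-first crawl limited by page count and link depth.
final class BoundedCrawl {

    // Returns the URLs visited, in visit order, starting from the seed.
    // linksOf is an assumed stand-in for "download the page and list its links".
    static List<String> crawl(String seed, int maxPages, int maxDepth,
                              Function<String, List<String>> linksOf) {
        List<String> visited = new ArrayList<>();
        // Depth at which each URL was first discovered; also prevents re-queuing a URL.
        Map<String, Integer> depthOf = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();

        depthOf.put(seed, 0);
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);

            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // Visit this page, but do not follow its links any further.
            }
            for (String link : linksOf.apply(url)) {
                if (!depthOf.containsKey(link)) {
                    depthOf.put(link, depth + 1);
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

With a maxDepth of 0 this visits only the seed page, which is what the documented sample seems
to do today. What I am after is the equivalent behaviour with each visited page handed to Any23
for extraction.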

Thanks,

Pat.

[1] - http://any23.apache.org/dev-data-extraction.html



Pat McBennett
Architect
The Chase Building, 5th Floor
Carmanhall Road, Sandyford,
Dublin 18, Ireland
Direct +353 1
Mobile +353 8

http://www.dnb.co.uk/




