incubator-droids-dev mailing list archives

From "Mingfai Ma (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DROIDS-54) Make LinkTask supports arbitrary data by extends HashMap, and consider to refactor Task, Link, and LinkTask
Date Thu, 18 Jun 2009 10:51:07 GMT

     [ https://issues.apache.org/jira/browse/DROIDS-54?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingfai Ma updated DROIDS-54:
-----------------------------

    Attachment: SampleLink.java

Attached is a sample implementation for review.

 - We can still make LinkTask extend this base Link class, or just add more methods to this
class (and optionally rename it to LinkTask).
 - It stores the url as a String, but the constructor always calls new URI() to ensure the url
string is valid at construction time.
 - Methods like toString, equals and hashCode may be deleted in the final implementation, or
changed to follow this project's standards.
 - A few convenience methods are added, such as getHost(), getURI() and resolve(String).
resolve is added to mirror the resolve method that URI already has. Using a LinkResolver with
the same base URI could be slightly more efficient.
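
The points above can be sketched roughly as follows. This is not the attached SampleLink.java; the class name LinkSketch, the getUrl() accessor, and the choice to rethrow a URISyntaxException as an IllegalArgumentException are all assumptions made for illustration.

```java
import java.io.Serializable;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of a base link class; not the attached SampleLink.java.
class LinkSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // The url is kept exactly as the caller supplied it.
    private final String url;

    LinkSketch(String url) {
        try {
            // Parse once so that an invalid url fails at construction time.
            new URI(url);
        } catch (URISyntaxException e) {
            // Assumption: surface the failure as an unchecked exception.
            throw new IllegalArgumentException("Invalid url: " + url, e);
        }
        this.url = url;
    }

    String getUrl() {
        return url;
    }

    URI getURI() {
        // The string was already validated in the constructor.
        return URI.create(url);
    }

    String getHost() {
        return getURI().getHost();
    }

    // Mirrors URI.resolve(String): resolves a reference against this link.
    LinkSketch resolve(String reference) {
        return new LinkSketch(getURI().resolve(reference).toString());
    }

    public static void main(String[] args) {
        LinkSketch base = new LinkSketch("http://www.apache.org/foundation/");
        System.out.println(base.getHost());
        System.out.println(base.resolve("faq.html").getUrl());
    }
}
```

Note that getURI() parses the string again on every call, which is the cost of storing only the String; a caller resolving many references against one base could hold on to the URI instead.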

For my part, I am using a crawler derived from Droids, and I declare every usage of Link as <T
extends Link>, e.g. LinkQueue<T extends Link> extends PriorityBlockingQueue<T>.
This could also be considered.
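
A minimal sketch of that generic declaration, under stated assumptions: the Link interface below is a hypothetical stand-in for the Droids one, PageLink is an invented subtype, and ordering the queue by depth is only an example policy.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Queue bound to a Link subtype, so callers get their own type back without casts.
class LinkQueue<T extends Link> extends PriorityBlockingQueue<T> {
    private static final long serialVersionUID = 1L;

    LinkQueue() {
        // Assumption: shallower links first. 11 is the JDK's default initial capacity.
        super(11, Comparator.<Link>comparingInt(Link::getDepth));
    }

    public static void main(String[] args) {
        LinkQueue<PageLink> queue = new LinkQueue<>();
        queue.add(new PageLink("http://www.apache.org/a/b", 2));
        queue.add(new PageLink("http://www.apache.org/", 0));
        PageLink next = queue.poll(); // typed as PageLink, no cast needed
        System.out.println(next.getUrl());
    }
}

// Hypothetical stand-in for the Droids Link interface.
interface Link {
    String getUrl();
    int getDepth();
}

// Hypothetical application-specific Link subtype.
class PageLink implements Link {
    private final String url;
    private final int depth;

    PageLink(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    @Override
    public String getUrl() {
        return url;
    }

    @Override
    public int getDepth() {
        return depth;
    }
}
```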



> Make LinkTask supports arbitrary data by extends HashMap, and consider to refactor Task, Link, and LinkTask
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: DROIDS-54
>                 URL: https://issues.apache.org/jira/browse/DROIDS-54
>             Project: Droids
>          Issue Type: New Feature
>          Components: core
>    Affects Versions: 0.01
>            Reporter: Mingfai Ma
>         Attachments: SampleLink.java
>
>
> Refer to the initial idea at:
> https://issues.apache.org/jira/browse/DROIDS-48?focusedCommentId=12721121&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12721121
> The current implementation of LinkTask:
> {code}
> public class LinkTask implements Link, Serializable
> {
>   private Date started;
>   private final int depth;
>   private final URI uri;
>   private final Link from;
>   
>   private Date lastModifedDate;
>   private Collection<URI> linksTo;
>   private String anchorText;
>   private int weight;
> {code}
> Suggested change:
> {code}
> public class LinkTask extends HashMap<String, Serializable>
> // or
> public class LinkTask extends HashMap<String, Serializable> implements Link
> {code}
> The minimum required attributes are:
>  - final ? id
>    - mainly to provide a minimum-size value to use as a hash key and to store in memory or
>      a data grid for lookup, e.g. as a history to avoid duplicate fetching. Refer to DROIDS-53.
>  - final String url
>    - the original String representation of the URL (preferred), or a java.net.URI
>      representation with the encoded string (which seems worse).
>    - the url is the original one provided by the user at construction. Two different url
>      strings may refer to the same resource, e.g. http://www.apache.org and
>      http://www.apache.org/. It's up to the user to decide whether they should be normalized
>      (they could use the URL/LinkNormalizer in DROIDS-45).
> The other fields are basically optional:
>   - started/taskDate: useful if the queue uses it for sorting; otherwise it's just for
>     logging.
>   - "weight" is another example of a field that not all implementations may need.
>   - "linksTo", a.k.a. outLinks, is also optional to attach to the LinkTask. An implementation
>     may extract the outlinks and put them in the queue directly without storing them in the
>     LinkTask.
>   - "from", a.k.a. referrer, should not store the Link reference, as holding it will affect GC.
> By the way, should we also simplify Link, Task and LinkTask? If we use a Map, it's already
> very generic. Link and Task could be different concepts if we need to use them separately.
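
The Map-based proposal quoted above might look roughly like the sketch below. The class name MapLinkTask, the long type for id (the issue leaves that type open), the key names "weight" and "from", and the getWeight() accessor are all assumptions for illustration, not part of Droids.

```java
import java.io.Serializable;
import java.util.Date;
import java.util.HashMap;

// Hypothetical sketch of a LinkTask backed by a HashMap of optional attributes.
class MapLinkTask extends HashMap<String, Serializable> {
    private static final long serialVersionUID = 1L;

    // Required attributes stay as final fields.
    private final long id;     // assumption: the issue does not fix the id type
    private final String url;  // original string as supplied by the user

    MapLinkTask(long id, String url) {
        this.id = id;
        this.url = url;
    }

    long getId() {
        return id;
    }

    String getUrl() {
        return url;
    }

    // Optional attribute read from the map; returns null when it was never set.
    Integer getWeight() {
        return (Integer) get("weight");
    }

    public static void main(String[] args) {
        MapLinkTask task = new MapLinkTask(1L, "http://www.apache.org/");
        task.put("started", new Date());
        task.put("weight", Integer.valueOf(5));
        // Store the referrer as a url string, not as a Link reference.
        task.put("from", "http://www.apache.org/index.html");
        System.out.println(task.getUrl() + " weight=" + task.getWeight());
    }
}
```

One design point to keep in mind: equals and hashCode inherited from HashMap compare only the map entries, so two tasks with different id and url but identical optional attributes would compare equal unless those methods are overridden.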

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

