nutch-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-439) Top Level Domains Indexing / Scoring
Date Tue, 06 Feb 2007 14:04:05 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v1.0.patch

This is a plugin implementation for indexing and scoring top-level domains (TLDs) in Nutch.
TLDs are stored in the TLDEntry class, which has the fields domain, status and boost. The
TLDs are read from an XML file. There is also an XSD for validation.
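
As a rough illustration of an entry with the three fields named above, here is a minimal Java
sketch. It is not the code from the patch: the class name TldEntrySketch, the XML layout
(<tld domain=".." status=".." boost=".."/>), the status value and the default boost of 1.0
are all assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for the TLDEntry class described in the message.
final class TldEntrySketch {
    final String domain;
    final String status;
    final float boost;

    TldEntrySketch(String domain, String status, float boost) {
        this.domain = domain;
        this.status = status;
        this.boost = boost;
    }

    // Parses entries from XML in an ASSUMED layout:
    //   <tlds><tld domain="edu" status="IN_USE" boost="1.1"/></tlds>
    // A missing boost attribute falls back to 1.0 (also an assumption).
    static Map<String, TldEntrySketch> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("tld");
            Map<String, TldEntrySketch> entries = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String domain = e.getAttribute("domain").toLowerCase(Locale.ROOT);
                String status = e.getAttribute("status");
                String boostAttr = e.getAttribute("boost"); // "" when absent
                float boost = boostAttr.isEmpty() ? 1.0f : Float.parseFloat(boostAttr);
                entries.put(domain, new TldEntrySketch(domain, status, boost));
            }
            return entries;
        } catch (Exception ex) {
            // Wrap parser and I/O errors so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse TLD XML", ex);
        }
    }
}
```

Keying the map by lower-cased domain keeps later lookups a single map access. Validation
against the XSD is left out of this sketch.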

TLDIndexingFilter implements the IndexingFilter interface to index the domain extensions
(such as "net", "org", "uk", "de") in the tld field.
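
The value stored in the tld field can be pictured as the last label of the host name. The
sketch below is only an assumption about how such an extraction might look. It does not use
the Nutch IndexingFilter API, and the class and method names (TldExtractor, tldOf) are
hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of the patch: returns the last host label.
final class TldExtractor {

    // Returns the lower-cased last label of the host ("org" for
    // "http://www.example.org/"), or null when no usable label exists.
    static String tldOf(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // malformed URL
        }
        if (host == null || host.isEmpty()) {
            return null;
        }
        int dot = host.lastIndexOf('.');
        if (dot < 0 || dot == host.length() - 1) {
            return null; // single-label host or trailing dot
        }
        String tld = host.substring(dot + 1).toLowerCase(Locale.ROOT);
        // An all-digit last label means an IPv4 address, not a domain name.
        if (tld.chars().allMatch(Character::isDigit)) {
            return null;
        }
        return tld;
    }
}
```

For example, tldOf("http://www.example.org/index.html") yields "org". This sketch only looks
at the last label, so it does not handle second-level suffixes such as co.uk.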

TLDScoringFilter implements the ScoringFilter interface. This filter multiplies the initial
boost (coming from another scoring filter such as OPIC) by the boost of the domain. This way,
by configuring the boost of, say, "edu" domains to 1.1, the index boost of documents from
educational sites is multiplied by 1.1. Local search engines may also wish to boost the
domains of their own country. For example, boosting "de" domains a little in a German search
engine seems reasonable. An alternative usage is to lower the boosts of domains such as
"biz" or "info", which are known to have lots of spam.
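
The multiplication described above can be sketched in a few lines of Java. This is a
simplified illustration that does not implement the Nutch ScoringFilter interface. The class
name TldBoostSketch, the map-based configuration and the neutral default for unconfigured
domains are assumptions made for this example.

```java
import java.util.Map;

// Hypothetical illustration of per-TLD boosting, not the patch's code.
final class TldBoostSketch {
    private final Map<String, Float> boosts;

    TldBoostSketch(Map<String, Float> boosts) {
        this.boosts = boosts;
    }

    // Multiplies the boost from an earlier scoring step by the boost
    // configured for the TLD. Unconfigured TLDs leave the boost unchanged
    // (an assumption, equivalent to a boost of 1.0).
    float apply(String tld, float initialBoost) {
        Float domainBoost = boosts.get(tld);
        return domainBoost == null ? initialBoost : initialBoost * domainBoost;
    }
}
```

With a configuration such as Map.of("edu", 1.1f, "biz", 0.5f), an "edu" document gets its
initial boost multiplied by 1.1, a "biz" document has its boost halved, and a "com" document
keeps its initial boost.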

Users can also query the tld field for advanced search.

Implementation notes:
  1. OpicScoringFilter is changed to respect ScoringFilter chaining.
  2. Some second-level domains, such as co.uk, are not recognized, but edu.uk is recognized.



> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch
>
>
> Top-level domains (TLDs) are the last part(s) of the host name in the DNS. TLDs are
> managed by the Internet Assigned Numbers Authority (IANA). IANA divides TLDs into three
> groups: infrastructure, generic (such as "com", "edu") and country-code TLDs (such as
> "uk", "de", "tr"). Indexing the top-level domain and optionally boosting it is needed to
> improve the search results and enhance locality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

