nutch-dev mailing list archives

From "Zhang JinYan (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
Date Tue, 01 Nov 2011 15:53:32 GMT
MoreIndexingFilter refactor: move data formats used to parse "lastModified" to a config file.
---------------------------------------------------------------------------------------------

                 Key: NUTCH-1190
                 URL: https://issues.apache.org/jira/browse/NUTCH-1190
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 1.4
         Environment: jdk6
            Reporter: Zhang JinYan


There are many issues about missing date formats:
[NUTCH-871|https://issues.apache.org/jira/browse/NUTCH-871]
[NUTCH-912|https://issues.apache.org/jira/browse/NUTCH-912]
[NUTCH-1015|https://issues.apache.org/jira/browse/NUTCH-1015]

Date formats vary widely, so why not move them to a separate config file?
I moved all the date formats from "MoreIndexingFilter.java" to a file named "date-styles.txt",
which is loaded on startup.
{code}
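  // Needs java.io.BufferedReader, java.io.InputStreamReader,
  // java.util.ArrayList and org.apache.commons.io.IOUtils.
  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);

    URL res = conf.getResource("date-styles.txt");
    if (res == null) {
      LOG.error("Can't find resource: date-styles.txt");
      return;
    }
    // Read through the URL stream rather than new File(res.getFile()),
    // so the resource also loads when it is packaged inside a jar.
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = null;
    try {
      reader = new BufferedReader(new InputStreamReader(res.openStream(), "UTF-8"));
      String line;
      while ((line = reader.readLine()) != null) {
        String dateStyle = line.trim();
        // Skip blank lines and "#" comment lines.
        if (dateStyle.length() == 0 || dateStyle.startsWith("#")) {
          continue;
        }
        styles.add(dateStyle);
      }
      dateStyles = styles.toArray(new String[styles.size()]);
    } catch (IOException e) {
      // Pass the exception along so the cause shows up in the log.
      LOG.error("Failed to load resource: date-styles.txt", e);
    } finally {
      IOUtils.closeQuietly(reader);
    }
  }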
{code}
Then parse "lastModified" like this (sample):
{code}
  private long getTime(String date, String url) {
    ......
    Date parsedDate = DateUtils.parseDate(date, dateStyles);
    time = parsedDate.getTime();
    ......
    return time;
  }
{code}
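To make the idea concrete outside of Nutch, here is a minimal, self-contained sketch of the same approach: read one pattern per line, skip blank lines and "#" comments, then try each pattern in turn. The class name DateStyles, the methods load and parse, and the sample patterns are hypothetical and not taken from the patch. It also uses java.text.SimpleDateFormat in place of commons-lang DateUtils.parseDate so that it runs with the JDK alone, and it assumes GMT for patterns that carry no time zone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper for illustration; not part of the NUTCH-1190 patch. */
final class DateStyles {

  /** Reads one pattern per line, skipping blank lines and "#" comments. */
  static List<String> load(Reader in) throws IOException {
    List<String> styles = new ArrayList<String>();
    BufferedReader reader = new BufferedReader(in);
    String line;
    while ((line = reader.readLine()) != null) {
      String style = line.trim();
      if (style.length() == 0 || style.startsWith("#")) {
        continue;
      }
      styles.add(style);
    }
    return styles;
  }

  /** Tries each pattern in order; returns epoch millis, or -1 if none match. */
  static long parse(String date, List<String> styles) {
    for (String style : styles) {
      SimpleDateFormat format = new SimpleDateFormat(style, Locale.US);
      format.setLenient(false);
      // Assumption: dates without an explicit zone are treated as GMT.
      format.setTimeZone(TimeZone.getTimeZone("GMT"));
      try {
        return format.parse(date).getTime();
      } catch (ParseException e) {
        // This pattern did not match; try the next one.
      }
    }
    return -1L;
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for the contents of date-styles.txt (patterns are assumed).
    String sample = "# example patterns\n"
        + "\n"
        + "EEE, dd MMM yyyy HH:mm:ss zzz\n"
        + "  yyyy-MM-dd HH:mm:ss  \n";
    List<String> styles = load(new StringReader(sample));
    System.out.println(styles.size() + " patterns loaded");
    System.out.println(parse("Tue, 01 Nov 2011 15:53:32 GMT", styles));
    System.out.println(parse("not a date", styles));
  }
}
```

With this layout, supporting a new date format means adding a line to the text file, with no recompilation. Patterns are tried in file order, so more specific patterns should probably come first.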
This patch also contains the patch for [NUTCH-1140|https://issues.apache.org/jira/browse/NUTCH-1140].
Find more details in the patch file.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
