commons-dev mailing list archives

From Steve Cohen <sco...@javactivity.org>
Subject Re: [net] FTP client date parsing: new format
Date Sat, 16 Apr 2005 21:31:04 GMT
Well, now I AM satisfied.  I believe this is about as good as can be 
done using regular expressions.  I tried to do as little date format 
validation as possible in the regex, but it is inevitable that some must 
be done.  Before Neeme Praks' discoveries, the code relied on an 
assumption that was too good to be true and therefore brittle: that the 
date portion of a listing line could ALWAYS be located by looking for 
three whitespace-delimited tokens in the right place.

The new algorithm is that a date is either
a) a single token with three all-numeric portions delimited by "-" or "/"
OR
b) two whitespace-delimited tokens
FOLLOWED by whitespace
FOLLOWED by a single token that is either all numeric (the year) or two 
numeric portions delimited by a colon (the time).
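
Roughly, and only as a sketch: the date part of that description can be 
written as a two-group regex and embedded in a simplified listing-line 
pattern.  The class name ListingDateSketch, the parse() helper and the 
simplified line prefix are made up for illustration; this is not the 
actual commons-net pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingDateSketch {
    // Group 1: either (a) one token of three numeric parts joined by "-" or "/",
    //          or (b) two whitespace-delimited tokens (e.g. "Jan 24").
    // Group 2: a year (all digits) or a time (digits ":" digits).
    private static final String DATE =
        "((?:\\d+[-/]\\d+[-/]\\d+)|(?:\\S+\\s+\\S+))\\s+(\\d+(?::\\d+)?)";

    // Simplified, hypothetical listing line: type char, 9 permission chars,
    // link count, user, group, size, date, name, and whatever follows the name.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S)\\S{9}\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+(\\d+)\\s+"
        + DATE + "\\s+(\\S*)(\\s*.*)$");

    /** Returns {datestr, name, endtoken}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String datestr = m.group(6) + " " + m.group(7);
        return new String[] { datestr, m.group(8), m.group(9) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "-rw-r-----   1 neeme neeme   346 2005-04-08 11:22 services.vsp",
            "lrwxrwxrwx   1 neeme neeme    23 2005-03-02 18:06 macros -> ./../../global/macros/.",
            "-rw-r--r--    1 1000     1000           27 Jan 24 11:31 messages.vsp"
        };
        for (String line : lines) {
            String[] r = parse(line);
            System.out.println(r == null ? "no match" : r[0] + " | " + r[1] + " |" + r[2]);
        }
    }
}
```

Because the numeric alternative is tried first and the final token must 
be a year or a time, the name after a numeric date is no longer 
swallowed into the date string in the symbolic link case.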

All existing JUnit tests pass and no length or date-specific assumptions 
are made anywhere in the regex, leaving all such decisions to the 
DateFormat objects.
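
To illustrate that division of labour, here is a minimal sketch of the 
second step, in which the extracted date string is handed to a 
SimpleDateFormat.  The class DateStringSketch, its parse() helper, the 
pattern "yyyy-MM-dd HH:mm" and the choice of UTC are assumptions for the 
example, not what commons-net itself configures.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateStringSketch {
    /** Parses datestr with the given pattern; returns null if it is not a valid date. */
    static Date parse(String datestr, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ENGLISH);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed zone for the example
        fmt.setLenient(false); // reject out-of-range fields such as month 13
        try {
            return fmt.parse(datestr);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The regex only isolates the tokens; validity is decided here.
        System.out.println(parse("2005-04-08 11:22", "yyyy-MM-dd HH:mm"));
        System.out.println(parse("2005-13-40 11:22", "yyyy-MM-dd HH:mm"));
    }
}
```

The regex would happily extract "2005-13-40 11:22" as a date string; it 
is the non-lenient DateFormat that rejects it.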

Steve Cohen wrote:
> Okay, we've solved the immediate issues here but I'm not totally 
> satisfied yet.  The problem is that the numeric date format has 
> introduced a new logical possibility.  Formerly it was simple and clear 
> - either with the default or recent date formats there were always THREE 
> whitespace-separated components of the date (month day year OR month day 
> time).  The newly-introduced numeric date format in unix ftp servers 
> (about time, by the way) adds a new possibility of a timestamp composed 
> of TWO whitespace-separated components.  But until this becomes 
> widespread and probably forever, we'll have to maintain backward 
> compatibility with the older non-numeric formats.
> 
> Making the third token optional is reasonable, but as we have seen, 
> Neeme's find of the symbolic link case defeats this simple attempt at a 
> fix.  With the two-or-three token date, it is possible that the regex 
> engine will find an extra token later on and screw up the logic.
> 
> My current solution relies on the fact that a unix filename is not 
> supposed to start with a hyphen.  ([^-\\s]\\S*)  So "->", the symlink 
> indicator, will not be mistaken for a filename.  But that still doesn't 
> feel solid enough.
> 
> I would feel better if we had a more solid regex that clearly captured 
> what is and what is not a legal unix filename.  Googling did not turn up 
> an immediate answer to this question, nor did I find one in Jeffrey 
> Friedl's "Mastering Regular Expressions" book.  Does anyone have one?
> 
> 
> Steve Cohen wrote:
> 
>> Sorry for being a bit brusque before, but if you check out the latest 
>> code I think you will find that with Rory's and my changes, your 
>> issues are taken care of.
>>
>>
>> Neeme Praks wrote:
>>
>>>
>>> ok, now I checked out the recent changes and the fix seems to work, 
>>> at least in the case of usual files:
>>> -rw-r-----   1 neeme neeme   346 2005-04-08 11:22 services.vsp
>>> is parsed into:
>>>    typeStr=-
>>>    hardLinkCount=1
>>>    usr=neeme
>>>    grp=neeme
>>>    filesize=346
>>>    datestr=2005-04-08 11:22
>>>    name=services.vsp
>>>    endtoken=
>>> And this is correct.
>>>
>>> However, it still breaks in the case of symbolic links.
>>> So, if the entry is a symbolic link:
>>> lrwxrwxrwx   1 neeme neeme    23 2005-03-02 18:06 macros -> 
>>> ./../../global/macros/.
>>> then it is parsed into these variables:
>>>   typeStr=l
>>>   hardLinkCount=1
>>>   usr=neeme
>>>   grp=neeme
>>>   filesize=23
>>>   datestr=2005-03-02 18:06 macros
>>>   name=->
>>>   endtoken= ../../../global/macros/
>>>
>>> The ending of "-> ../../../global/macros/" seems to confuse the 
>>> regexp parser.
>>>
>>> And to answer Rory's question about the specifics of the FTP server, 
>>> I'll paste one of my earlier posts here:
>>> This format is from the default FTP server daemon configuration that 
>>> came with Debian:
>>> Connected to stf.
>>> 220 stf FTP server (Version 6.4/OpenBSD/Linux-ftpd-0.17) ready.
>>> Name (stf:neeme): neeme
>>> 331 Password required for neeme.
>>> Password:
>>> 230- Linux stf 2.6.11 #1 SMP Wed Mar 2 14:08:21 CET 2005 i686 GNU/Linux
>>> 230-
>>> 230- The programs included with the Debian GNU/Linux system are free 
>>> software;
>>> 230- the exact distribution terms for each program are described in the
>>> 230- individual files in /usr/share/doc/*/copyright.
>>> 230-
>>> 230- Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
>>> 230- permitted by applicable law.
>>> 230 User neeme logged in.
>>> Remote system type is UNIX.
>>> Using binary mode to transfer files.
>>> ftp>
>>>
>>> Rgds,
>>> Neeme
>>>
>>> Neeme Praks wrote:
>>>
>>>>
>>>> AFAIK, the new system uses both: regexp for extracting the timestamp 
>>>> from the entry line and then using DateFormat to parse that.
>>>> Example:
>>>> -rw-r--r--    1 1000     1000           27 Jan 24 11:31 messages.vsp
>>>> from this line the regexp extracts the timestamp part ("Jan 24 
>>>> 11:31") and then DateFormat is used to parse this to a valid Date 
>>>> object.
>>>> The issue here is that the failure is already at regexp matching, 
>>>> and the code never reaches the DateFormat parsing part.
>>>>
>>>> I'll try to check out Rory's changes during the weekend.
>>>>
>>>> Rgds,
>>>> Neeme
>>>>
>>>> Steve Cohen wrote:
>>>>
>>>>> No, that's not it at all.  Remember that the new system does not 
>>>>> use regexes for date parsing; it uses SimpleDateFormats.  From Mr. 
>>>>> Praks' descriptions, I am assuming he's now running version 1.3 or 
>>>>> earlier, which does use regexes.
>>>>>
>>>>> This surprises me because I've had several conversations with him 
>>>>> over the past month in which the new system was discussed.  Perhaps 
>>>>> he forgot to specify the date format as "yyyy/MM/dd" in his 
>>>>> FTPClientConfig this time?  Or perhaps his code is finding an older 
>>>>> commons-net.jar than he was expecting?
>>>>>
>>>>> Steve Cohen
>>>>>
>>>>> Rory Winston wrote:
>>>>>
>>>>>> Right, the problem with this format is that the date is not 
>>>>>> composed of three discrete components (from a regex POV), but two.
>>>>>> Basically what we will need to do is expand the regex to handle 
>>>>>> this - can you give me details of the FTP server operating system
>>>>>> and FTP server software version, if you have them, please?
>>>>>>
>>>>>> Cheers
>>>>>> Rory
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>
>>
>>
> 
> 
> 
> 
> 

