cocoon-dev mailing list archives

From Andrew Answer <A.Nuzh...@ftc.ru>
Subject Re[2]: text parser
Date Wed, 13 Feb 2002 12:15:11 GMT
Hello Stephan,

>*************Original message*************
> From: Stephan Michels <stephan@vern.chem.tu-berlin.de>
> To: Stephan Michels <cocoon-dev@xml.apache.org>
> Date: Wednesday, February 13, 2002, 5:18:34 PM
> Subject: text parser


> On Wed, 13 Feb 2002, Andrew Answer wrote:

>> Hello Stephan,
>>
>>   It is a good idea! I am currently converting many text documents to XML
>>   with offline PHP scripts...
>>   Some names for your parser: txt2xml (simple and clear),

> A project with this name already exists:
> http://xml.gsfc.nasa.gov/ingest_demo/txt2XML.html

>>   JTF (Java Text Formatter),

> Look at JTF.org: Jewish Task Force ;-)

>>   JTC (Java Text Converter).

> http://www.jtc.com/ is also taken

> Finding a name isn't as easy as I thought. :(

  Hmmm... why reject a project name just because the domain name is taken?
  There are more projects than domains :)

>>   Also look at APTConvert
>>   (http://www.xmlmind.com/aptconvert/distrib/docs/userguidetoc.html),
>>   maybe this tool can help you.

> I think my project could help you.

> An example grammar looks like:
> <grammar>
>  <tokens>
>   <token tsymbol="id">
>    <concat>
>     <cc><ci min="A" max="Z"/><ci min="a" max="z"/></cc>
>     <cc minOccurs="0" maxOccurs="*">
>      <ci min="A" max="Z"/><ci min="a" max="z"/><ci min="0" max="9"/>
>      <cs content="_"/>
>     </cc>
>    </concat>
>   </token>

>   <token tsymbol="mult" assoc="right">
>    <string content="*"/>
>   </token>

>   <token tsymbol="plus" assoc="left">
>    <string content="+"/>
>   </token>

>   <token tsymbol="dopen">
>    <string content="("/>
>   </token>

>   <token tsymbol="dclose">
>    <string content=")"/>
>   </token>

>  </tokens>

>  <whitespace>
>   <cc maxOccurs="*"><cs content="&#10;&#13;&#9;&#32;"/></cc>
>  </whitespace>

>  <productions>

>   <production ntsymbol="E">
>    <ntsymbol name="E"/><tsymbol name="plus"/><ntsymbol name="E"/>
>   </production>

>   <production ntsymbol="E">
>    <ntsymbol name="E"/><tsymbol name="mult"/><ntsymbol name="E"/>
>   </production>

>   <production ntsymbol="E">
>    <tsymbol name="dopen"/><ntsymbol name="E"/><tsymbol name="dclose"/>
>   </production>

>   <production ntsymbol="E">
>    <tsymbol name="id"/>
>   </production>

>  </productions>

>  <ssymbol ntsymbol="E"/>
> </grammar>
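
To make the token definitions in the grammar above more concrete, here is a
minimal Python sketch that turns each <token> into a regular expression. The
meaning of the elements is inferred from the example and is an assumption:
<cc> is read as a character class, <ci> as a character interval, <cs> as a
character set, <concat> as concatenation, and <string> as a literal. The
default of 1 for minOccurs/maxOccurs and the helper names (to_regex,
quantifier, compile_tokens) are hypothetical too.

```python
import re
import xml.etree.ElementTree as ET

# Fragment of the quoted grammar: one composite token and one literal token.
GRAMMAR = """
<tokens>
 <token tsymbol="id">
  <concat>
   <cc><ci min="A" max="Z"/><ci min="a" max="z"/></cc>
   <cc minOccurs="0" maxOccurs="*">
    <ci min="A" max="Z"/><ci min="a" max="z"/><ci min="0" max="9"/>
    <cs content="_"/>
   </cc>
  </concat>
 </token>
 <token tsymbol="plus" assoc="left">
  <string content="+"/>
 </token>
</tokens>
"""


def quantifier(elem):
    """Map minOccurs/maxOccurs to a regex quantifier (defaults of 1 are assumed)."""
    low = elem.get("minOccurs", "1")
    high = elem.get("maxOccurs", "1")
    if (low, high) == ("1", "1"):
        return ""
    return "{%s,%s}" % (low, "" if high == "*" else high)


def to_regex(elem):
    """Translate one grammar element into regex source text."""
    if elem.tag == "string":
        return re.escape(elem.get("content"))
    if elem.tag == "concat":
        return "".join(to_regex(child) for child in elem)
    if elem.tag == "cc":
        parts = []
        for child in elem:
            if child.tag == "ci":
                parts.append(re.escape(child.get("min")) + "-" + re.escape(child.get("max")))
            elif child.tag == "cs":
                parts.append("".join(re.escape(ch) for ch in child.get("content")))
            else:
                raise ValueError("unexpected element in <cc>: " + child.tag)
        return "[" + "".join(parts) + "]" + quantifier(elem)
    raise ValueError("unsupported element: " + elem.tag)


def compile_tokens(xml_text):
    """Return a dict mapping each tsymbol name to a compiled pattern."""
    root = ET.fromstring(xml_text)
    return {tok.get("tsymbol"): re.compile(to_regex(tok[0])) for tok in root.iter("token")}


patterns = compile_tokens(GRAMMAR)
print(patterns["id"].pattern)
```

Read this way, the "id" token is a letter followed by any number of letters,
digits, or underscores.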

> This grammar converts the string "A*b+c*D+(e+F)*G" to

> <E>
>  <E>
>   <E>
>    <E>
>     <id>A</id>
>    </E>
>    <mult>*</mult>
>    <E>
>     <id>b</id>
>    </E>
>   </E>
>   <plus>+</plus>
>   <E>
>    <E>
>     <id>c</id>
>    </E>
>    <mult>*</mult>
>    <E>
>     <id>D</id>
>    </E>
>   </E>
>  </E>
>  <plus>+</plus>
>  <E>
>   <E>
>    <dopen>(</dopen>
>    <E>
>     <E>
>      <id>e</id>
>     </E>
>     <plus>+</plus>
>     <E>
>      <id>F</id>
>     </E>
>    </E>
>    <dclose>)</dclose>
>   </E>
>   <mult>*</mult>
>   <E>
>    <id>G</id>
>   </E>
>  </E>
> </E>
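
The tree above can be reproduced with a small precedence-climbing parser. The
Python sketch below is not Stephan's implementation, only an illustration of
how the productions and the assoc attributes could yield that output. The
precedence values (mult binds tighter than plus) are an assumption read off
the tree, since the grammar itself does not state them, and all function
names are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Token names mirror the tsymbol attributes of the quoted grammar.
TOKEN_RE = re.compile(
    r"[ \t\r\n]*(?:(?P<id>[A-Za-z][A-Za-z0-9_]*)|(?P<mult>\*)|(?P<plus>\+)"
    r"|(?P<dopen>\()|(?P<dclose>\)))"
)

# (precedence, associativity). Associativity comes from the grammar's assoc
# attributes; the precedence numbers are an assumption.
OPS = {"plus": (1, "left"), "mult": (2, "right")}


def tokenize(text):
    """Split the input into (tsymbol, text) pairs, skipping whitespace."""
    text = text.rstrip(" \t\r\n")
    pos, tokens = 0, []
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError("unexpected character at offset %d" % pos)
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse(tokens):
    """Build a tree of ("E", [children]) nodes with (tsymbol, text) leaves."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def take(expected):
        nonlocal pos
        kind, text = peek()
        if kind != expected:
            raise SyntaxError("expected %s, found %s" % (expected, kind))
        pos += 1
        return (kind, text)

    def primary():
        # E -> dopen E dclose | id
        if peek()[0] == "dopen":
            return ("E", [take("dopen"), expression(1), take("dclose")])
        return ("E", [take("id")])

    def expression(min_prec):
        # E -> E plus E | E mult E, resolved by precedence and associativity
        left = primary()
        while peek()[0] in OPS and OPS[peek()[0]][0] >= min_prec:
            prec, assoc = OPS[peek()[0]]
            op = take(peek()[0])
            right = expression(prec + 1 if assoc == "left" else prec)
            left = ("E", [left, op, right])
        return left

    tree = expression(1)
    if pos != len(tokens):
        raise SyntaxError("unexpected token %s" % peek()[0])
    return tree


def to_xml(node, depth=0):
    """Serialize the tree with one space of indentation per level."""
    name, content = node
    pad = " " * depth
    if isinstance(content, str):
        return "%s<%s>%s</%s>" % (pad, name, escape(content), name)
    inner = "\n".join(to_xml(child, depth + 1) for child in content)
    return "%s<%s>\n%s\n%s</%s>" % (pad, name, inner, pad, name)


def convert(text):
    return to_xml(parse(tokenize(text)))


print(convert("A*b+c*D+(e+F)*G"))
```

Because plus is left-associative, "a+b+c" groups as (a+b)+c, while the
right-associative mult groups "a*b*c" as a*(b*c).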

  A well-designed engine! It looks like an XML parser...
  Suggestions:
  I have worked with byacc/flex, but I have already forgotten their syntax.
  Maybe the DTD of your grammar could be made more readable?
  Then you could even write a stylesheet that converts a byacc grammar
  into your grammar and use it with your parser; that would be a good test.
  What about whitespace? Unlike XML, text files need to distinguish one
  CR/LF from two and apply different rules, etc...
  Maybe you could also produce one text from another (line formatting,
  justification, list formatting, etc.)?
  And later you could turn it into a Generator/Transformer (but I think
  it must produce a SAX stream to work properly)...
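
  To illustrate the whitespace and SAX points, here is a minimal Python
  sketch (Cocoon itself is Java, so this only shows the idea). It treats a
  single line break as a line break and two or more as a paragraph break, and
  reports the result as SAX events. The element names doc, p, and br and the
  function name text_to_sax are hypothetical.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator


def text_to_sax(text, handler):
    """Feed plain text to a SAX ContentHandler.

    Blocks separated by one or more blank lines become <p> elements;
    a single line break inside a block becomes an empty <br> element.
    """
    # Normalize CR/LF and bare CR so only "\n" has to be handled below.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")

    handler.startDocument()
    handler.startElement("doc", {})
    for block in re.split(r"\n\s*\n", normalized.strip()):
        if not block:
            continue
        handler.startElement("p", {})
        for index, line in enumerate(block.split("\n")):
            if index > 0:
                handler.startElement("br", {})
                handler.endElement("br")
            handler.characters(line)
        handler.endElement("p")
    handler.endElement("doc")
    handler.endDocument()


# Usage: serialize the events with the standard library's XMLGenerator.
out = io.StringIO()
text_to_sax("first line\r\nsecond line\r\n\r\nnew paragraph", XMLGenerator(out, encoding="utf-8"))
print(out.getvalue())
```

  Any other ContentHandler could take the place of XMLGenerator here, which
  is what would let such a converter sit at the start of a pipeline.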
  
  Happy hacking!

Best regards,
  Andrew Answer               A.Nuzhdov@ftc.ru



