incubator-odf-dev mailing list archives

From "Rob Weir (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ODFTOOLKIT-308) GSoC: ODF Command Line Tools
Date Mon, 19 Mar 2012 21:29:38 GMT

     [ https://issues.apache.org/jira/browse/ODFTOOLKIT-308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Weir updated ODFTOOLKIT-308:
--------------------------------

    Description: 
==Background on our open source project==

The Apache ODF Toolkit is a set of Java modules that allow programmatic creation, scanning
and manipulation of Open Document Format (ISO/IEC 26300 == ODF) documents. Unlike other approaches
which rely on runtime manipulation of heavy-weight editors via an automation interface, the
ODF Toolkit is lightweight and ideal for server use. 

http://incubator.apache.org/odftoolkit/index.html

==The Idea==

GNU/Linux, and UNIX before it, have shown the great power of text processing via simple
command line tools, combined with operating system facilities for piping and redirection.
This filter-based text processing is what makes shell programming so powerful.  But it only
works well for pure text documents.  What about WYSIWYG documents, such as word processor
documents, spreadsheets and presentations, with more complex formats?  There the existing
tool set becomes far weaker.

The Apache ODF Toolkit is a Java API that gives a high level view of a document, and enables
programmatic manipulation of a document.  We have functions for doing things like search &
replace, adding paragraphs, accessing cells in a spreadsheet, etc., all from a Java application.
No traditional editor is involved.  It is pure Java, something you could even run on a server.

You can look at our "cookbook" for examples of our "Simple API" in action:

http://incubator.apache.org/odftoolkit/simple/document/cookbook/index.html


There is a lot you can do using this API.  But it still requires Java programming, and that
limits its reach to professional programmers.

What if we could write, using the ODF Toolkit, a set of command line utilities that made it
easy to do both simple and complex text manipulation tasks from a command line, things like:

1) Concatenate documents
2) Replace slide 3 in presentation A with slide 3 from presentation B
3) Apply the styles of document A to all documents in the current directory
4) Find all occurrences of "sausages" in the given document and add a hyperlink to sausages.com

and so on.
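
To make task 1 concrete, here is a minimal sketch in Python, using only the standard library
rather than the ODF Toolkit. It works on bare content.xml strings, and the names content_xml
and odf_cat are hypothetical, chosen for this illustration. A real tool would also have to
unzip the package and merge styles, images and the manifest, all of which is skipped here.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the ODF specification for content.xml.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)


def content_xml(paragraphs):
    """Build a tiny stand-in for a text document's content.xml (hypothetical helper)."""
    root = ET.Element(f"{{{OFFICE_NS}}}document-content")
    body = ET.SubElement(root, f"{{{OFFICE_NS}}}body")
    text = ET.SubElement(body, f"{{{OFFICE_NS}}}text")
    for p in paragraphs:
        ET.SubElement(text, f"{{{TEXT_NS}}}p").text = p
    return ET.tostring(root, encoding="unicode")


def odf_cat(*documents):
    """Append the body content of each later document to the first one."""
    first = ET.fromstring(documents[0])
    target = first.find(f".//{{{OFFICE_NS}}}text")
    for doc in documents[1:]:
        source = ET.fromstring(doc).find(f".//{{{OFFICE_NS}}}text")
        target.extend(list(source))
    return ET.tostring(first, encoding="unicode")


merged = odf_cat(content_xml(["Chapter one."]),
                 content_xml(["Chapter two.", "The end."]))
texts = [p.text for p in ET.fromstring(merged).iter(f"{{{TEXT_NS}}}p")]
print(texts)
```

Even this toy version shows why a dedicated tool is useful: the concatenation happens on the
document structure, not on the raw bytes, so a plain cat would not do.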

The audience for such a tool could be:

1) Data wranglers, who want to extract information from a large number of ODF documents. 

2) Power users who want to automate repetitive document tasks, like filling
in a template, or an off-line mail merge

3) QA testers of office editors, who use simple scripts to generate test cases as well as
to test editor-generated documents for correctness

4) Web developers who want to generate a data-driven document on-the-fly 

So think generally in that space. Not system programmers.  Not application developers.  But
command line gurus, with a little scripting ability at most.  That is the  "sweet spot".

Some technical aspects you might want to consider:

1)    The real value of the Unix text utilities is that they can easily be combined. For
example, I recently did this to search for all openoffice.org email addresses in a downloaded
copy of the openoffice website, deduping and sorting by how many times each address appeared:


grep -o -r -i --no-filename --include="*.html" "[[:alnum:]+\.\_\-]*@openoffice.org" . |
  sort | uniq -c | sort -n -r

So, powerful command line tools that each do one thing well. And then a way to pipe the outputs
of one to become the inputs of another.   Can we define a similar set of basic operations
on ODF documents, as well as the glue to combine these commands into more powerful pipelines?


2) Useful example tools are cat, grep, diff, sed, etc. Maybe even something awk-like that
works with spreadsheets?  No need to be slavish to the original tools, but create something
of similar power that operates on ODF documents.

3)  The trick will be that an ODF document is a ZIP file containing multiple XML files, and
possibly other resources, like JPG images. If we pipe the binary ZIP, then we're forcing each
tool in the chain to do the uncompress/compress, which is bad for performance. There is also
the issue of repeated parsing/serialization of the XML.  So how can we do this all efficiently?
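
One possible answer to the efficiency question in point 3, sketched in Python with only the
standard library (not the ODF Toolkit API): unzip and parse once, run several operations on
the in-memory tree, and zip once at the end. All function names (make_odt, unpack, pack, grep,
replace) are hypothetical. The package built here is deliberately minimal and omits
META-INF/manifest.xml, so it is an illustration and not a fully valid ODF file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the ODF specification for content.xml.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("office", OFFICE_NS)
ET.register_namespace("text", TEXT_NS)
ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def pack(root):
    """Serialize the XML tree and zip it: the costly 'write' step."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A ZipInfo entry defaults to ZIP_STORED, so mimetype stays uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), ODT_MIMETYPE)
        zf.writestr("content.xml", ET.tostring(root, encoding="utf-8"),
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def unpack(odt_bytes):
    """Unzip and parse content.xml: the costly 'read' step."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as zf:
        return ET.fromstring(zf.read("content.xml"))


def make_odt(paragraphs):
    """Build a minimal ODT-like package in memory (test data only)."""
    root = ET.Element(f"{{{OFFICE_NS}}}document-content")
    body = ET.SubElement(root, f"{{{OFFICE_NS}}}body")
    text = ET.SubElement(body, f"{{{OFFICE_NS}}}text")
    for p in paragraphs:
        ET.SubElement(text, f"{{{TEXT_NS}}}p").text = p
    return pack(root)


def grep(root, pattern):
    """Return the text of every text:p paragraph matching the regex."""
    rx = re.compile(pattern)
    texts = ("".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p"))
    return [t for t in texts if rx.search(t)]


def replace(root, old, new):
    """Replace literal text in place and return root so calls can be chained.

    Simplification: a match split across two elements is not found.
    """
    for el in root.iter():
        if el.text:
            el.text = el.text.replace(old, new)
        if el.tail:
            el.tail = el.tail.replace(old, new)
    return root


# Parse once, run two operations on the in-memory tree, serialize once.
doc = make_odt(["I like sausages.", "No match here.", "More sausages, please."])
tree = replace(unpack(doc), "sausages", "bacon")
hits = grep(tree, r"bacon")
result = pack(tree)
print(hits)
```

If each step were a separate process piping the binary ZIP, unpack and pack would run once per
tool; here they run once per pipeline. Whether the real tools should share a parsed document
in one process, or exchange some intermediate form between processes, is left open.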
 


Note:  These are just ideas to get you thinking in this general area. I would be pleased to
review any GSoC proposals related to the ODF Toolkit.

  was:
GNU/Linux, and UNIX before it, have shown the great power of text processing via simple
command line tools, combined with operating system facilities for piping and redirection.
This filter-based text processing is what makes shell programming so powerful.  But it only
works well for text documents.  But what about more complex, WYSIWYG documents, spreadsheets,
word processors, with more complex formats, often not text based at all?  The tool set becomes
far weaker.

The Apache ODF Toolkit is a Java API that gives a high level view of a document, and enables
programmatic manipulation of a document.  We have functions for doing things like search &
replace.  There is a lot you can do using the ODF Toolkit.  But it still requires Java programming,
and that limits its reach to professional programmers.

What if we could write, using the ODF Toolkit, a set of command line utilities that made it
easy to do both simple and complex text manipulation tasks from a command line, things like:

1) Concatenate documents
2) Replace slide 3 in presentation A with slide 3 from presentation B
3) Apply the styles of document A to all documents in the current directory
4) Find all occurrences of "sausages" in the given document and add a hyperlink to sausages.com

and so on.

Analogs of cat, grep, diff and sed are obvious ones. Maybe something awk-like that
works with spreadsheets?  No need to be slavish to the original tools, but create something
of similar power that operates on ODF documents.  For example, an alternative solution
might be to write a new shell processor that has native commands for ODF document manipulation.

    
> GSoC:  ODF Command Line Tools
> -----------------------------
>
>                 Key: ODFTOOLKIT-308
>                 URL: https://issues.apache.org/jira/browse/ODFTOOLKIT-308
>             Project: ODF Toolkit
>          Issue Type: New Feature
>            Reporter: Rob Weir
>            Assignee: Rob Weir
>              Labels: gsoc2012, mentor
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
