lucene-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1925) CSV Response Writer
Date Thu, 15 Jul 2010 01:53:52 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888670#action_12888670 ]

Chris A. Mattmann commented on SOLR-1925:
-----------------------------------------

Hi Yonik:

Thanks. Replies below:

{quote}
    *  loses info by removing newlines
{quote}

This only happens when {noformat}&excel=true{noformat}, and it actually adds functionality:
without it, you can't load the data into Excel (see my comments above and
in the code).

{quote}
    * always encapsulates with quotes - not as readable
{quote}

See the CSV spec, linked via Wikipedia in the code. Quoting every value reduces ambiguity and
clearly delineates where each value starts and where it stops.

{quote}
    * doesn't escape encapsulator in values
{quote}

Is there a need to do this? I don't think so...
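
For reference, if we did decide to escape the encapsulator, the usual convention (RFC 4180) is to double any quote that appears inside a quoted value. A minimal sketch of that, not code from the patch; the class and method names are made up for illustration:

```java
public class CsvQuote {
    // Wrap a value in double quotes, doubling any embedded double quote
    // so a reader can tell where the value really ends.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }
}
```

With this, a value like say "hi" would be written as "say ""hi""" instead of a string with unbalanced quotes.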

{quote}
    * doesn't escape separator in multi-valued fields
{quote}

Same as above: no need, really.

{quote}
    * isn't really nested CSV, so it's not compatible with the CSVLoader
{quote}

What do you mean by not compatible with the CSVLoader?

{quote}
    * uses System.getProperty("line.separator")... we should avoid different behavior on different
platforms
{quote}

Hmm, I've never been dinged before for writing platform-independent code. That's why they
put the property in there: so that line.separator means the same thing, as a programming
construct, across platforms. So, I don't really get your ding here.
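
To make the two options concrete, here is a small sketch of a row terminated with the platform property versus a fixed terminator. The class and method names are hypothetical and not from the patch:

```java
public class LineEndings {
    // Fixed terminator: the same bytes on every platform.
    static final String FIXED = "\n";

    // Terminator taken from the JVM: typically "\n" on Unix-like systems
    // and "\r\n" on Windows, so the output bytes depend on the host.
    static String platformRow(String row) {
        return row + System.getProperty("line.separator");
    }

    static String fixedRow(String row) {
        return row + FIXED;
    }
}
```

The first form follows the conventions of whatever machine runs Solr; the second gives every client the same response regardless of where the server runs.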

{quote}
    * doesn't stream documents (dumping your entire index will be one use case)
{quote}

I actually implemented both the streaming method (#writeDoc) and the aggregate method (#writeAllDocs).
I set #isStreaming to false because that makes for clean CSV header writing, rather than
hacky code in #writeDoc to take care of the (potential) non-uniformity across documents. Additionally, I'm
using this in production right now, on the solr-1.5 branch with an index of over 1M documents,
and the write is quite fast.
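
Roughly, the aggregate approach lets the writer look at every document before emitting the header. The sketch below is a simplified illustration, not the code in the patch: it assumes documents are plain maps of field name to value, and the class and method names are invented. It uses a fixed "\n" only to keep the sketch's output deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CsvHeaderSketch {
    // Writes a header covering every field seen in any document,
    // then one row per document with empty cells for missing fields.
    static String write(List<Map<String, String>> docs) {
        // First pass: union of field names, in first-seen order.
        Set<String> fields = new LinkedHashSet<String>();
        for (Map<String, String> doc : docs) {
            fields.addAll(doc.keySet());
        }
        StringBuilder sb = new StringBuilder();
        appendRow(sb, fields);
        // Second pass: one row per document.
        for (Map<String, String> doc : docs) {
            List<String> row = new ArrayList<String>();
            for (String field : fields) {
                String value = doc.get(field);
                row.add(value == null ? "" : value);
            }
            appendRow(sb, row);
        }
        return sb.toString();
    }

    // Always encapsulates each cell in quotes, comma separated.
    private static void appendRow(StringBuilder sb, Iterable<String> cells) {
        boolean first = true;
        for (String cell : cells) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(cell).append('"');
            first = false;
        }
        sb.append('\n');
    }
}
```

A streaming writer only sees one document at a time, so it would have to know the full field list up front or patch the header afterwards, which is the hacky part I wanted to avoid.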

{quote}
    * performance: patterns shouldn't be compiled per-doc
{quote}

This only matters when {noformat}excel=true{noformat}, and I think the performance hit isn't
really an issue. If you feel strongly about it though we could always compile the pattern
above the loop, and reuse it...
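
Something along these lines would do it. This is only a sketch: the class name, the NEWLINES pattern and the helper methods are illustrative rather than the names used in the patch, and it assumes the excel=true path replaces line breaks with a space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ExcelNewlineStripper {
    // Compiled once when the class loads, instead of once per document.
    private static final Pattern NEWLINES = Pattern.compile("\\r\\n|\\r|\\n");

    // Replace each line break in a single value with a space.
    static String stripNewlines(String value) {
        return NEWLINES.matcher(value).replaceAll(" ");
    }

    // The per-document loop only creates matchers; it never recompiles.
    static List<String> stripAll(List<String> values) {
        List<String> out = new ArrayList<String>();
        for (String value : values) {
            out.add(stripNewlines(value));
        }
        return out;
    }
}
```

Pattern instances are immutable and safe to share, so a single static constant works even if several requests are being written at once.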

> CSV Response Writer
> -------------------
>
>                 Key: SOLR-1925
>                 URL: https://issues.apache.org/jira/browse/SOLR-1925
>             Project: Solr
>          Issue Type: New Feature
>          Components: Response Writers
>         Environment: indep. of env.
>            Reporter: Chris A. Mattmann
>            Assignee: Erik Hatcher
>             Fix For: Next
>
>         Attachments: SOLR-1925.Chheng.071410.patch.txt, SOLR-1925.Mattmann.053010.patch.2.txt,
SOLR-1925.Mattmann.053010.patch.3.txt, SOLR-1925.Mattmann.053010.patch.txt, SOLR-1925.Mattmann.061110.patch.txt
>
>
> As part of some work I'm doing, I put together a CSV Response Writer. It currently takes all the docs resulting from a query and then outputs their metadata in simple CSV format. The delimiter is configurable (by default, if there are multiple values for a particular field, they are separated with a | symbol).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

