Groups > comp.lang.python > #57487 > unrolled thread

Processing large CSV files - how to maximise throughput?

Started by	Victor Hooi <victorhooi@gmail.com>
First post	2013-10-24 18:38 -0700
Last post	2013-10-26 08:53 +0000
Articles	14 — 9 participants

Back to article view | Back to comp.lang.python

  Processing large CSV files - how to maximise throughput? Victor Hooi <victorhooi@gmail.com> - 2013-10-24 18:38 -0700
    Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 02:10 +0000
      Re: Processing large CSV files - how to maximise throughput? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-25 03:35 +0000
        Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 03:57 +0000
        Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 17:13 +1100
        Re: Processing large CSV files - how to maximise throughput? Stefan Behnel <stefan_ml@behnel.de> - 2013-10-25 08:39 +0200
        Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 18:26 +1100
        Re: Processing large CSV files - how to maximise throughput? Dave Angel <davea@davea.name> - 2013-10-25 11:24 +0000
        Re: Processing large CSV files - how to maximise throughput? Chris Angelico <rosuav@gmail.com> - 2013-10-25 22:42 +1100
    Re: Processing large CSV files - how to maximise throughput? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-25 03:19 +0000
    Re: Processing large CSV files - how to maximise throughput? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-25 04:46 +0100
    Re: Processing large CSV files - how to maximise throughput? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-25 19:44 -0400
      Re: Processing large CSV files - how to maximise throughput? Roy Smith <roy@panix.com> - 2013-10-25 20:22 -0400
    Re: Processing large CSV files - how to maximise throughput? Walter Hurry <walterhurry@lavabit.com> - 2013-10-26 08:53 +0000

#57487 — Processing large CSV files - how to maximise throughput?

From	Victor Hooi <victorhooi@gmail.com>
Date	2013-10-24 18:38 -0700
Subject	Processing large CSV files - how to maximise throughput?
Message-ID	<b4737555-cb4f-457b-aed7-a1e6553fe6a5@googlegroups.com>

Hi,

We have a directory of large CSV files that we'd like to process in Python.

We process each input CSV, then generate a corresponding output CSV file.

input CSV -> munging text, lookups etc. -> output CSV

My question is, what's the most Pythonic way of handling this? (Which I'm assuming 

For the reading, I'd

    with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
        csv_writer = DictWriter(output)
        for line in DictReader(input):
            # Do some processing for that line...
            output = process_line(line)
            # Write output to file
            csv_writer.writerow(output)
            
So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.

For the writing - my understanding is that it writes a line to the file object each loop iteration, however, this will only get flushed to disk every now and then, based on my system default buffer size, right?

So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?

Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite line - just a bunch of string splits and regexes.

If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

Cheers,
Victor

[toc] | [next] | [standalone]

#57488

From	Dave Angel <davea@davea.name>
Date	2013-10-25 02:10 +0000
Message-ID	<mailman.1494.1382667030.18130.python-list@python.org>
In reply to	#57487

On 24/10/2013 21:38, Victor Hooi wrote:

> Hi,
>
> We have a directory of large CSV files that we'd like to process in Python.
>
> We process each input CSV, then generate a corresponding output CSV file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this? (Which I'm assuming 
>
> For the reading, I'd
>
>     with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>         csv_writer = DictWriter(output)
>         for line in DictReader(input):
>             # Do some processing for that line...
>             output = process_line(line)
>             # Write output to file
>             csv_writer.writerow(output)
>             
> So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.
>
> For the writing - my understanding is that it writes a line to the file object each loop iteration, however, this will only get flushed to disk every now and then, based on my system default buffer size, right?
>
> So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?

No, the system will flush so often that you'll never use much memory.

>
> Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite line - just a bunch of string splits and regexes.

If you want help optimizing process_line(), you'd have to show us the
source.  For the regex, you can precompile it and not have to build it
each time.  Or just write the equivalent Python code, which many times
is faster than a regex.


>
> If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

Start multiple processes.  For what you're doing, there's probably no
point in multithreading.

And as always, in performance tuning, you never know till you measure.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#57491

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-25 03:35 +0000
Message-ID	<5269e6f6$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57488

On Fri, 25 Oct 2013 02:10:07 +0000, Dave Angel wrote:

>> If I have multiple large CSV files to deal with, and I'm on a
>> multi-core machine, is there anything else I can do to boost
>> throughput?
> 
> Start multiple processes.  For what you're doing, there's probably no
> point in multithreading.

Since the bottleneck will probably be I/O, reading and writing data from 
files, I expect threading actually may help.



-- 
Steven

[toc] | [prev] | [next] | [standalone]

#57494

From	Dave Angel <davea@davea.name>
Date	2013-10-25 03:57 +0000
Message-ID	<mailman.1497.1382673461.18130.python-list@python.org>
In reply to	#57491

On 24/10/2013 23:35, Steven D'Aprano wrote:

> On Fri, 25 Oct 2013 02:10:07 +0000, Dave Angel wrote:
>
>>> If I have multiple large CSV files to deal with, and I'm on a
>>> multi-core machine, is there anything else I can do to boost
>>> throughput?
>> 
>> Start multiple processes.  For what you're doing, there's probably no
>> point in multithreading.
>
> Since the bottleneck will probably be I/O, reading and writing data from 
> files, I expect threading actually may help.
>
>
>

We approach the tradeoff from opposite sides.  I would use
multiprocessing to utilize multiple cores unless the communication costs
(between the processes) would get too high.

They won't in this case.

But I would concur -- probably they'll both give about the same speedup.
I just detest the pain that multithreading can bring, and tend to avoid
it if at all possible.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#57495

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-25 17:13 +1100
Message-ID	<mailman.1498.1382681626.18130.python-list@python.org>
In reply to	#57491

On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel <davea@davea.name> wrote:
> But I would concur -- probably they'll both give about the same speedup.
> I just detest the pain that multithreading can bring, and tend to avoid
> it if at all possible.

I don't have a history of major pain from threading. Is this a Python
thing, or have I just been really really fortunate (growing up on OS/2
rather than Windows has definitely been, for me, a major boon)?
Generally, I find threads to be convenient, though of course not
always useful (especially in serialized languages).

ChrisA

[toc] | [prev] | [next] | [standalone]

#57497

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2013-10-25 08:39 +0200
Message-ID	<mailman.1499.1382683179.18130.python-list@python.org>
In reply to	#57491

Chris Angelico, 25.10.2013 08:13:
> On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel wrote:
>> But I would concur -- probably they'll both give about the same speedup.
>> I just detest the pain that multithreading can bring, and tend to avoid
>> it if at all possible.
> 
> I don't have a history of major pain from threading. Is this a Python
> thing, or have I just been really really fortunate

Likely the latter. Threads are ok if what they do is essentially what you
could easily use multiple processes for as well, i.e. process independent
data, maybe from/to independent files etc., using dedicated channels for
communication.

As soon as you need them to share any state, however, it's really easy to
get it wrong and to run into concurrency issues that are difficult to
reproduce and debug.

Basically, with multiple processes, you start with independent systems and
add connections specifically where needed, whereas with threads, you start
with completely shared state and then prune away interdependencies and
concurrency until it seems to work safely. That approach makes it
essentially impossible to prove that threading is safe in a given setup,
except for the really trivial cases.

Stefan

[toc] | [prev] | [next] | [standalone]

#57499

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-25 18:26 +1100
Message-ID	<mailman.1500.1382686313.18130.python-list@python.org>
In reply to	#57491

On Fri, Oct 25, 2013 at 5:39 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
> Basically, with multiple processes, you start with independent systems and
> add connections specifically where needed, whereas with threads, you start
> with completely shared state and then prune away interdependencies and
> concurrency until it seems to work safely. That approach makes it
> essentially impossible to prove that threading is safe in a given setup,
> except for the really trivial cases.

Not strictly true. With multiple threads, you start with completely
shared global state and completely independent local state (in
assembly language, shared data segment and separate stack). If you
treat your globals as either read-only or carefully controlled, then
it makes little difference whether you're forking processes or
spinning off threads, except that with threads you don't need special
data structures (IPC-based ones) for the global state.

For me, threading largely grew out of the same sorts of concerns as
recursion - as long as all your internal state is in locals, nothing
can hurt you. Of course, it's still far easier to shoot yourself in
the foot with threads than with processes, but for the tasks I've used
them for, I've never found footholes; that may, however, be inherent
to the simplicity of the two main jobs I used threads for: socket
handling (where nearly everything's I/O bound) and worker threads spun
off to let the GUI remain responsive (posting a message back to the
main thread when there's a result).

ChrisA

[toc] | [prev] | [next] | [standalone]

#57514

From	Dave Angel <davea@davea.name>
Date	2013-10-25 11:24 +0000
Message-ID	<mailman.1507.1382700309.18130.python-list@python.org>
In reply to	#57491

On 25/10/2013 02:13, Chris Angelico wrote:

> On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel <davea@davea.name> wrote:
>> But I would concur -- probably they'll both give about the same speedup.
>> I just detest the pain that multithreading can bring, and tend to avoid
>> it if at all possible.
>
> I don't have a history of major pain from threading. Is this a Python
> thing,

No, all my pain has been in C++.  But nearly all my Python code has
been written solo, so I can adopt strict rules.

The problem comes that with many people's sticky fingers in the code
pie, things that seem innocuous are broken once they can happen from
multiple threads.  And C++ does not give you any way to tell at a glance
where you're asking for trouble.

I've also been involved in projects that existed and were broken long
before I came on the scene.  Finding that something is broken when it
wasn't caused by any recent change is painful.  And in large, old C++
projects, memory management is convoluted.  Even when not broken,
sometimes performance takes big hits because multiple threads are
banging at the same cache line. (using thread affinity to make sure
each thread uses a consistent core)  So you have a dozen memory
allocation strategies, each with different rules and restrictions.

> or have I just been really really fortunate (growing up on OS/2
> rather than Windows has definitely been, for me, a major boon)?
> Generally, I find threads to be convenient, though of course not
> always useful (especially in serialized languages).
>

All my heavy duty multi-task/multi-thread stuff has been on Linux.

And Python floats above most of the issues I'm trying to describe.
E.g. you don't have to tune the memory management because it's out of
your control. Still my bias persists.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]

#57516

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-25 22:42 +1100
Message-ID	<mailman.1509.1382701329.18130.python-list@python.org>
In reply to	#57491

On Fri, Oct 25, 2013 at 10:24 PM, Dave Angel <davea@davea.name> wrote:
> On 25/10/2013 02:13, Chris Angelico wrote:
>
>> On Fri, Oct 25, 2013 at 2:57 PM, Dave Angel <davea@davea.name> wrote:
>>> But I would concur -- probably they'll both give about the same speedup.
>>> I just detest the pain that multithreading can bring, and tend to avoid
>>> it if at all possible.
>>
>> I don't have a history of major pain from threading. Is this a Python
>> thing,
>
> No, all my pain has been in C++.  But nearly all my Python code has
> been written solo, so I can adopt strict rules.
>
> The problem comes that with many people's sticky fingers in the code
> pie, things that seem innocuous are broken once they can happen from
> multiple threads.  And C++ does not give you any way to tell at a glance
> where you're asking for trouble.

Yeah, that is a big issue. And even the simple rule I mentioned
earlier (keep everything on the stack) isn't sufficient when you use
library functions that aren't thread-safe... even slabs of the
standard library aren't, and I'm not just talking about obvious ones
like strtok().

> I've also been involved in projects that existed and were broken long
> before I came on the scene.  Finding that something is broken when it
> wasn't caused by any recent change is painful.  And in large, old C++
> projects, memory management is convoluted.  Even when not broken,
> sometimes performance takes big hits because multiple threads are
> banging at the same cache line. (using thread affinity to make sure
> each thread uses a consistent core)  So you have a dozen memory
> allocation strategies, each with different rules and restrictions.

Absolutely. Something might have just-happened to work, even for
years. I came across something that would almost be brown-paper-bag
quality, in my own production code at work, today; the TCP socket
protocol between two processes wasn't properly looking for the newline
that marks the end of the message, and would have bugged out badly if
ever two messages had been combined into one socket-read call. With
everything working perfectly, like on a low-latency LAN where all our
testing happens, it was all safe, but if anything lagged out, it would
have meant messages got lost. Yet it's been working for years without
a noticed glitch.

>> or have I just been really really fortunate (growing up on OS/2
>> rather than Windows has definitely been, for me, a major boon)?
>> Generally, I find threads to be convenient, though of course not
>> always useful (especially in serialized languages).
>
> All my heavy duty multi-task/multi-thread stuff has been on Linux.

OS/2 is very Unix-like in many areas, but it doesn't have a convenient
fork(), so processes are rather more fiddly to spawn than threads are.
Actually, sliding from OS/2 to Linux has largely led to me sliding
from threads to processes - forking a process under Linux is as easy
as spinning off a thread under OS/2, and as easy to pass state to
(though (obviously) harder to pass state back from).

> And Python floats above most of the issues I'm trying to describe.
> E.g. you don't have to tune the memory management because it's out of
> your control. Still my bias persists.

Hrmmmmm.... yes, I can imagine running into difficulties with a
non-thread-safe malloc/free implementation... I think that would rate
about a 4 on the XKCD 883 scale.

ChrisA

[toc] | [prev] | [next] | [standalone]

#57490

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-25 03:19 +0000
Message-ID	<5269e359$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57487

On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:

> Hi,
> 
> We have a directory of large CSV files that we'd like to process in
> Python.
> 
> We process each input CSV, then generate a corresponding output CSV
> file.
> 
> input CSV -> munging text, lookups etc. -> output CSV
> 
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming

Start with the simplest thing that could work:

for infile, outfile in zip(input_list, output_list):
    for line in infile:
        munge line
        write to outfile

sort of thing. If and only if it isn't fast enough, then try to speed it 
up.

> For the reading, I'd
> 
>     with open('input.csv', 'r') as input, open('output.csv', 'w') as
>     output:
>         csv_writer = DictWriter(output)
>         for line in DictReader(input):
>             # Do some processing for that line... output =
>             process_line(line)
>             # Write output to file
>             csv_writer.writerow(output)

Looks good to me!

> So for the reading, it'll iterates over the lines one by one, and won't
> read it into memory which is good.
> 
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?

Have you read the csv_writer documentation? 

http://docs.python.org/2/library/csv.html#writer-objects

Unfortunately it is pretty light documentation, but it seems that 
writer.writerow *probably* just calls write on the underlying file object 
immediately. 

But there's really no way to tell when the data hits the disk platter: 
the Python file object could be doing caching, the OS could be doing 
caching, the file system could be doing caching, and even the disk itself 
could be doing caching. Really, the only way to be sure that the data has 
hit the disk platter is to call os.sync(), and even then some hard drives 
lie and report that they're synced when in fact the data is still in 
volatile cache. Bad hard drive, no biscuit.

But, really, do you need to care? Better to buy better hard drives (e.g. 
server grade), or use software RAID with two different brands (so their 
failure characteristics will be different), or just be prepared to re-
process a batch of files if the power goes out mid-run. (You do have a 
UPS, don't you?)

Anyway, I wouldn't bother about calling os.sync directly, the OS will 
sync when it needs to. But if you need it, it's there. Or you can call 
flush() on the output file, which is a bit less invasive than calling 
os.sync. But really I wouldn't bother. Let the OS handle it.

> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?

I shouldn't think so. 

> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite line -
> just a bunch of string splits and regexes.
> 
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?

Since this is likely to be I/O-bound, you could use threads. Each thread 
is responsible for reading the file, processing it, then writing it back 
again. Have you used threads before? If not, start here:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/
http://pymotw.com/2/threading/‎

If the amount of processing required becomes heavier, and the task 
becomes CPU-bound, you can either:

- move to an implementation of Python without the GIL, like 
  IronPython or Jython;

- or use multiprocessing.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#57493

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-25 04:46 +0100
Message-ID	<mailman.1496.1382672801.18130.python-list@python.org>
In reply to	#57487

On 25/10/2013 02:38, Victor Hooi wrote:
>
> So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.
>

Wow this is fantastic, which OS are you using?  Or do you actually mean 
that the whole file doesn't get read into memory, only one line at a 
time? :)

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

[toc] | [prev] | [next] | [standalone]

#57577

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-10-25 19:44 -0400
Message-ID	<mailman.1560.1382744694.18130.python-list@python.org>
In reply to	#57487

On Thu, 24 Oct 2013 18:38:21 -0700 (PDT), Victor Hooi
<victorhooi@gmail.com> declaimed the following:

>
>For the reading, I'd
>
>    with open('input.csv', 'r') as input, open('output.csv', 'w') as output:
>        csv_writer = DictWriter(output)
>        for line in DictReader(input):
>            # Do some processing for that line...
>            output = process_line(line)
>            # Write output to file
>            csv_writer.writerow(output)
>            
>So for the reading, it'll iterates over the lines one by one, and won't read it into memory which is good.
>
	My first comment would be: Do you really need the overhead of using
/dictionary/ reader/writer mode? Surely this processing isn't taking in
ad-hoc CSV files, is it? If all the inputs look the same using the regular
CSV reader may be faster (same for the writer) and your processing would be
using direct positional indexing rather than dictionary hashing/lookups.

>So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?
>
	Memory is cheap -- I/O is slow. <G> Just how massive are these CSV
files?

>Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite line - just a bunch of string splits and regexes.
>
>If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

	You are likely I/O bound, not CPU bound -- so multi-core doesn't really
affect things (neither would the GIL).

	You could maybe try using three Threads and a pair of Queues: reader,
process, writer. Limit the Queues to maybe 50-100 entries (tweak to liking
-- I run a task where using 100 entries results in a long "load" phase
before processing begins).

	Reader thread just reads entries from the input file and adds them to
the input queue. Processing thread takes entries from the input queue,
mashes them, and puts them onto the output queue. Writer thread, obviously,
takes items from the output queue and writes them to the file.

	Ideally, what should happen is, after a few seconds of the reader
monopolizing the CPU, you will have a backlog of records in the input
queue, and will initially block on "queue full", letting the processing
thread crunch some entries -- at some quantum the reader will get control,
find the queue is not full, and initiate the next read and blocks for the
I/O to complete; the processing thread gets control again and continues
number crunching. Hopefully, before the queue goes empty the reader will
complete the I/O and add the next entry to the queue. Same for output
queue.

	The main idea is that you don't block crunching while waiting for the
next record to be read or written. Crunching on blocks if the input runs
empty or the output fills up -- conditions in which the reader or writer
then gets control.

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

[toc] | [prev] | [next] | [standalone]

#57581

From	Roy Smith <roy@panix.com>
Date	2013-10-25 20:22 -0400
Message-ID	<roy-BE7CB6.20224425102013@news.panix.com>
In reply to	#57577

In article <mailman.1560.1382744694.18130.python-list@python.org>,
 Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:

> 	Memory is cheap -- I/O is slow. <G> Just how massive are these CSV
> files?

Actually, these days, the economics of hardware are more like, "CPU is 
cheap, memory is expensive".

I suppose it all depends on what kinds of problems you're solving, but 
my experience is I'm much more likely to run out of memory on big 
problems than I am to peg the CPU.  Also, pegging the CPU leads to 
well-behaved performance degradation.  Running out of memory leads to 
falling off a performance cliff as you start to page.

And, with the advent of large-scale SSD (you can get 1.6 TB SSD in 2.5 
inch form-factor!), I/O is as fast as you're willing to pay for :-)

[toc] | [prev] | [next] | [standalone]

#57597

From	Walter Hurry <walterhurry@lavabit.com>
Date	2013-10-26 08:53 +0000
Message-ID	<l4fvtf$792$1@news.albasani.net>
In reply to	#57487

On Thu, 24 Oct 2013 18:38:21 -0700, Victor Hooi wrote:

> Hi,
> 
> We have a directory of large CSV files that we'd like to process in
> Python.
> 
> We process each input CSV, then generate a corresponding output CSV
> file.
> 
> input CSV -> munging text, lookups etc. -> output CSV
> 
> My question is, what's the most Pythonic way of handling this? (Which
> I'm assuming
> 
> For the reading, I'd
> 
>     with open('input.csv', 'r') as input, open('output.csv', 'w') as
>     output:
>         csv_writer = DictWriter(output)
>         for line in DictReader(input):
>             # Do some processing for that line...
>             output = process_line(line)
>             # Write output to file csv_writer.writerow(output)
>             
> So for the reading, it'll iterates over the lines one by one, and won't
> read it into memory which is good.
> 
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?
> 
> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
> 
> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite line -
> just a bunch of string splits and regexes.
> 
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?
> 
I'm guessing that the idea is to load the output CSV into a database.

If that's the case, why not load the input CSV into some kind of staging 
table in the database first, and do the processing there?

[toc] | [prev] | [standalone]

csiph-web

Processing large CSV files - how to maximise throughput?

Contents

#57487 — Processing large CSV files - how to maximise throughput?

#57488

#57491

#57494

#57495

#57497

#57499

#57514

#57516

#57490

#57493

#57577

#57581

#57597