Multiple scripts versus single multi-threaded script

Started by: JL <lightaiyee@gmail.com>
First post: 2013-10-03 09:01 -0700
Last post: 2013-10-04 02:42 +1000
Articles: 11 (6 participants)



Contents

  Multiple scripts versus single multi-threaded script JL <lightaiyee@gmail.com> - 2013-10-03 09:01 -0700
    Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 12:41 -0400
      Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 02:50 +1000
        Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 14:28 -0400
          Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 04:36 +1000
            Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 15:53 -0400
              Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 08:22 +1000
      Re: Multiple scripts versus single multi-threaded script Dave Angel <davea@davea.name> - 2013-10-03 18:40 +0000
      Re: Multiple scripts versus single multi-threaded script Jeremy Sanders <jeremy@jeremysanders.net> - 2013-10-04 10:02 +0200
      Re: Multiple scripts versus single multi-threaded script Grant Edwards <invalid@invalid.invalid> - 2013-10-04 16:38 +0000
    Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 02:42 +1000

#55415 — Multiple scripts versus single multi-threaded script

From: JL <lightaiyee@gmail.com>
Date: 2013-10-03 09:01 -0700
Subject: Multiple scripts versus single multi-threaded script
Message-ID: <f01b2e7a-9fc7-4138-bb6e-447d31179f2d@googlegroups.com>

What is the difference between running multiple python scripts and a single multi-threaded script? May I know what are the pros and cons of each approach? Right now, my preference is to run multiple separate python scripts because it is simpler.



#55421

From: Roy Smith <roy@panix.com>
Date: 2013-10-03 12:41 -0400
Message-ID: <roy-451497.12415103102013@news.panix.com>
In reply to: #55415

In article <f01b2e7a-9fc7-4138-bb6e-447d31179f2d@googlegroups.com>,
 JL <lightaiyee@gmail.com> wrote:

> What is the difference between running multiple python scripts and a single 
> multi-threaded script? May I know what are the pros and cons of each 
> approach? Right now, my preference is to run multiple separate python scripts 
> because it is simpler.

First, let's take a step back and think about multi-threading vs. 
multi-processing in general (i.e. in any language).

Threads are lighter-weight.  That means it's faster to start a new 
thread (compared to starting a new process), and a thread consumes fewer 
system resources than a process.  If you have lots of short-lived tasks 
to run, this can be significant.  If each task will run for a long time 
and do a lot of computation, the cost of startup becomes less of an 
issue because it's amortized over the longer run time.

Threads can communicate with each other in ways that processes can't.  
For example, file descriptors are shared by all the threads in a 
process, so one thread can open a file (or accept a network connection), 
then hand the descriptor off to another thread for processing.  Threads 
also make it easy to share large amounts of data because they all have 
access to the same memory.  You can do this between processes with 
shared memory segments, but it's more work to set up.
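
As a rough illustration of the sharing described above, here is a minimal Python 3 sketch. The names `worker`, `results`, and `contents` are made up for this example. The main thread opens a temporary file and hands it to a second thread, which writes to it and reports back through an ordinary list that both threads can see.

```python
import tempfile
import threading

results = []    # an ordinary list, visible to every thread in the process

def worker(fileobj, text):
    # The worker uses a file that a different thread opened...
    fileobj.write(text)
    fileobj.flush()
    # ...and reports back through memory shared with the main thread.
    results.append(len(text))

with tempfile.TemporaryFile(mode="w+") as f:
    t = threading.Thread(target=worker, args=(f, "hello from the worker\n"))
    t.start()
    t.join()        # wait for the worker before reading what it wrote
    f.seek(0)
    contents = f.read()

print(repr(contents), results)
```

Two separate scripts could not do this as directly: each would need its own open file, and the result would have to travel over a pipe, socket, file, or shared memory segment.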

The downside to threads is that all of this sharing makes them much 
more complicated to use properly.  You have to be aware of how all the 
threads are interacting, and mediate access to shared resources.  If you 
do that wrong, you get memory corruption, deadlocks, and all sorts of 
(extremely) difficult to debug problems.  A lot of the really hairy 
problems (i.e. things like one thread continuing to use memory which 
another thread has freed) are solved by using a high-level language like 
Python which handles all the memory allocation for you, but you can 
still get deadlocks and data corruption.

So, the full answer to your question is very complicated.  However, if 
you're looking for a short answer, I'd say just keep doing what you're 
doing using multiple processes and don't get into threading.



#55424

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-10-04 02:50 +1000
Message-ID: <mailman.684.1380819470.18130.python-list@python.org>
In reply to: #55421

On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
> The downside to threads is that all of this sharing makes them much
> more complicated to use properly.  You have to be aware of how all the
> threads are interacting, and mediate access to shared resources.  If you
> do that wrong, you get memory corruption, deadlocks, and all sorts of
> (extremely) difficult to debug problems.  A lot of the really hairy
> problems (i.e. things like one thread continuing to use memory which
> another thread has freed) are solved by using a high-level language like
> Python which handles all the memory allocation for you, but you can
> still get deadlocks and data corruption.

With CPython, you don't have any headaches like that; you have one
very simple protection, a Global Interpreter Lock (GIL), which
guarantees that no two threads will execute Python code
simultaneously. No corruption, no deadlocks, no hairy problems.

ChrisA



#55436

From: Roy Smith <roy@panix.com>
Date: 2013-10-03 14:28 -0400
Message-ID: <roy-D617DD.14283203102013@news.panix.com>
In reply to: #55424

In article <mailman.684.1380819470.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
> > The downside to threads is that all of this sharing makes them much
> > more complicated to use properly.  You have to be aware of how all the
> > threads are interacting, and mediate access to shared resources.  If you
> > do that wrong, you get memory corruption, deadlocks, and all sorts of
> > (extremely) difficult to debug problems.  A lot of the really hairy
> > problems (i.e. things like one thread continuing to use memory which
> > another thread has freed) are solved by using a high-level language like
> > Python which handles all the memory allocation for you, but you can
> > still get deadlocks and data corruption.
> 
> With CPython, you don't have any headaches like that; you have one
> very simple protection, a Global Interpreter Lock (GIL), which
> guarantees that no two threads will execute Python code
> simultaneously. No corruption, no deadlocks, no hairy problems.
> 
> ChrisA

Well, the GIL certainly eliminates a whole range of problems, but it's 
still possible to write code that deadlocks.  All that's really needed 
is for two threads to try to acquire the same two resources, in 
different orders.  I'm running the following code right now.  It appears 
to be doing a pretty good imitation of a deadlock.  Any similarity to 
current political events is purely intentional.

import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()

class House(threading.Thread):
    def run(self):
        print("House starting...")
        lock1.acquire()
        time.sleep(1)    # give Senate time to take lock2
        lock2.acquire()  # blocks forever: Senate holds lock2
        print("House running")
        lock2.release()
        lock1.release()

class Senate(threading.Thread):
    def run(self):
        print("Senate starting...")
        lock2.acquire()
        time.sleep(1)    # give House time to take lock1
        lock1.acquire()  # blocks forever: House holds lock1
        print("Senate running")
        lock1.release()
        lock2.release()

h = House()
s = Senate()

h.start()
s.start()

Similarly, I can have data corruption.  I can't get memory corruption in 
the way you can get in a C/C++ program, but I can certainly have one 
thread produce data for another thread to consume, and then 
(incorrectly) continue to mutate that data after it relinquishes 
ownership.

Let's say I have a Queue.  A producer thread pushes work units onto the 
Queue and a consumer thread pulls them off the other end.  If my 
producer thread does something like:

work = {'id': 1, 'data': "The Larch"}
my_queue.put(work)
work['id'] = 3

I've got a race condition where the consumer thread may get an id of 
either 1 or 3, depending on exactly when it reads the data from its end 
of the queue (more precisely, exactly when it uses that data).

Here's a somewhat different example of data corruption between threads:

import threading
import random
import sys

sketch = "The Dead Parrot"

class T1(threading.Thread):
    def run(self):
        # Remember the starting value, then spin until another thread changes it.
        current_sketch = str(sketch)
        while True:
            if sketch != current_sketch:
                print("Blimey, it's changed!")
                return

class T2(threading.Thread):
    def run(self):
        global sketch
        sketches = ["Piranha Brothers",
                    "Spanish Inquisition",
                    "Lumberjack"]
        # Keep rebinding the shared global with no coordination at all.
        while True:
            sketch = random.choice(sketches)

t1 = T1()
t2 = T2()
t2.daemon = True    # let the program exit even though T2 never returns

t1.start()
t2.start()

t1.join()
sys.exit()



#55437

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-10-04 04:36 +1000
Message-ID: <mailman.691.1380825390.18130.python-list@python.org>
In reply to: #55436

On Fri, Oct 4, 2013 at 4:28 AM, Roy Smith <roy@panix.com> wrote:
> Well, the GIL certainly eliminates a whole range of problems, but it's
> still possible to write code that deadlocks.  All that's really needed
> is for two threads to try to acquire the same two resources, in
> different orders.  I'm running the following code right now.  It appears
> to be doing a pretty good imitation of a deadlock.  Any similarity to
> current political events is purely intentional.

Right. Sorry, I meant that the GIL protects you from all that
happening in the lower level code (even lower than the Senate, here),
but yes, you can get deadlocks as soon as you acquire locks. That's
nothing to do with threading; you can have the same issues with
databases, file systems, or anything else that lets you lock
something. It's a LOT easier to deal with deadlocks or data corruption
that occurs in pure Python code than in C, since Python has awesome
introspection facilities... and you're guaranteed that corrupt data is
still valid Python objects.
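
One common way to avoid the House/Senate style of deadlock is to make every thread acquire the locks in the same order. The following is only a sketch in Python 3, and the `chamber` function and `finished` list are invented for illustration. Both threads take `lock1` before `lock2`, so neither can hold one lock while waiting forever for the other.

```python
import threading
import time

lock1 = threading.Lock()
lock2 = threading.Lock()
finished = []

def chamber(name):
    # Every thread takes lock1 first and lock2 second. Whoever gets
    # lock1 is guaranteed to be able to get lock2 as well.
    with lock1:
        time.sleep(0.1)
        with lock2:
            finished.append(name)

threads = [threading.Thread(target=chamber, args=(n,))
           for n in ("House", "Senate")]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)   # the timeout is only a safety net for the demo

print(sorted(finished))
```

The `with` statements also release each lock if the body raises an exception, which the explicit `acquire()`/`release()` pairs would not.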

As to your corrupt data example, though, I'd advocate a very simple
system of object ownership: as soon as the object has been put on the
queue, it's "owned" by the recipient and shouldn't be mutated by
anyone else. That kind of system generally isn't hard to maintain.
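
If the producer cannot be trusted to leave the object alone, one option (an assumption here, not something proposed in the thread) is to put a copy on the queue. This sketch uses the Python 3 module name `queue` (it was `Queue` in Python 2), and the `consumer` function and `seen` list are made up for the example.

```python
import copy
import queue
import threading

my_queue = queue.Queue()
seen = []

def consumer():
    work = my_queue.get()
    seen.append(work["id"])

work = {"id": 1, "data": "The Larch"}
my_queue.put(copy.deepcopy(work))   # the consumer owns this copy from here on
work["id"] = 3                      # only changes the producer's own dict

t = threading.Thread(target=consumer)
t.start()
t.join()

print(seen)
```

Copying costs time and memory for large work units, so the ownership convention alone is often the better trade when all the code is under your control.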

ChrisA



#55443

From: Roy Smith <roy@panix.com>
Date: 2013-10-03 15:53 -0400
Message-ID: <roy-2604AC.15533803102013@news.panix.com>
In reply to: #55437

In article <mailman.691.1380825390.18130.python-list@python.org>,
 Chris Angelico <rosuav@gmail.com> wrote:

> As to your corrupt data example, though, I'd advocate a very simple
> system of object ownership: as soon as the object has been put on the
> queue, it's "owned" by the recipient and shouldn't be mutated by
> anyone else.

Well, sure.  I agree with you that threading in Python is about a 
zillion times easier to manage than threading in C/C++, but there are 
still things you need to think about when using threading in Python 
which you don't need to think about if you're not using threading at 
all.  Transfer of ownership when you put something on a queue is one of 
those things.

So, I think my original statement:

> if you're looking for a short answer, I'd say just keep doing what 
> you're doing using multiple processes and don't get into threading.

is still good advice for somebody who isn't sure they need threads.

On the other hand, for somebody who is interested in learning about 
threads, Python is a great platform to learn because you get to 
experiment with the basic high-level concepts without getting bogged 
down in pthreads minutiae.  And, as Chris pointed out, if you get it 
wrong, at least you've still got valid Python objects to puzzle over, 
not a smoking pile of bits on the floor.



#55445

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-10-04 08:22 +1000
Message-ID: <mailman.697.1380838966.18130.python-list@python.org>
In reply to: #55443

On Fri, Oct 4, 2013 at 5:53 AM, Roy Smith <roy@panix.com> wrote:
> So, I think my original statement:
>
>> if you're looking for a short answer, I'd say just keep doing what
>> you're doing using multiple processes and don't get into threading.
>
> is still good advice for somebody who isn't sure they need threads.
>
> On the other hand, for somebody who is interested in learning about
> threads, Python is a great platform to learn because you get to
> experiment with the basic high-level concepts without getting bogged
> down in pthreads minutiae.  And, as Chris pointed out, if you get it
> wrong, at least you've still got valid Python objects to puzzle over,
> not a smoking pile of bits on the floor.

Agree wholeheartedly to both halves. I was just explaining a similar
concept to my brother last night, with regard to network/database
request handling:

1) The simplest code starts, executes, and finishes, with no threads,
fork(), shared state, or other confusions. Execution can
be completely predicted by eyeballing the source code. You can pretend
that you have a dedicated CPU core that does nothing but run your
program.

2) Threaded code adds a measure of complexity that you have to get
your head around. Now you need to concern yourself with preemption,
multiple threads doing things in different orders, locking, shared
state, etc, etc. But you can still pretend that the execution of one
job will happen as a single "thing", top down, with predictable
intermediate state, if you like. (Python's threading and multiprocessing
modules both follow this style; they just have different levels of
shared state.)

3) Asynchronous code adds significantly more "get your head around"
complexity, since you now have to retain state for multiple
jobs/requests in the same thread. You can't use local variables to
keep track of where you're up to. Most likely, your code will do some
tiny thing, update the state object for that request, fire off an
asynchronous request of your own (maybe to the hard disk, with a
callback when the data's read/written), and then return, back to some
main loop.
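
Style #3 can be sketched in a few lines of Python 3. This is a toy, not a real event loop: the names `ready`, `start_request`, `on_data_read`, and `main_loop` are invented, and the "disk read" completes simply by queueing the next callback. The point is that each request's progress lives in a state object rather than in local variables, and every callback does one small thing before returning to the loop.

```python
from collections import deque

ready = deque()     # callbacks waiting to run, stored as (function, state)
log = []

def start_request(state):
    state["step"] = "reading"
    # Pretend to start a disk read; its completion callback is queued here.
    ready.append((on_data_read, state))

def on_data_read(state):
    state["step"] = "done"
    log.append((state["id"], state["step"]))

def main_loop():
    # Each callback does one tiny thing and returns control to this loop.
    while ready:
        callback, state = ready.popleft()
        callback(state)

for request_id in (1, 2, 3):
    ready.append((start_request, {"id": request_id, "step": "new"}))
main_loop()

print(log)
```

All three requests are started before any of them finishes, yet only one thread ever runs.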

Now imagine you have a database written in style #1, and you have to
drag it, kicking and screaming, into the 21st century. Oh look, it's
easy! All you have to do is start multiple threads doing the same job!
And then you'll have some problems with simultaneous edits, so you put
some big fat locks all over the place to prevent two threads from
doing the same thing at the same time. Even if one of those threads
was handling something interactive and might hold its lock for some
number of minutes. Suboptimal design, maybe, but hey, it works, right?
That's what my brother has to deal with every day, as a user of said
database... :|

ChrisA



#55438

From: Dave Angel <davea@davea.name>
Date: 2013-10-03 18:40 +0000
Message-ID: <mailman.692.1380825636.18130.python-list@python.org>
In reply to: #55421

On 3/10/2013 12:50, Chris Angelico wrote:

> On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
>> The downside to threads is that all of this sharing makes them much
>> more complicated to use properly.  You have to be aware of how all the
>> threads are interacting, and mediate access to shared resources.  If you
>> do that wrong, you get memory corruption, deadlocks, and all sorts of
>> (extremely) difficult to debug problems.  A lot of the really hairy
>> problems (i.e. things like one thread continuing to use memory which
>> another thread has freed) are solved by using a high-level language like
>> Python which handles all the memory allocation for you, but you can
>> still get deadlocks and data corruption.
>
> With CPython, you don't have any headaches like that; you have one
> very simple protection, a Global Interpreter Lock (GIL), which
> guarantees that no two threads will execute Python code
> simultaneously. No corruption, no deadlocks, no hairy problems.
>
> ChrisA

The GIL takes care of the gut-level interpreter issues like reference
counts for shared objects.  But it does not avoid deadlock or hairy
problems.  I'll just show one, trivial, problem, but many others exist.

If two threads process the same global variable as follows,
    myglobal = myglobal + 1

Then you have no guarantee that the value will really get incremented
twice.  Presumably there's a mutex/critsection function in the threading
module that can make this safe, but once you use it in two different
places, you raise the possibility of deadlock.
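
The mutex in question is `threading.Lock`. A minimal Python 3 sketch of the safe version follows. The names `myglobal_lock` and `add_many` and the thread and iteration counts are arbitrary choices for the example.

```python
import threading

myglobal = 0
myglobal_lock = threading.Lock()

def add_many(n):
    global myglobal
    for _ in range(n):
        # Only one thread at a time can be inside this block, so the
        # read, add, and write back cannot interleave with another thread's.
        with myglobal_lock:
            myglobal = myglobal + 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(myglobal)
```

With a single lock there is nothing to deadlock against. The risk appears once code holds one lock while trying to take a second.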

On the other hand, if you're careful to have the thread use only data
that is unique to that thread, then it would seem to be safe.  However,
you still have the same risk if you call some library that wasn't
written to be thread safe.  I'll assume that print() and suchlike are
safe, but some third party library could well use the equivalent of a
global variable in an unsafe way.



-- 
DaveA



#55455

From: Jeremy Sanders <jeremy@jeremysanders.net>
Date: 2013-10-04 10:02 +0200
Message-ID: <mailman.710.1380873727.18130.python-list@python.org>
In reply to: #55421

Roy Smith wrote:

> Threads are lighter-weight.  That means it's faster to start a new
> thread (compared to starting a new process), and a thread consumes fewer
> system resources than a process.  If you have lots of short-lived tasks
> to run, this can be significant.  If each task will run for a long time
> and do a lot of computation, the cost of startup becomes less of an
> issue because it's amortized over the longer run time.

This might be true on Windows, but I think on Linux process overheads are 
pretty similar to threads, e.g.
http://stackoverflow.com/questions/807506/threads-vs-processes-in-linux

Combined with the lack of a GIL-conflict, processes can be pretty efficient.
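
Startup cost is easy to measure on a given machine. The sketch below (Python 3, with made-up function names) times starting and joining one thread against launching one fresh interpreter, which is roughly what running a separate script costs. Note that it measures interpreter startup as well as process creation, so it says more about "multiple scripts" than about a bare fork, and the numbers will vary by OS and hardware.

```python
import subprocess
import sys
import threading
import time

def time_thread_start():
    start = time.perf_counter()
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
    return time.perf_counter() - start

def time_script_start():
    start = time.perf_counter()
    # Launch a new interpreter that does nothing and wait for it to exit.
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start

thread_seconds = time_thread_start()
script_seconds = time_script_start()
print("thread: %.6f s, new interpreter: %.6f s"
      % (thread_seconds, script_seconds))
```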

Jeremy



#55483

From: Grant Edwards <invalid@invalid.invalid>
Date: 2013-10-04 16:38 +0000
Message-ID: <l2mqt7$q4u$1@reader1.panix.com>
In reply to: #55421

On 2013-10-03, Roy Smith <roy@panix.com> wrote:

> Threads are lighter-weight.  That means it's faster to start a new 
> thread (compared to starting a new process), and a thread consumes
> fewer system resources than a process.

That's true, but the extent to which it's true varies considerably
from one OS to another.  Starting processes is typically very cheap on
Unix systems.  On Linux a thread and a process are actually both
started by the same system call, and the only significant difference
is how some of the new page descriptors are set up (they're
copy-on-write instead of shared).

On other OSes, starting a process is _way_ more expensive/slow than
starting a thread.  That was very true for VMS, so one suspects it
might also be true for its stepchild MS-Windows.

-- 
Grant Edwards               grant.b.edwards        Yow! RELATIVES!!
                                  at               
                              gmail.com            



#55422

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-10-04 02:42 +1000
Message-ID: <mailman.682.1380818535.18130.python-list@python.org>
In reply to: #55415

On Fri, Oct 4, 2013 at 2:01 AM, JL <lightaiyee@gmail.com> wrote:
> What is the difference between running multiple python scripts and a single multi-threaded script? May I know what are the pros and cons of each approach? Right now, my preference is to run multiple separate python scripts because it is simpler.

(Caveat: The below is based on CPython. If you're using IronPython,
Jython, or some other implementation, some details may be a little
different.)

Multiple threads can share state easily by simply referencing each
other's variables, but the cost of that is that they'll never actually
execute simultaneously. If you want your scripts to run in parallel on
multiple CPUs/cores, you need multiple processes. But if you're doing
something I/O bound (like servicing sockets), threads work just fine.
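
To see the I/O-bound case, here is a small Python 3 sketch in which `time.sleep` stands in for waiting on a socket (the `fake_io` function is hypothetical). A blocking wait releases the GIL, so four waits overlap instead of running back to back.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(n):
    time.sleep(0.2)     # the GIL is released while this thread waits
    return n * 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

# Run one after another, the four calls would take about 0.8 seconds.
print(results, round(elapsed, 2))
```

For CPU-bound work in CPython, the same shape of code with `concurrent.futures.ProcessPoolExecutor` (or separate scripts) is what lets the work spread across cores. The worker function then has to be importable at module level, and on some platforms the pool must be created under an `if __name__ == "__main__":` guard.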

As to using separate scripts versus the multiprocessing module, that's
purely a matter of what looks cleanest. Do whatever suits your code.

ChrisA
