Groups > comp.lang.python > #55415 > unrolled thread
| Started by | JL <lightaiyee@gmail.com> |
|---|---|
| First post | 2013-10-03 09:01 -0700 |
| Last post | 2013-10-04 02:42 +1000 |
| Articles | 11 — 6 participants |
Multiple scripts versus single multi-threaded script JL <lightaiyee@gmail.com> - 2013-10-03 09:01 -0700
Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 12:41 -0400
Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 02:50 +1000
Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 14:28 -0400
Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 04:36 +1000
Re: Multiple scripts versus single multi-threaded script Roy Smith <roy@panix.com> - 2013-10-03 15:53 -0400
Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 08:22 +1000
Re: Multiple scripts versus single multi-threaded script Dave Angel <davea@davea.name> - 2013-10-03 18:40 +0000
Re: Multiple scripts versus single multi-threaded script Jeremy Sanders <jeremy@jeremysanders.net> - 2013-10-04 10:02 +0200
Re: Multiple scripts versus single multi-threaded script Grant Edwards <invalid@invalid.invalid> - 2013-10-04 16:38 +0000
Re: Multiple scripts versus single multi-threaded script Chris Angelico <rosuav@gmail.com> - 2013-10-04 02:42 +1000
| From | JL <lightaiyee@gmail.com> |
|---|---|
| Date | 2013-10-03 09:01 -0700 |
| Subject | Multiple scripts versus single multi-threaded script |
| Message-ID | <f01b2e7a-9fc7-4138-bb6e-447d31179f2d@googlegroups.com> |
What is the difference between running multiple python scripts and a single multi-threaded script? May I know what are the pros and cons of each approach? Right now, my preference is to run multiple separate python scripts because it is simpler.
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-03 12:41 -0400 |
| Message-ID | <roy-451497.12415103102013@news.panix.com> |
| In reply to | #55415 |
In article <f01b2e7a-9fc7-4138-bb6e-447d31179f2d@googlegroups.com>,
JL <lightaiyee@gmail.com> wrote:
> What is the difference between running multiple python scripts and a single
> multi-threaded script? May I know what are the pros and cons of each
> approach? Right now, my preference is to run multiple separate python scripts
> because it is simpler.
First, let's take a step back and think about multi-threading vs. multi-processing in general (i.e. in any language).
Threads are lighter-weight. That means it's faster to start a new thread (compared to starting a new process), and a thread consumes fewer system resources than a process. If you have lots of short-lived tasks to run, this can be significant. If each task will run for a long time and do a lot of computation, the cost of startup becomes less of an issue because it's amortized over the longer run time.
Threads can communicate with each other in ways that processes can't. For example, file descriptors are shared by all the threads in a process, so one thread can open a file (or accept a network connection), then hand the descriptor off to another thread for processing. Threads also make it easy to share large amounts of data because they all have access to the same memory. You can do this between processes with shared memory segments, but it's more work to set up.
The downside to threads is that all of this sharing makes them much more complicated to use properly. You have to be aware of how all the threads are interacting, and mediate access to shared resources. If you do that wrong, you get memory corruption, deadlocks, and all sorts of (extremely) difficult to debug problems. A lot of the really hairy problems (i.e. things like one thread continuing to use memory which another thread has freed) are solved by using a high-level language like Python which handles all the memory allocation for you, but you can still get deadlocks and data corruption.
So, the full answer to your question is very complicated. However, if you're looking for a short answer, I'd say just keep doing what you're doing using multiple processes and don't get into threading.
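To make the sharing point concrete, here is a minimal sketch, assuming Python 3 (the thread itself predates it). The names `shared` and `worker` are made up for illustration. A thread appends directly to a list owned by the main thread, while a separate process has to send its result back through an explicit channel, in this case its standard output.

```python
import subprocess
import sys
import threading

shared = []  # lives in this process's memory (illustrative name)


def worker():
    # A thread sees the very same list object as the main thread.
    shared.append("from thread")


t = threading.Thread(target=worker)
t.start()
t.join()

# A separate process has its own memory, so its data must be passed
# back explicitly; here it travels through the child's standard output.
child = subprocess.run(
    [sys.executable, "-c", "print('from process')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
shared.append(child.stdout.strip())

print(shared)
```

Pipes are only one possible channel; sockets, files, or the multiprocessing module's queues would serve the same purpose.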
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-04 02:50 +1000 |
| Message-ID | <mailman.684.1380819470.18130.python-list@python.org> |
| In reply to | #55421 |
On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
> The downside to threads is that all of this sharing makes them much
> more complicated to use properly. You have to be aware of how all the
> threads are interacting, and mediate access to shared resources. If you
> do that wrong, you get memory corruption, deadlocks, and all sorts of
> (extremely) difficult to debug problems. A lot of the really hairy
> problems (i.e. things like one thread continuing to use memory which
> another thread has freed) are solved by using a high-level language like
> Python which handles all the memory allocation for you, but you can
> still get deadlocks and data corruption.
With CPython, you don't have any headaches like that; you have one very simple protection, a Global Interpreter Lock (GIL), which guarantees that no two threads will execute Python code simultaneously. No corruption, no deadlocks, no hairy problems.
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-03 14:28 -0400 |
| Message-ID | <roy-D617DD.14283203102013@news.panix.com> |
| In reply to | #55424 |
In article <mailman.684.1380819470.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
> > The downside to threads is that all of this sharing makes them much
> > more complicated to use properly. You have to be aware of how all the
> > threads are interacting, and mediate access to shared resources. If you
> > do that wrong, you get memory corruption, deadlocks, and all sorts of
> > (extremely) difficult to debug problems. A lot of the really hairy
> > problems (i.e. things like one thread continuing to use memory which
> > another thread has freed) are solved by using a high-level language like
> > Python which handles all the memory allocation for you, but you can
> > still get deadlocks and data corruption.
>
> With CPython, you don't have any headaches like that; you have one
> very simple protection, a Global Interpreter Lock (GIL), which
> guarantees that no two threads will execute Python code
> simultaneously. No corruption, no deadlocks, no hairy problems.
>
> ChrisA
Well, the GIL certainly eliminates a whole range of problems, but it's
still possible to write code that deadlocks. All that's really needed
is for two threads to try to acquire the same two resources, in
different orders. I'm running the following code right now. It appears
to be doing a pretty good imitation of a deadlock. Any similarity to
current political events is purely intentional.
    import threading
    import time

    lock1 = threading.Lock()
    lock2 = threading.Lock()

    class House(threading.Thread):
        def run(self):
            print("House starting...")
            lock1.acquire()
            time.sleep(1)
            lock2.acquire()   # blocks forever: Senate already holds lock2
            print("House running")
            lock2.release()
            lock1.release()

    class Senate(threading.Thread):
        def run(self):
            print("Senate starting...")
            lock2.acquire()
            time.sleep(1)
            lock1.acquire()   # blocks forever: House already holds lock1
            print("Senate running")
            lock1.release()
            lock2.release()

    h = House()
    s = Senate()
    h.start()
    s.start()
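One common way out of this particular deadlock is to make every thread take the locks in the same order. The sketch below is one possible arrangement under that assumption, written for Python 3; `chamber` and `finished` are illustrative names, not part of the code above.

```python
import threading

lock1 = threading.Lock()
lock2 = threading.Lock()
finished = []  # illustrative: records which threads got through


def chamber(name):
    # Every thread takes lock1 first and lock2 second, so no thread can
    # ever hold lock2 while waiting for lock1.
    with lock1:
        with lock2:
            finished.append(name)


threads = [threading.Thread(target=chamber, args=(n,))
           for n in ("House", "Senate")]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)

print(sorted(finished))
```

The `with` statement also releases each lock if the body raises, which the explicit acquire()/release() pairs do not.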
Similarly, I can have data corruption. I can't get memory corruption in
the way you can get in a C/C++ program, but I can certainly have one
thread produce data for another thread to consume, and then
(incorrectly) continue to mutate that data after it relinquishes
ownership.
Let's say I have a Queue. A producer thread pushes work units onto the
Queue and a consumer thread pulls them off the other end. If my
producer thread does something like:
work = {'id': 1, 'data': "The Larch"}
my_queue.put(work)
work['id'] = 3
I've got a race condition where the consumer thread may get an id of
either 1 or 3, depending on exactly when it reads the data from its end
of the queue (more precisely, exactly when it uses that data).
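One way to close that race, sketched here for Python 3 (where the module is named `queue`), is to hand the consumer its own copy so that later changes by the producer cannot reach it. The `consumer` function and `received` list are illustrative, and copying is only one option; simply not touching the object after the put works too.

```python
import copy
import queue
import threading

my_queue = queue.Queue()
received = []  # illustrative: what the consumer actually saw


def consumer():
    received.append(my_queue.get())


work = {'id': 1, 'data': "The Larch"}
my_queue.put(copy.deepcopy(work))  # the consumer now owns this copy
work['id'] = 3                     # only the producer's copy changes

t = threading.Thread(target=consumer)
t.start()
t.join()

print(received[0]['id'])
```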
Here's a somewhat different example of data corruption between threads:
    import threading
    import random
    import sys

    sketch = "The Dead Parrot"

    class T1(threading.Thread):
        def run(self):
            current_sketch = str(sketch)
            # Spin until some other thread changes the shared global.
            while True:
                if sketch != current_sketch:
                    print("Blimey, it's changed!")
                    return

    class T2(threading.Thread):
        def run(self):
            global sketch
            sketches = ["Piranha Brothers",
                        "Spanish Inquisition",
                        "Lumberjack"]
            # Keep overwriting the global that T1 is watching.
            while True:
                sketch = random.choice(sketches)

    t1 = T1()
    t2 = T2()
    t2.daemon = True
    t1.start()
    t2.start()
    t1.join()
    sys.exit()
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-04 04:36 +1000 |
| Message-ID | <mailman.691.1380825390.18130.python-list@python.org> |
| In reply to | #55436 |
On Fri, Oct 4, 2013 at 4:28 AM, Roy Smith <roy@panix.com> wrote:
> Well, the GIL certainly eliminates a whole range of problems, but it's
> still possible to write code that deadlocks. All that's really needed
> is for two threads to try to acquire the same two resources, in
> different orders. I'm running the following code right now. It appears
> to be doing a pretty good imitation of a deadlock. Any similarity to
> current political events is purely intentional.
Right. Sorry, I meant that the GIL protects you from all that happening in the lower level code (even lower than the Senate, here), but yes, you can get deadlocks as soon as you acquire locks. That's nothing to do with threading, you can have the same issues with databases, file systems, or anything else that lets you lock something. It's a LOT easier to deal with deadlocks or data corruption that occurs in pure Python code than in C, since Python has awesome introspection facilities... and you're guaranteed that corrupt data is still valid Python objects.
As to your corrupt data example, though, I'd advocate a very simple system of object ownership: as soon as the object has been put on the queue, it's "owned" by the recipient and shouldn't be mutated by anyone else. That kind of system generally isn't hard to maintain.
ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-03 15:53 -0400 |
| Message-ID | <roy-2604AC.15533803102013@news.panix.com> |
| In reply to | #55437 |
In article <mailman.691.1380825390.18130.python-list@python.org>,
Chris Angelico <rosuav@gmail.com> wrote:
> As to your corrupt data example, though, I'd advocate a very simple
> system of object ownership: as soon as the object has been put on the
> queue, it's "owned" by the recipient and shouldn't be mutated by
> anyone else.
Well, sure. I agree with you that threading in Python is about a zillion times easier to manage than threading in C/C++, but there are still things you need to think about when using threading in Python which you don't need to think about if you're not using threading at all. Transfer of ownership when you put something on a queue is one of those things.
So, I think my original statement:
> if you're looking for a short answer, I'd say just keep doing what
> you're doing using multiple processes and don't get into threading.
is still good advice for somebody who isn't sure they need threads.
On the other hand, for somebody who is interested in learning about threads, Python is a great platform to learn because you get to experiment with the basic high-level concepts without getting bogged down in pthreads minutiae. And, as Chris pointed out, if you get it wrong, at least you've still got valid Python objects to puzzle over, not a smoking pile of bits on the floor.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-04 08:22 +1000 |
| Message-ID | <mailman.697.1380838966.18130.python-list@python.org> |
| In reply to | #55443 |
On Fri, Oct 4, 2013 at 5:53 AM, Roy Smith <roy@panix.com> wrote:
> So, I think my original statement:
>
>> if you're looking for a short answer, I'd say just keep doing what
>> you're doing using multiple processes and don't get into threading.
>
> is still good advice for somebody who isn't sure they need threads.
>
> On the other hand, for somebody who is interested in learning about
> threads, Python is a great platform to learn because you get to
> experiment with the basic high-level concepts without getting bogged
> down in pthreads minutiae. And, as Chris pointed out, if you get it
> wrong, at least you've still got valid Python objects to puzzle over,
> not a smoking pile of bits on the floor.
Agree wholeheartedly to both halves. I was just explaining a similar concept to my brother last night, with regard to network/database request handling:
1) The simplest code starts, executes, and finishes, with no threads, fork(), shared state, or other confusions. Execution can be completely predicted by eyeballing the source code. You can pretend that you have a dedicated CPU core that does nothing but run your program.
2) Threaded code adds a measure of complexity that you have to get your head around. Now you need to concern yourself with preemption, multiple threads doing things in different orders, locking, shared state, etc, etc. But you can still pretend that the execution of one job will happen as a single "thing", top down, with predictable intermediate state, if you like. (Python's threading and multiprocessing modules both follow this style, they just have different levels of shared state.)
3) Asynchronous code adds significantly more "get your head around" complexity, since you now have to retain state for multiple jobs/requests in the same thread. You can't use local variables to keep track of where you're up to. Most likely, your code will do some tiny thing, update the state object for that request, fire off an asynchronous request of your own (maybe to the hard disk, with a callback when the data's read/written), and then return, back to some main loop.
Now imagine you have a database written in style #1, and you have to drag it, kicking and screaming, into the 21st century. Oh look, it's easy! All you have to do is start multiple threads doing the same job! And then you'll have some problems with simultaneous edits, so you put some big fat locks all over the place to prevent two threads from doing the same thing at the same time. Even if one of those threads was handling something interactive and might hold its lock for some number of minutes. Suboptimal design, maybe, but hey, it works, right?
That's what my brother has to deal with every day, as a user of said database... :|
ChrisA
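Style 3 can be hard to picture, so here is a toy sketch of it, assuming Python 3. It is not any real framework: `Request`, `handle`, `pending`, and `log` are all invented for illustration, and a plain deque stands in for the main loop's list of ready callbacks. Each call does one small step, records its position in a per-request state object, schedules its own continuation, and returns.

```python
from collections import deque


class Request:
    """Per-request state that has to survive between callbacks."""

    def __init__(self, name):
        self.name = name
        self.step = 0


pending = deque()  # callbacks waiting for the main loop
log = []           # order in which the steps actually ran


def handle(request):
    # Do one small piece of work, record where we are, then return.
    request.step += 1
    log.append((request.name, request.step))
    if request.step < 2:
        # Stand-in for "fire off an asynchronous request with a callback".
        pending.append(lambda: handle(request))


for name in ("A", "B"):
    req = Request(name)
    pending.append(lambda req=req: handle(req))

# Main loop: run whichever callback is ready next.
while pending:
    pending.popleft()()

print(log)  # [('A', 1), ('B', 1), ('A', 2), ('B', 2)]
```

The two requests interleave in a single thread, and the only record of how far each one has got is the `step` attribute on its state object.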
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-10-03 18:40 +0000 |
| Message-ID | <mailman.692.1380825636.18130.python-list@python.org> |
| In reply to | #55421 |
On 3/10/2013 12:50, Chris Angelico wrote:
> On Fri, Oct 4, 2013 at 2:41 AM, Roy Smith <roy@panix.com> wrote:
>> The downside to threads is that all of this sharing makes them much
>> more complicated to use properly. You have to be aware of how all the
>> threads are interacting, and mediate access to shared resources. If you
>> do that wrong, you get memory corruption, deadlocks, and all sorts of
>> (extremely) difficult to debug problems. A lot of the really hairy
>> problems (i.e. things like one thread continuing to use memory which
>> another thread has freed) are solved by using a high-level language like
>> Python which handles all the memory allocation for you, but you can
>> still get deadlocks and data corruption.
>
> With CPython, you don't have any headaches like that; you have one
> very simple protection, a Global Interpreter Lock (GIL), which
> guarantees that no two threads will execute Python code
> simultaneously. No corruption, no deadlocks, no hairy problems.
>
> ChrisA
The GIL takes care of the gut-level interpreter issues like reference
counts for shared objects. But it does not avoid deadlock or hairy
problems. I'll just show one, trivial, problem, but many others exist.
If two threads process the same global variable as follows,
myglobal = myglobal + 1
Then you have no guarantee that the value will really get incremented
twice. Presumably there's a mutex/critsection function in the threading
module that can make this safe, but once you use it in two different
places, you raise the possibility of deadlock.
On the other hand, if you're careful to have the thread use only data
that is unique to that thread, then it would seem to be safe. However,
you still have the same risk if you call some library that wasn't
written to be thread safe. I'll assume that print() and suchlike are
safe, but some third party library could well use the equivalent of a
global variable in an unsafe way.
--
DaveA
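The threading module does provide such a mutex: `threading.Lock`. A minimal sketch of the guarded increment follows, assuming Python 3; `guard` and `bump` are illustrative names. Only one lock is involved, so this particular example cannot deadlock.

```python
import threading

myglobal = 0
guard = threading.Lock()  # illustrative name for the mutex


def bump(times):
    global myglobal
    for _ in range(times):
        # The read, the add, and the store happen as one unit.
        with guard:
            myglobal = myglobal + 1


threads = [threading.Thread(target=bump, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(myglobal)  # 20000
```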
| From | Jeremy Sanders <jeremy@jeremysanders.net> |
|---|---|
| Date | 2013-10-04 10:02 +0200 |
| Message-ID | <mailman.710.1380873727.18130.python-list@python.org> |
| In reply to | #55421 |
Roy Smith wrote:
> Threads are lighter-weight. That means it's faster to start a new
> thread (compared to starting a new process), and a thread consumes fewer
> system resources than a process. If you have lots of short-lived tasks
> to run, this can be significant. If each task will run for a long time
> and do a lot of computation, the cost of startup becomes less of an
> issue because it's amortized over the longer run time.
This might be true on Windows, but I think on Linux process overheads are pretty similar to threads, e.g.
http://stackoverflow.com/questions/807506/threads-vs-processes-in-linux
Combined with the lack of a GIL-conflict, processes can be pretty efficient.
Jeremy
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2013-10-04 16:38 +0000 |
| Message-ID | <l2mqt7$q4u$1@reader1.panix.com> |
| In reply to | #55421 |
On 2013-10-03, Roy Smith <roy@panix.com> wrote:
> Threads are lighter-weight. That means it's faster to start a new
> thread (compared to starting a new process), and a thread consumes
> fewer system resources than a process.
That's true, but the extent to which it's true varies considerably
from one OS to another. Starting processes is typically very cheap on
Unix systems. On Linux a thread and a process are actually both
started by the same system call, and the only significant difference
is how some of the new page descriptors are set up (they're
copy-on-write instead of shared).
On other OSes, starting a process is _way_ more expensive/slow than
starting a thread. That was very true for VMS, so one suspects it
might also be true for its stepchild MS-Windows.
--
Grant Edwards grant.b.edwards Yow! RELATIVES!!
at
gmail.com
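For anyone who wants to see the difference on their own system, the following is a rough measurement sketch, assuming Python 3; `time_thread` and `time_process` are illustrative helpers. Note the caveat: the process figure includes starting a whole new Python interpreter, which is what running a separate script costs, and not just the kernel-level process creation discussed above. Single measurements like these are noisy, so no particular numbers should be expected.

```python
import subprocess
import sys
import threading
import time


def time_thread():
    # Start a thread that does nothing and wait for it to finish.
    start = time.perf_counter()
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()
    return time.perf_counter() - start


def time_process():
    # Start a fresh interpreter that does nothing and wait for it.
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


thread_seconds = time_thread()
process_seconds = time_process()
print("thread:  %.6f s" % thread_seconds)
print("process: %.6f s" % process_seconds)
```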
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-04 02:42 +1000 |
| Message-ID | <mailman.682.1380818535.18130.python-list@python.org> |
| In reply to | #55415 |
On Fri, Oct 4, 2013 at 2:01 AM, JL <lightaiyee@gmail.com> wrote:
> What is the difference between running multiple python scripts and a single multi-threaded script? May I know what are the pros and cons of each approach? Right now, my preference is to run multiple separate python scripts because it is simpler.
(Caveat: The below is based on CPython. If you're using IronPython, Jython, or some other implementation, some details may be a little different.)
Multiple threads can share state easily by simply referencing each other's variables, but the cost of that is that they'll never actually execute simultaneously. If you want your scripts to run in parallel on multiple CPUs/cores, you need multiple processes. But if you're doing something I/O bound (like servicing sockets), threads work just fine.
As to using separate scripts versus the multiprocessing module, that's purely a matter of what looks cleanest. Do whatever suits your code.
ChrisA
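The I/O-bound case can be sketched as follows, assuming Python 3 and using `time.sleep` as a stand-in for waiting on a socket (`fake_io` is an illustrative name). Because a sleeping thread releases the GIL, the four waits overlap, and the total comes out close to one wait, not four.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(n):
    time.sleep(0.2)  # sleeping releases the GIL, like waiting on a socket
    return n


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results, "in %.2f s" % elapsed)
```

For CPU-bound work the same shape would not speed up under CPython's GIL; that is the case where separate processes are needed.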