| From | Robert Wessel <robertwessel2@yahoo.com> |
|---|---|
| Newsgroups | comp.programming.threads |
| Subject | Re: Data copying on NUMA |
| Message-ID | <eqmn9956fmce2n8ua4s8k5233u9s5fr50a@4ax.com> |
| References | <XnsA287445E6980myfirstnameosapriee@216.196.109.131> |
| Organization | Forte Inc. http://www.forteinc.com/apn/ |
| Date | 2013-12-01 19:21 -0600 |
On Thu, 28 Nov 2013 16:25:12 -0600, Paavo Helde <myfirstname@osa.pri.ee> wrote:

>On NUMA, as the acronym says, some memory is better accessible by a certain
>NUMA node than others. Now, let's say I have a deep dynamic-allocated data
>structure I want to use in a thread running in another NUMA node, how
>should I pass it there? Should I perform the copy in the other thread so
>that the new dynamic allocations take place in the target thread? Probably
>depends on the memory allocator, what would be the best choice for
>Linux/Windows?

On Windows, for example, you can use the VirtualAllocExNuma API to allocate storage "near" a given node from any running code. By default the allocation will be in local memory for the allocating thread, which would not be optimal if another thread is going to do all the work on that area.

In general, allocating a structure close to the executing node is certainly a win for some applications. Yes, that's vague, but so is your problem statement. If a thread (or group of threads) is going to heavily use a structure that won't cache, running all of those threads in a single NUMA node, and allocating that structure in memory local to that node, can significantly improve the available bandwidth and latency for those threads, as well as consuming less of the global resources for other threads in the system.

>This is not a theoretical question, actually we see a large scaling
>performance drop on NUMA and have to decide whether to go multi-process
>somehow or are there some ways to make multi-threaded apps to behave
>better. As far as I understand there is a hard limit of 64 worker threads
>per process on Windows, so probably we have to go multi-process anyway at
>some time point. Any insights or comments?

There is no 64-threads-per-process limit in Windows. Perhaps such a thing existed in Win9x.
Win32 has a limit of 32 logical cores per machine (which has nothing in particular to do with the number of threads in a process), but that doesn't apply to Win64 (although if you run Win32 applications on Win64, only the first 32 cores in the first processor group are used to execute Win32 code).

If you're running multiple processes, then the system can fairly easily split the workload between nodes, as you're implicitly telling the system that the data is not shared, so the system will try to keep a process (and hence its allocations) on a particular node.

If you're running enough threads in a process that you're going to span more than one node, you're going to have to specify some of that manually, or structure things so that allocations only happen on the appropriate node (usually by only doing the required allocations on the actual threads using the allocated areas).
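
To make that last point concrete, here is a minimal sketch in C with POSIX threads of doing the deep copy on the thread that will use it. The `node` list type and the names `list_push`, `list_clone`, `list_free`, and `sum_on_worker` are made up for illustration and are not from any library. Whether the copy really ends up in node-local memory is an assumption: it depends on the worker being kept on one node (pinning is not shown here), on the OS placement policy, and on the allocator not handing back recycled memory that was first placed on another node.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "deep" structure: a linked list of heap-allocated strings. */
typedef struct node {
    char *text;
    struct node *next;
} node;

/* Prepend a copy of text; both allocations run on the calling thread. */
static node *list_push(node *head, const char *text)
{
    assert(text != NULL);
    size_t len = strlen(text) + 1;
    node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->text = malloc(len);
    if (n->text == NULL) abort();
    memcpy(n->text, text, len);
    n->next = head;
    return n;
}

/* Deep copy preserving order; every allocation is made by the caller. */
static node *list_clone(const node *src)
{
    node *head = NULL;
    node **tail = &head;
    for (; src != NULL; src = src->next) {
        *tail = list_push(NULL, src->text);
        tail = &(*tail)->next;
    }
    return head;
}

static void list_free(node *n)
{
    while (n != NULL) {
        node *next = n->next;
        free(n->text);
        free(n);
        n = next;
    }
}

typedef struct job {
    const node *shared;  /* read-only source built by another thread */
    size_t total_len;    /* result written by the worker */
} job;

/* The worker clones first, then does all further work on its own copy. */
static void *worker(void *arg)
{
    job *j = arg;
    node *local = list_clone(j->shared);  /* allocations happen on this thread */
    for (const node *n = local; n != NULL; n = n->next)
        j->total_len += strlen(n->text);
    list_free(local);
    return NULL;
}

/* Run the copy-and-process step on a new thread; returns 0 on success. */
static int sum_on_worker(const node *shared, size_t *total)
{
    job j = { shared, 0 };
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &j) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    *total = j.total_len;
    return 0;
}
```

Building a list of "alpha", "beta", "gamma" with `list_push` and passing it to `sum_on_worker` gives a total of 14. The one-time copy reads across nodes, but every later access by the worker goes to memory it allocated itself. On Windows the same structure would apply, with VirtualAllocExNuma as the explicit alternative when the allocating thread is not the one that will use the memory.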