| Started by | smurfix@gmail.com |
|---|---|
| First post | 2014-06-22 02:46 -0700 |
| Last post | 2014-06-24 00:16 +0100 |
| Articles | 10 — 5 participants |
Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-22 02:46 -0700
Re: Python ORM library for distributed mostly-read-only objects? Roy Smith <roy@panix.com> - 2014-06-22 09:49 -0400
Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-22 21:26 -0700
Re: Python ORM library for distributed mostly-read-only objects? William Ray Wing <wrw@mac.com> - 2014-06-23 09:43 -0400
Re: Python ORM library for distributed mostly-read-only objects? Roy Smith <roy@panix.com> - 2014-06-23 11:11 -0400
Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-23 11:00 -0700
Re: Python ORM library for distributed mostly-read-only objects? Matthias Urlichs <matthias@urlichs.de> - 2014-06-23 19:42 +0200
Re: Python ORM library for distributed mostly-read-only objects? Lie Ryan <lie.1296@gmail.com> - 2014-06-23 16:54 +0100
Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-23 11:05 -0700
Re: Python ORM library for distributed mostly-read-only objects? Lie Ryan <lie.1296@gmail.com> - 2014-06-24 00:16 +0100
| From | smurfix@gmail.com |
|---|---|
| Date | 2014-06-22 02:46 -0700 |
| Subject | Python ORM library for distributed mostly-read-only objects? |
| Message-ID | <85659fdd-511b-4aea-9c4b-17a4bbb88662@googlegroups.com> |
My problem: I have a large database of interconnected objects which I need to process with a combination of short- and long-lived workers. These objects are mostly read-only (i.e. any of them can be changed or marked as deleted, but that happens infrequently). The workers may or may not be within one Python process, or even on one system.

I've been doing this with a "classic" session-based SQLAlchemy ORM approach, but that ends up way too slow and memory-intensive, as each thread gets its own copy of every object it needs. I don't want that.

My existing code does object loading and traversal by simple attribute access; I'd like to keep that if at all possible.

Ideally, what I'd like to have is an object server which mediates write access to the database and then sends change/invalidation notices to the workers. (Changes are infrequent enough that I don't care if a worker gets a notice it's not interested in.)

I don't care if updates are applied immediately or are only visible to the local process until committed. I also don't need fancy indexing or query abilities; if necessary I can go to the storage backend for that. (That should be SQL, though a NoSQL back-end would be nice to have.)

Does something like this already exist, somewhere out there, or do I need to write this, or does somebody know of an alternate solution?
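For illustration only, here is a minimal in-process sketch of the "object server plus invalidation notices" idea described above. Everything in it is an assumption made for the example: the names `ObjectServer` and `WorkerCache` are hypothetical, the "database" is a plain dict, and the notices are direct callbacks rather than network messages.

```python
import threading


class ObjectServer:
    """Hypothetical stand-in for the object server: it owns the store,
    mediates all writes, and notifies every subscriber of changed keys."""

    def __init__(self, store):
        self._store = dict(store)
        self._lock = threading.Lock()
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def load(self, key):
        # Return a copy so that workers never share mutable state with the store.
        with self._lock:
            return dict(self._store[key])

    def write(self, key, **changes):
        with self._lock:
            self._store[key].update(changes)
        # Broadcast to everybody; uninterested workers simply ignore the notice.
        for notify in list(self._subscribers):
            notify(key)


class WorkerCache:
    """Per-worker cache that drops an entry when the server invalidates it."""

    def __init__(self, server):
        self._server = server
        self._objects = {}
        self.loads = 0  # counts trips to the server, for demonstration
        server.subscribe(self.invalidate)

    def invalidate(self, key):
        self._objects.pop(key, None)

    def get(self, key):
        if key not in self._objects:
            self._objects[key] = self._server.load(key)
            self.loads += 1
        return self._objects[key]
```

A worker that never touched a key pays nothing for a notice about it beyond a failed dictionary `pop`, which matches the "I don't care if a worker gets a notice it's not interested in" requirement. A real deployment would replace the callbacks with some message transport, which this sketch does not attempt to model.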
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-22 09:49 -0400 |
| Message-ID | <roy-4AD1D3.09495322062014@news.panix.com> |
| In reply to | #73491 |
In article <85659fdd-511b-4aea-9c4b-17a4bbb88662@googlegroups.com>, smurfix@gmail.com wrote:

> My problem: I have a large database of interconnected objects which I need to process with a combination of short- and long-lived workers. These objects are mostly read-only (i.e. any of them can be changed/marked-as-deleted, but that happens infrequently). The workers may or may not be within one Python process, or even on one system.
>
> I've been doing this with a "classic" session-based SQLAlchemy ORM, approach, but that ends up way too slow and memory intense, as each thread gets its own copy of every object it needs. I don't want that.
>
> My existing code does object loading and traversal by simple attribute access; I'd like to keep that if at all possible.
>
> Ideally, what I'd like to have is an object server which mediates write access to the database and then sends change/invalidation notices to the workers. (Changes are infrequent enough that I don't care if a worker gets a notice it's not interested in.)
>
> I don't care if updates are applied immediately or are only visible to the local process until committed. I also don't need fancy indexing or query abilities; if necessary I can go to the storage backend for that. (That should be SQL, though a NoSQL back-end would be nice to have.)
>
> Does something like this already exist, somewhere out there, or do I need to write this, or does somebody know of an alternate solution?

If you want to go NoSQL, I think what you're describing is a MongoDB replica set (http://docs.mongodb.org/manual/replication/). One of the replicas is the primary, to which all writes are directed. You can have some number of secondaries, which get all the changes applied to the primary, and spread out the load for read access. If you want a vaguely SQLAlchemy flavored ORM, there's mongoengine (http://mongoengine.org/).

On the other hand, this may be overkill for what you're trying to do.
Can you give us some more quantitative idea of your requirements? How many objects? How much total data is being stored? How many queries per second, and what is the acceptable latency for a query?
| From | smurfix@gmail.com |
|---|---|
| Date | 2014-06-22 21:26 -0700 |
| Message-ID | <b1167ac3-4735-4d35-9f21-71abc9e5fb46@googlegroups.com> |
| In reply to | #73497 |
On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:

> Can you give us some more quantitative idea of your requirements? How many objects? How much total data is being stored? How many queries per second, and what is the acceptable latency for a query?

Not yet, A whole lot, More than fits in memory, That depends.

To explain. The data is a network of diverse related objects. I can keep the most-used objects in memory but not all of them. Indeed, I _need_ to keep them, otherwise this will be too slow, even when using Mongo instead of SQLAlchemy. Which objects are "most-used" changes over time.

I could work with MongoEngine by judicious hacking (augment DocumentField dereferencing with a local cache), but that leaves the update problem.
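As a rough sketch of "keep the most-used objects in memory and dereference through a local cache", the following uses only the standard library. It is not MongoEngine code: `LRUCache`, `Node`, `RAW`, and `load` are hypothetical names invented for the example, and the backing store is a small dict standing in for the database.

```python
from collections import OrderedDict


class LRUCache:
    """Keeps at most `capacity` objects; the least recently used is evicted."""

    def __init__(self, loader, capacity):
        self._loader = loader
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = self._loader(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently used
        return value

    def __contains__(self, key):
        return key in self._items


class Node:
    """Record whose reference fields hold ids, resolved on attribute access."""

    def __init__(self, cache, fields, refs):
        self._cache = cache
        self._fields = fields
        self._refs = refs

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._fields:
            return self._fields[name]
        if name in self._refs:
            return self._cache.get(self._refs[name])  # lazy dereference
        raise AttributeError(name)


# Hypothetical backing store: id -> (plain fields, references by id).
RAW = {
    1: ({"name": "root"}, {"child": 2}),
    2: ({"name": "leaf"}, {"parent": 1}),
}


def load(key):
    fields, refs = RAW[key]
    return Node(cache, fields, refs)


cache = LRUCache(load, capacity=2)
```

With this in place, `cache.get(1).child.name` walks the relationship by plain attribute access and loads object 2 only when it is first touched. Because references are stored as ids rather than as object pointers, evicting an object never leaves a dangling reference. The update problem is untouched by this sketch.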
| From | William Ray Wing <wrw@mac.com> |
|---|---|
| Date | 2014-06-23 09:43 -0400 |
| Message-ID | <mailman.11202.1403534666.18130.python-list@python.org> |
| In reply to | #73510 |
On Jun 23, 2014, at 12:26 AM, smurfix@gmail.com wrote:

> On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:
>
>> Can you give us some more quantitative idea of your requirements? How many objects? How much total data is being stored? How many queries per second, and what is the acceptable latency for a query?
>
> Not yet, A whole lot, More than fits in memory, That depends.
>
> To explain. The data is a network of diverse related objects. I can keep the most-used objects in memory but not all of them. Indeed, I _need_ to keep them, otherwise this will be too slow, even when using Mongo instead of SQLAlchemy. Which objects are "most-used" changes over time.

Are you sure it won’t fit in memory? Default server memory configs these days tend to start at 128 Gig, and scale to 256 or 384 Gig.

-Bill

> I could work with MongoEngine by judicious hacking (augment DocumentField dereferencing with a local cache), but that leaves the update problem.
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2014-06-23 11:11 -0400 |
| Message-ID | <roy-6910B9.11110623062014@news.panix.com> |
| In reply to | #73514 |
In article <mailman.11202.1403534666.18130.python-list@python.org>, William Ray Wing <wrw@mac.com> wrote:

> On Jun 23, 2014, at 12:26 AM, smurfix@gmail.com wrote:
>
> > On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:
> >
> >> Can you give us some more quantitative idea of your requirements? How many objects? How much total data is being stored? How many queries per second, and what is the acceptable latency for a query?
> >
> > Not yet, A whole lot, More than fits in memory, That depends.
> >
> > To explain. The data is a network of diverse related objects. I can keep the most-used objects in memory but not all of them. Indeed, I _need_ to keep them, otherwise this will be too slow, even when using Mongo instead of SQLAlchemy. Which objects are "most-used" changes over time.
>
> Are you sure it won't fit in memory? Default server memory configs these days tend to start at 128 Gig, and scale to 256 or 384 Gig.

I'm not sure what "default" means, but it's certainly possible to get machines with that much RAM.

On the other hand, even the amount of RAM on a single machine is not really a limit. There are very easy to use technologies these days (e.g. memcache) which let you build clusters to effectively aggregate the physical RAM from multiple machines. And database sharding lets you do a different flavor of memory aggregation.
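To make the "aggregate RAM from multiple machines" point concrete: the usual trick is that every client hashes a key to decide which machine holds it, so each machine stores only a slice of the data. The sketch below shows the simplest possible version, plain modulo hashing. The function name `pick_node` and the host names are made up for the example, and real cache clients often use more elaborate schemes (such as consistent hashing) so that adding a node moves fewer keys.

```python
import hashlib


def pick_node(key, nodes):
    """Map a key to one node using a stable hash (plain modulo sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical cluster of three cache hosts.
nodes = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]
keys = ["obj:%d" % i for i in range(1000)]
placement = {key: pick_node(key, nodes) for key in keys}
```

Every process that knows the same node list computes the same placement, so no central directory is needed. The built-in `hash()` is deliberately avoided here because its value for strings varies between interpreter runs.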
| From | smurfix@gmail.com |
|---|---|
| Date | 2014-06-23 11:00 -0700 |
| Message-ID | <c95e19ba-b219-4740-b0d9-221d8cac8055@googlegroups.com> |
| In reply to | #73515 |
memcache (or redis or ...) would be an option. However, I'm not going to go through the network plus deserialization for every object, that'd be too slow - thus I'd still need a local cache - which needs to be invalidated.
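A small sketch of the two-tier arrangement described here, assuming the remote cache hands back serialized bytes: the local tier keeps already-deserialized objects so that repeated reads cost neither a network trip nor a `pickle.loads`. `FakeRemoteCache` is a hypothetical in-memory stand-in, not a memcache or redis client, and how the `invalidate` call gets triggered is left open.

```python
import pickle


class FakeRemoteCache:
    """Hypothetical stand-in for memcache/redis: stores serialized bytes only."""

    def __init__(self):
        self._blobs = {}

    def set(self, key, obj):
        self._blobs[key] = pickle.dumps(obj)

    def get(self, key):
        return self._blobs[key]


class TwoTierCache:
    """Local dict of live objects in front of a remote byte store."""

    def __init__(self, remote):
        self._remote = remote
        self._local = {}
        self.deserializations = 0  # for demonstration only

    def get(self, key):
        try:
            return self._local[key]
        except KeyError:
            obj = pickle.loads(self._remote.get(key))
            self.deserializations += 1
            self._local[key] = obj
            return obj

    def invalidate(self, key):
        # Must be called when the remote copy changes, else reads go stale.
        self._local.pop(key, None)
```

The sketch also shows why the invalidation is not optional: after the remote value is overwritten, `get` keeps returning the old local object until `invalidate` is called for that key.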
| From | Matthias Urlichs <matthias@urlichs.de> |
|---|---|
| Date | 2014-06-23 19:42 +0200 |
| Message-ID | <mailman.11204.1403546385.18130.python-list@python.org> |
| In reply to | #73510 |
Hi,

William Ray Wing:
> Are you sure it won’t fit in memory? Default server memory configs these days tend to start at 128 Gig, and scale to 256 or 384 Gig.

I am not going to buy a new server. I can justify writing a lot of custom code for that kind of money.

Besides, the time to actually load all the data into memory beforehand would be prohibitive (so I'd still need a way to load referred data on demand), and the update problem remains.

--
Matthias Urlichs
| From | Lie Ryan <lie.1296@gmail.com> |
|---|---|
| Date | 2014-06-23 16:54 +0100 |
| Message-ID | <mailman.11203.1403538899.18130.python-list@python.org> |
| In reply to | #73491 |
On 22/06/14 10:46, smurfix@gmail.com wrote:

> I've been doing this with a "classic" session-based SQLAlchemy ORM, approach, but that ends up way too slow and memory intense, as each thread gets its own copy of every object it needs. I don't want that.

If you don't want each thread to have its own copy of the object, don't use a thread-scoped session. Use an explicit scope instead.
| From | smurfix@gmail.com |
|---|---|
| Date | 2014-06-23 11:05 -0700 |
| Message-ID | <9030a8c2-2a11-4ea8-a9f0-c23d31e0d925@googlegroups.com> |
| In reply to | #73516 |
On Monday, June 23, 2014 5:54:38 PM UTC+2, Lie Ryan wrote:

> If you don't want each thread to have their own copy of the object,
>
> Don't use thread-scoped session. Use explicit scope instead.

How would that work when multiple threads traverse the in-memory object structure and cause relationships to be loaded?
IIRC sqlalchemy's sessions are not thread safe.
| From | Lie Ryan <lie.1296@gmail.com> |
|---|---|
| Date | 2014-06-24 00:16 +0100 |
| Message-ID | <mailman.11206.1403565412.18130.python-list@python.org> |
| In reply to | #73519 |
On 23/06/14 19:05, smurfix@gmail.com wrote:

> On Monday, June 23, 2014 5:54:38 PM UTC+2, Lie Ryan wrote:
>
>> If you don't want each thread to have their own copy of the object,
>>
>> Don't use thread-scoped session. Use explicit scope instead.
>
> How would that work when multiple threads traverse the in-memory object structure and cause relationships to be loaded?
> IIRC sqlalchemy's sessions are not thread safe.

You're going to have that problem anyway: if, as you said, your problem is that you don't want each thread to have its own copy, then you cannot avoid having to deal with concurrent access.

Note that SQLAlchemy objects can be used from multiple threads as long as they are not used concurrently and the underlying DBAPI is thread-safe (not all DBAPIs supported by SQLAlchemy are thread-safe). You can detach/expunge an SQLAlchemy object from the session to avoid unexpected loading of relationships.

Alternatively, if you are tied to neither SQLAlchemy nor an SQL-based database, you might want to check out ZODB's ZEO (http://www.zodb.org/en/latest/documentation/guide/zeo.html):

> ZEO, Zope Enterprise Objects, extends the ZODB machinery to
> provide access to objects over a network. ... ClientStorage
> aggressively caches objects locally, so in order to avoid
> using stale data the ZEO server sends an invalidation message
> to all the connected ClientStorage instances on every write
> operation. ... As a result, reads from the database are
> far more frequent than writes, and ZEO is therefore better
> suited for read-intensive applications.

Warning: I have never used ZODB or ZEO personally.
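The "shared between threads, but never used concurrently" rule can be sketched with nothing more than a lock. This is a generic standard-library illustration, not SQLAlchemy code: the dictionary stands in for a detached object, and `SerializedAccess`, `worker`, and `run` are hypothetical names made up for the example.

```python
import threading


class SerializedAccess:
    """Hands one shared object to one thread at a time."""

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        return self._obj

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # do not swallow exceptions


def worker(shared, rounds):
    for _ in range(rounds):
        with shared as record:
            record["visits"] += 1  # safe: only one thread is inside at a time


def run(n_threads=8, rounds=1000):
    shared = SerializedAccess({"visits": 0})
    threads = [
        threading.Thread(target=worker, args=(shared, rounds))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    with shared as record:
        return record["visits"]
```

All threads see the same single copy, at the cost of waiting for each other. For mostly-read workloads that wait may be acceptable, but a single lock around a whole object graph can easily become a bottleneck, so this is a starting point rather than a recommendation.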