Python ORM library for distributed mostly-read-only objects?

Started by	smurfix@gmail.com
First post	2014-06-22 02:46 -0700
Last post	2014-06-24 00:16 +0100
Articles	10 — 5 participants

  Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-22 02:46 -0700
    Re: Python ORM library for distributed mostly-read-only objects? Roy Smith <roy@panix.com> - 2014-06-22 09:49 -0400
      Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-22 21:26 -0700
        Re: Python ORM library for distributed mostly-read-only objects? William Ray Wing <wrw@mac.com> - 2014-06-23 09:43 -0400
          Re: Python ORM library for distributed mostly-read-only objects? Roy Smith <roy@panix.com> - 2014-06-23 11:11 -0400
            Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-23 11:00 -0700
        Re: Python ORM library for distributed mostly-read-only objects? Matthias Urlichs <matthias@urlichs.de> - 2014-06-23 19:42 +0200
    Re: Python ORM library for distributed mostly-read-only objects? Lie Ryan <lie.1296@gmail.com> - 2014-06-23 16:54 +0100
      Re: Python ORM library for distributed mostly-read-only objects? smurfix@gmail.com - 2014-06-23 11:05 -0700
        Re: Python ORM library for distributed mostly-read-only objects? Lie Ryan <lie.1296@gmail.com> - 2014-06-24 00:16 +0100

#73491 — Python ORM library for distributed mostly-read-only objects?

From	smurfix@gmail.com
Date	2014-06-22 02:46 -0700
Subject	Python ORM library for distributed mostly-read-only objects?
Message-ID	<85659fdd-511b-4aea-9c4b-17a4bbb88662@googlegroups.com>

My problem: I have a large database of interconnected objects which I need to process with a combination of short- and long-lived workers. These objects are mostly read-only (i.e. any of them can be changed/marked-as-deleted, but that happens infrequently). The workers may or may not be within one Python process, or even on one system.

I've been doing this with a "classic" session-based SQLAlchemy ORM approach, but that ends up far too slow and memory-intensive, since each thread gets its own copy of every object it needs. I don't want that.

My existing code does object loading and traversal by simple attribute access; I'd like to keep that if at all possible.

Ideally, what I'd like to have is an object server which mediates write access to the database and then sends change/invalidation notices to the workers. (Changes are infrequent enough that I don't care if a worker gets a notice it's not interested in.)

I don't care if updates are applied immediately or are only visible to the local process until committed. I also don't need fancy indexing or query abilities; if necessary I can go to the storage backend for that. (That should be SQL, though a NoSQL back-end would be nice to have.)

Does something like this already exist, somewhere out there, or do I need to write this, or does somebody know of an alternate solution?

#73497

From	Roy Smith <roy@panix.com>
Date	2014-06-22 09:49 -0400
Message-ID	<roy-4AD1D3.09495322062014@news.panix.com>
In reply to	#73491

In article <85659fdd-511b-4aea-9c4b-17a4bbb88662@googlegroups.com>,
 smurfix@gmail.com wrote:

> My problem: I have a large database of interconnected objects which I need to 
> process with a combination of short- and long-lived workers. These objects 
> are mostly read-only (i.e. any of them can be changed/marked-as-deleted, but 
> that happens infrequently). The workers may or may not be within one Python 
> process, or even on one system.
> 
> I've been doing this with a "classic" session-based SQLAlchemy ORM approach, 
> but that ends up way too slow and memory intense, as each thread gets its own 
> copy of every object it needs. I don't want that.
> 
> My existing code does object loading and traversal by simple attribute 
> access; I'd like to keep that if at all possible.
> 
> Ideally, what I'd like to have is an object server which mediates write 
> access to the database and then sends change/invalidation notices to the 
> workers. (Changes are infrequent enough that I don't care if a worker gets a 
> notice it's not interested in.)
> 
> I don't care if updates are applied immediately or are only visible to the 
> local process until committed. I also don't need fancy indexing or query 
> abilities; if necessary I can go to the storage backend for that. (That 
> should be SQL, though a NoSQL back-end would be nice to have.)
> 
> Does something like this already exist, somewhere out there, or do I need to 
> write this, or does somebody know of an alternate solution?

If you want to go NoSQL, I think what you're describing is a MongoDB 
replica set (http://docs.mongodb.org/manual/replication/).  One of the 
replicas is the primary, to which all writes are directed.  You can have 
some number of secondaries, which get all the changes applied to the 
primary, and spread out the load for read access.  If you want a vaguely 
SQLAlchemy flavored ORM, there's mongoengine (http://mongoengine.org/).

On the other hand, this may be overkill for what you're trying to do.  
Can you give us some more quantitative idea of your requirements?  How 
many objects?  How much total data is being stored?  How many queries 
per second, and what is the acceptable latency for a query?

#73510

From	smurfix@gmail.com
Date	2014-06-22 21:26 -0700
Message-ID	<b1167ac3-4735-4d35-9f21-71abc9e5fb46@googlegroups.com>
In reply to	#73497

On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:

> Can you give us some more quantitative idea of your requirements?  How 
> many objects?  How much total data is being stored?  How many queries 
> per second, and what is the acceptable latency for a query?

Not yet; a whole lot; more than fits in memory; that depends.

To explain: the data is a network of diverse related objects. I can keep the most-used objects in memory, but not all of them. Indeed, I _need_ to keep them in memory, otherwise this will be too slow, even when using Mongo instead of SQLAlchemy. Which objects are "most-used" changes over time.

I could work with MongoEngine by judicious hacking (augment DocumentField dereferencing with a local cache), but that leaves the update problem.
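
A minimal sketch of the kind of local cache described above, assuming a hypothetical `loader` callable that fetches one object from the storage backend. It is not MongoEngine or SQLAlchemy code: it keeps only the most recently used objects in memory, loads the rest on demand, and exposes an `invalidate` hook that an update notice could call.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUObjectCache:
    """Bounded local cache that loads missing objects on demand."""

    def __init__(self, loader: Callable[[Hashable], Any], max_size: int = 1000):
        self._loader = loader      # hypothetical: fetches one object from the backend
        self._max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        if key in self._items:
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        obj = self._loader(key)              # cache miss: go to the backend
        self._items[key] = obj
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used entry
        return obj

    def invalidate(self, key: Hashable) -> None:
        """Drop a cached object so that the next get() reloads it."""
        self._items.pop(key, None)

    def __len__(self) -> int:
        return len(self._items)
```

Because the eviction order follows actual use, the set of cached objects shifts as the "most-used" objects change over time. How invalidation notices reach each worker is left open here.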

#73514

From	William Ray Wing <wrw@mac.com>
Date	2014-06-23 09:43 -0400
Message-ID	<mailman.11202.1403534666.18130.python-list@python.org>
In reply to	#73510

On Jun 23, 2014, at 12:26 AM, smurfix@gmail.com wrote:

> On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:
> 
>> Can you give us some more quantitative idea of your requirements?  How 
>> many objects?  How much total data is being stored?  How many queries 
>> per second, and what is the acceptable latency for a query?
> 
> Not yet, A whole lot, More than fits in memory, That depends.
> 
> To explain. The data is a network of diverse related objects. I can keep the most-used objects in memory but not all of them. Indeed, I _need_ to keep them, otherwise this will be too slow, even when using Mongo instead of SQLAlchemy. Which objects are "most-used" changes over time.
> 

Are you sure it won’t fit in memory?  Default server memory configs these days tend to start at 128 Gig, and scale to 256 or 384 Gig.

-Bill


> I could work with MongoEngine by judicious hacking (augment DocumentField dereferencing with a local cache), but that leaves the update problem.
> -- 
> https://mail.python.org/mailman/listinfo/python-list

#73515

From	Roy Smith <roy@panix.com>
Date	2014-06-23 11:11 -0400
Message-ID	<roy-6910B9.11110623062014@news.panix.com>
In reply to	#73514

In article <mailman.11202.1403534666.18130.python-list@python.org>,
 William Ray Wing <wrw@mac.com> wrote:

> On Jun 23, 2014, at 12:26 AM, smurfix@gmail.com wrote:
> 
> > On Sunday, June 22, 2014 3:49:53 PM UTC+2, Roy Smith wrote:
> > 
> >> Can you give us some more quantitative idea of your requirements?  How 
> >> many objects?  How much total data is being stored?  How many queries 
> >> per second, and what is the acceptable latency for a query?
> > 
> > Not yet, A whole lot, More than fits in memory, That depends.
> > 
> > To explain. The data is a network of diverse related objects. I can keep 
> > the most-used objects in memory but not all of them. Indeed, I _need_ to 
> > keep them, otherwise this will be too slow, even when using Mongo instead 
> > of SQLAlchemy. Which objects are "most-used" changes over time.
> > 
> 
> Are you sure it won't fit in memory?  Default server memory configs these 
> days tend to start at 128 Gig, and scale to 256 or 384 Gig.

I'm not sure what "default" means, but it's certainly possible to get 
machines with that much RAM.  On the other hand, even the amount of RAM 
on a single machine is not really a limit.  There are very easy-to-use 
technologies these days (e.g. memcache) which let you build clusters to 
effectively aggregate the physical RAM from multiple machines.  And 
database sharding lets you do a different flavor of memory aggregation.

#73518

From	smurfix@gmail.com
Date	2014-06-23 11:00 -0700
Message-ID	<c95e19ba-b219-4740-b0d9-221d8cac8055@googlegroups.com>
In reply to	#73515

memcache (or redis or ...) would be an option. However, I'm not going to go through the network plus deserialization for every object, that'd be too slow - thus I'd still need a local cache - which needs to be invalidated.

#73517

From	Matthias Urlichs <matthias@urlichs.de>
Date	2014-06-23 19:42 +0200
Message-ID	<mailman.11204.1403546385.18130.python-list@python.org>
In reply to	#73510

Hi,

William Ray Wing:
> Are you sure it won’t fit in memory?  Default server memory configs these days tend to start at 128 Gig, and scale to 256 or 384 Gig.
> 
I am not going to buy a new server. I can justify writing a lot of custom
code for that kind of money.

Besides, the time to actually load all the data into memory beforehand
would be prohibitive (so I'd still need a way to load referred data on
demand), and the update problem remains.

-- 
-- Matthias Urlichs

#73516

From	Lie Ryan <lie.1296@gmail.com>
Date	2014-06-23 16:54 +0100
Message-ID	<mailman.11203.1403538899.18130.python-list@python.org>
In reply to	#73491

On 22/06/14 10:46, smurfix@gmail.com wrote:
>
> I've been doing this with a "classic" session-based SQLAlchemy ORM approach, but that ends up way too slow and memory intense, as each thread gets its own copy of every object it needs. I don't want that.

If you don't want each thread to have its own copy of the object, 
don't use a thread-scoped session. Use an explicit scope instead.

#73519

From	smurfix@gmail.com
Date	2014-06-23 11:05 -0700
Message-ID	<9030a8c2-2a11-4ea8-a9f0-c23d31e0d925@googlegroups.com>
In reply to	#73516

On Monday, June 23, 2014 5:54:38 PM UTC+2, Lie Ryan wrote:

> If you don't want each thread to have their own copy of the object, 
> 
> Don't use thread-scoped session. Use explicit scope instead.

How would that work when multiple threads traverse the in-memory object structure and cause relationships to be loaded?

IIRC SQLAlchemy's sessions are not thread-safe.

#73523

From	Lie Ryan <lie.1296@gmail.com>
Date	2014-06-24 00:16 +0100
Message-ID	<mailman.11206.1403565412.18130.python-list@python.org>
In reply to	#73519

On 23/06/14 19:05, smurfix@gmail.com wrote:
> On Monday, June 23, 2014 5:54:38 PM UTC+2, Lie Ryan wrote:
>
>> If you don't want each thread to have their own copy of the object,
>>
>> Don't use thread-scoped session. Use explicit scope instead.
>
> How would that work when multiple threads traverse the in-memory object structure and cause relationships to be loaded?
> IIRC sqlalchemy's sessions are not thread safe.

You're going to have that problem anyway: if, as you said, the issue is 
that you don't want each thread to have its own copy, then you cannot 
avoid dealing with concurrent access. Note that SQLAlchemy objects can 
be used from multiple threads as long as they are not used concurrently 
and the underlying DBAPI is thread-safe (not all DBAPIs supported by 
SQLAlchemy are thread-safe). You can detach/expunge an SQLAlchemy object 
from the session to avoid unexpected loading of relationships.
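
One way to read "not used concurrently" is to serialize access with a lock. The sketch below is generic standard-library code, not SQLAlchemy API: `Serialized` and `worker` are hypothetical names, and a plain dict stands in for whatever shared object the threads would touch.

```python
import threading
from contextlib import contextmanager


class Serialized:
    """Hands out one shared object to a single thread at a time."""

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.Lock()

    @contextmanager
    def use(self):
        # Only one thread can be inside this block at any moment.
        with self._lock:
            yield self._obj


shared = Serialized({"hits": 0})   # stand-in for a shared, non-thread-safe object


def worker():
    for _ in range(1000):
        with shared.use() as obj:
            obj["hits"] += 1


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off is that every access now waits on a single lock, so this only helps if the time spent holding the shared object is short.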

Alternatively, if you are not tied to SQLAlchemy or an SQL-based database, 
then you might want to check out ZODB's ZEO 
(http://www.zodb.org/en/latest/documentation/guide/zeo.html):

 > ZEO, Zope Enterprise Objects, extends the ZODB machinery to
 > provide access to objects over a network. ... ClientStorage
 > aggressively caches objects locally, so in order to avoid
 > using stale data the ZEO server sends an invalidation message
 > to all the connected ClientStorage instances on every write
 > operation. ...  As a result, reads from the database are
 > far more frequent than writes, and ZEO is therefore better
 > suited for read-intensive applications.

Warning: I have never used ZODB or ZEO personally.
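
To make the invalidation idea from the quoted description concrete, here is a toy in-process model. It does not use the ZEO API; `ObjectServer` and `WorkerCache` are hypothetical names, and a real setup would send the notices over the network rather than through direct callbacks.

```python
from typing import Any, Callable, Dict, List


class ObjectServer:
    """Toy stand-in for a server that mediates all writes."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self._listeners: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._listeners.append(callback)

    def read(self, key: str) -> Any:
        return self._store[key]

    def write(self, key: str, value: Any) -> None:
        self._store[key] = value
        # Broadcast to every worker, whether or not it has the key cached.
        for notify in self._listeners:
            notify(key)


class WorkerCache:
    """Per-worker cache that drops entries when the server invalidates them."""

    def __init__(self, server: ObjectServer) -> None:
        self._server = server
        self._local: Dict[str, Any] = {}
        server.subscribe(self._on_invalidate)

    def _on_invalidate(self, key: str) -> None:
        self._local.pop(key, None)

    def get(self, key: str) -> Any:
        if key not in self._local:
            self._local[key] = self._server.read(key)   # load on demand
        return self._local[key]
```

Reads stay local after the first load, and a write costs one notice per connected worker, which is cheap as long as writes are rare.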
