If you’ve ever written a data migration on a large table in Django, you may have run into problems running out of memory on the machine running the migration. This post is a collection of strategies for reducing memory usage during Django migrations. They’re ordered from simpler to more complex, so try the strategies at the top first.
In order to gain a rough sense of the magnitude of effect that each of these techniques can have, I created a toy dataset of blog posts and authors. The model looks like this:
```python
class BlogPost(models.Model):
    title = models.CharField(max_length=255)
    author = models.CharField(max_length=60)
```
I inserted four million rows of test “blog posts” into a database using:
```sql
INSERT INTO blogpost_blogpost (title, author)
SELECT md5(random()::text), md5(random()::text)
FROM (SELECT generate_series(0, 4000000)) as t;
```
I’ll include an example using this dataset for each technique. I’m running them on a VM with 3GB of RAM.
Django querysets offer an update() method which will update every item matched by the queryset in one database query. In other words, you can replace:
```python
for obj in MyModel.objects.filter(owner=bob):
    obj.owner = sharon
    obj.save()
```
with something like MyModel.objects.filter(owner=bob).update(owner=sharon).
Use update() when you’re just making a simple update to some of your model’s fields; in that case, calling update() is far more efficient than iterating over a queryset with a for-loop.
This won’t work in all cases; if you need to make more complex changes to your data, or to alter values on related objects, update() won’t help.
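One middle ground worth noting: when the new value is a simple function of the old one, update() can still do the work in a single query by using F() expressions. A minimal sketch, assuming a hypothetical MyModel with an integer view_count field (not part of the BlogPost model used elsewhere in this post):

```python
def bump_view_counts():
    # Imports live inside the function so this sketch can be read (and
    # imported) without a configured Django project. The app path
    # myapp.models is hypothetical.
    from django.db.models import F
    from myapp.models import MyModel

    # Runs a single UPDATE ... SET view_count = view_count + 1;
    # no model instances are loaded into Python memory.
    return MyModel.objects.all().update(view_count=F('view_count') + 1)
```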
The following runs out of memory:
```python
In : for b in BlogPost.objects.all():
...:     b.author = "Me, myself, and I"
...:     b.save()
...:
Segmentation fault (core dumped)
```
But this doesn’t:
```python
In : BlogPost.objects.all().update(author="Me, myself, and I")
Out: 4000000
```
In fact, the Django shell’s memory usage doesn’t change during execution of the query.
Django querysets normally perform a certain amount of caching in the background; this is usually a good thing because it prevents unnecessary extra queries to the database. However, in the case of a migration on a table with a large number of rows, the queryset cache can use up a lot of memory.
To work around this, querysets offer an iterator() method which forces the queryset not to cache any data. Since data migrations generally iterate through every row in the database exactly once, this is a perfect use case for iterator().
To use it, change your migration function to look something like this:
```python
for obj in MyModel.objects.iterator():
    # do something with obj
```
Use iterator() when you have a more complex use case that update() can’t solve, and you’re sure that you only need to iterate through all your objects once.
iterator() causes the queryset not to cache any data, so it can hurt performance if used improperly. Think carefully before using it outside of migrations.
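As a rough pure-Python analogy for the difference (a deliberate simplification, not Django’s actual implementation): a plain queryset behaves like building a full list of rows up front so it can be reused, while iterator() behaves like consuming a generator one row at a time:

```python
def rows():
    # Stand-in for the database driver handing back rows one at a time.
    return ({'id': i, 'author': 'anon'} for i in range(100000))

# Roughly what a plain queryset does: materialize every row and keep
# them all alive, so the queryset can be iterated again without a
# second database query.
cached = list(rows())

# Roughly what iterator() does: handle each row and let it become
# garbage-collectable immediately; nothing is kept for a second pass.
processed = 0
for row in rows():
    processed += 1  # "do something with obj"
```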
This runs to completion using the test data above. However, it still causes the Django shell process to allocate roughly 500MB of additional memory, and it’s much slower than the update() example above:
```python
for b in BlogPost.objects.all().iterator():
    b.author = "Me, myself, and I"
    b.save()
```
When DEBUG = True, versions of Django prior to 1.8 would log all executed queries in memory, which would eventually cause the Django process to run out of memory during migrations that made numerous queries. This is less relevant now because Django 1.8 caps the number of logged queries at 9000. However, turning off DEBUG may still be helpful in certain circumstances.
Turn off DEBUG when you’re running a Django version older than 1.8, have really large objects, or are working in a very memory-constrained environment.
This only affects development environments, because you shouldn’t be running with DEBUG = True in production. You should also upgrade if you’re running a version of Django older than 1.8; older versions are no longer supported.
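If you’re stuck with DEBUG = True on an older Django during a long-running migration, the logged-query list can also be cleared by hand with django.db.reset_queries(). A sketch, assuming a hypothetical MyModel (the every-1000-rows interval is arbitrary):

```python
def migrate_with_query_log_resets():
    # Imports live inside the function so this sketch can be read
    # without a configured Django project; myapp.models is hypothetical.
    from django import db
    from myapp.models import MyModel

    for i, obj in enumerate(MyModel.objects.iterator()):
        obj.author = "Me, myself, and I"
        obj.save()
        if i % 1000 == 0:
            # Clear the per-connection list of logged queries so it
            # can't grow without bound while DEBUG is on.
            db.reset_queries()
```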
I tried running the previous iterator() example with DEBUG set to True and then False on Django 1.8 to see if there was a noticeable difference in memory usage, but there wasn’t. Perhaps with much larger objects it might make a difference, but it seems like Django 1.8 has made this a non-solution in many cases.
Sometimes iterator() isn’t enough to prevent memory problems when dealing with large tables; this is most likely to happen when you have a more complex data model that causes each row to contain a lot of data. In that case, the queryset only() method lets you restrict which fields Django loads from the database.
In other words, instead of running a query similar to SELECT * FROM myapp_mymodel;, using only('field1') will result in SQL like SELECT id, field1 FROM myapp_mymodel; (the primary key always gets selected so Django can keep track of which object is which). If you have a lot of fields on your model, or fields that load a lot of data into memory, then modifying your migration to look something like this can save memory (and is likely to speed things up):
```python
for obj in MyModel.objects.only('field1').iterator():
    # do something with obj
```
In this case, the remaining fields on each MyModel instance are deferred, which means they will be loaded lazily by Django whenever they are accessed. If you never use these fields then they won’t be loaded at all.
.defer() does the opposite, causing only the fields passed to it to be deferred. For example, defer('field1') will defer field1, but no other fields.
Use only() or defer() when your migration uses only a small number of model fields and you’ve tried all the strategies above.
You need to be careful that your migration doesn’t inadvertently use any of the deferred fields; each time you do this, Django will make another query to the database to pull in the deferred data, which could cause a significant slowdown compared to loading all necessary model fields in one query.
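To make that caveat concrete, here is a sketch of the failure mode, using the BlogPost model from this post (the import path is hypothetical):

```python
def count_title_words():
    from myapp.models import BlogPost  # hypothetical import path

    total = 0
    for post in BlogPost.objects.only('author').iterator():
        # Pitfall: 'title' was not listed in only(), so each access
        # below triggers one extra SELECT per row -- millions of extra
        # queries on the test data set. Listing 'title' in only()
        # would avoid this.
        total += len(post.title.split())
    return total
```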
I added a defer() call to the previous iterator() example:
```python
for b in BlogPost.objects.all().defer('title').iterator():
    b.author = "Me, myself, and I"
    b.save()
```
On my toy data set, deferring the unused field reduced the additional memory allocated by the Django shell process from roughly 500MB to about 300MB, a 40% reduction. This seems reasonable, since each object consists of an integer and two strings of equal size, and we’re omitting one of the strings.
Depending on which database you’re using, even if you disable all Django data caching, you may still run into memory problems if the underlying database driver caches data. In this case, you may need to access the database driver directly in order to use a server-side cursor instead. Luckily, it looks like native support for server-side cursors is under active development, so this feature may be available in Django itself soon.
Use a server-side cursor only when you have truly gigantic amounts of data and nothing else has worked.
In order to go this route you’ll have to drop into writing raw SQL, which can be less pleasant and harder to maintain than working through the Django ORM, so only try this as a last resort.
I wouldn’t recommend this except in extreme cases, but there is a good example at the link below.
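For reference, here is a minimal sketch of what the raw approach can look like with psycopg2's named (server-side) cursors, assuming a PostgreSQL database and the toy blogpost_blogpost table from above; this is an illustration, not production-ready migration code:

```python
def stream_blogposts(batch_size=2000):
    # Import inside the function so the sketch reads without a
    # configured Django project.
    from django.db import connection

    connection.ensure_connection()
    # connection.connection is the underlying psycopg2 connection.
    # Giving the cursor a name makes it server-side: rows are fetched
    # from PostgreSQL in batches of itersize instead of all at once.
    with connection.connection.cursor(name='migration_cursor') as cursor:
        cursor.itersize = batch_size
        cursor.execute('SELECT id, title, author FROM blogpost_blogpost')
        for row in cursor:
            yield row
```

Note that PostgreSQL only allows (non-holdable) server-side cursors inside a transaction, so under Django's default autocommit mode you would likely need to wrap the consuming loop in transaction.atomic().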