Strategies for reducing memory usage in Django migrations

Strategies for reducing memory usage in Django migrations

If you’ve ever written a data migration on a large table in Django, you may have run into problems running out of memory on the machine running the migration. This post is a collection of strategies for reducing memory usage during Django migrations. They’re ordered from simpler to more complex, so try the strategies at the top first.

In order to gain a rough sense of the magnitude of effect that each of these techniques can have, I created a toy dataset of blog posts and authors. The model looks like this:

class BlogPost(models.Model):
    title = models.CharField(max_length=255)
    author = models.CharField(max_length=60)

I inserted four million rows of test “blog posts” into a database using:

INSERT INTO blogpost_blogpost (title, author) SELECT md5(random()::text), md5(random()::text)
FROM (SELECT generate_series(0, 4000000)) as t;

I’ll include an example using this dataset for each technique. I’m running them on a VM with 3GB of RAM.

Use update() instead of a loop

Django querysets offer an update() method which will update every item matched by the queryset in one database query. In other words, you can replace:

for obj in MyModel.objects.filter(owner=bob):
    obj.owner = sharon
    obj.save()

with something like MyModel.objects.filter(owner=bob).update(owner=sharon).

Try this if …

You’re just making a simple update to some of your model’s fields. In that case, calling update() is far more efficient than iterating over a queryset with a for-loop.

Warnings

This won’t work in all cases; if you need to make more complex changes to your data, or if you need to alter values on related objects, update() won’t work.

Example

The following runs out of memory:

In [1]: for b in BlogPost.objects.all():
...: b.author = "Me, myself, and I"
...: b.save()
...:
Segmentation fault (core dumped)

But this doesn’t:

In [1]: BlogPost.objects.all().update(author="Me, myself, and I")
Out[1]: 4000000

In fact, the Django shell’s memory usage doesn’t change during execution of the query.

Use iterator()

Django querysets normally perform a certain amount of caching in the background; this is usually a good thing because it prevents unnecessary extra queries to the database. However, in the case of a migration on a table with a large number of rows, the queryset cache can use up a lot of memory.

To work around this, querysets offer an iterator() function which will force the queryset not to cache any data. Since data migrations generally iterate through every row in the database exactly once, this is a perfect use-case for iterator().

Want to work on projects with a social and civic impact? Learn what it’s like to work at Azavea.

Visit our career site

To use it, change your migration function to look something like this:

for obj in MyModel.objects.iterator():
    # do something with obj

Try this if …

You have a more complex use case that update() can’t solve, and you’re sure that you only need to iterate through all your objects once.

Warnings

Using iterator() causes the queryset not to cache any data, so it can hurt performance if used improperly. Think carefully before using it outside of migrations.

Example

This runs to completion using the test data above. However, it still causes the Django shell process to allocate roughly 500MB of additional memory, and it’s much slower than the update() method.

for b in BlogPost.objects.all().iterator():
    b.author = "Me, myself, and I"
    b.save()

Set settings.DEBUG = False

It used to be that versions of Django prior to 1.8 would log all executed queries in memory, which would eventually cause the Django process to run out of memory during migrations which made numerous queries. This is less relevant now because Django 1.8 caps the number of logged queries at 9000. However, it may still be helpful in certain circumstances.

Try this if …

You’re running a Django version less than 1.8 or have really large objects or a very memory-constricted environment.

Warnings

This only affects development environments because you shouldn’t be running with DEBUG = True on production. You should also upgrade if you’re running Django less than 1.8; older versions of Django are no longer supported.

Example

I tried running the previous iterator() example with DEBUG set to True and then False on Django 1.8 to see if there was a noticeable difference in memory usage, but there wasn’t. Perhaps with objects that were much larger it might make a difference, but it seems like Django 1.8 has made this a non-solution in many cases.

Use only() or defer()

Sometimes, even iterator() isn’t enough to prevent memory problems when dealing with large tables. This is most likely to happen when you have a more complex data model that causes each row to contain a lot of data.

In these cases, you can use only() and defer() on your queryset to limit the database columns that Django loads from the database.

In other words, instead of running a query similar to SELECT * FROM myapp_mymodel;, using only('field1') will result in SQL like SELECT id, field1 from myapp_mymodel; (the primary key always gets selected so Django can keep track of which object is which). If you have a lot of fields on your model, or fields that result in a lot of data being loaded into memory, then modifying your migration to look something like this can save memory (and is likely to speed things up):

for obj in MyModel.objects.only('field1').iterator():
    # do something with obj

In this case, the remaining fields on each MyModel instance are deferred, which means they will be loaded lazily by Django whenever they are accessed. If you never use these fields then they won’t be loaded at all.

.defer() does the opposite, causing only the fields passed to it to be deferred. For example, defer('field1') will defer field1, but no other fields.

Try this if …

Your migration uses only a small number of model fields and you’ve tried all the strategies above.

Warnings

You need to be careful that your migration doesn’t inadvertently use any of the deferred fields; each time you do this, Django will make another query to the database to pull in the deferred data, which could cause a significant slowdown compared to loading all necessary model fields in one query.

Example

I added a defer to the previous iterator example:

for b in BlogPost.objects.all().defer('title').iterator():
    b.author = "Me, myself, and I"
    b.save()

On my toy data set, deferring the unused field reduced the additional memory allocated by the Django shell process from 500MB to about 300MB, or a little less than a 50% reduction. This seems reasonable since each object consists of an integer and two strings of equal size, and we’re omitting one of the strings.

Server-side cursors

Depending on which database you’re using, even if you disable all Django data caching, you may still run into memory problems if the underlying database driver caches data. In this case, you may need to access the database driver directly in order to use a server-side cursor instead. Luckily, it looks like native support for server-side cursors is under active development, so this feature may be available in Django itself soon.

Try this if …

You have truly gigantic amounts of data and nothing else has worked.

Warnings

In order to go this route you’ll have to drop into writing raw SQL, which can be less pleasant and harder to maintain than working through the Django ORM, so only try this as a last resort.

Example

I wouldn’t recommend this except in extreme cases but there is a good example at the link below.

Other resources


Want to join our team?

We’re committed to applying geospatial technology for civic and social impact while advancing the state-of-the-art through research.

Learn more about our project work and sign up for job notifications to be the first to hear about open positions!