Blog

Thomas Wanschik on January 19, 2010

Sharding with Django on App Engine

When developing a scalable application for App Engine you need to pay attention to how often a database entity is updated because you can only update any single entity with a maximal frequency of 1-5 times per second. If the frequency of updates for any single entity is higher than this limit you can expect your application to have contention. In order to prevent such situations you can use a technique called sharding. Sharding takes advantage of the datastore's ability to handle many parallel updates on multiple distinct entities efficiently and to handle reads much faster than writes.

Let's get to the example of a simple counter for which the frequency of updates is too high for a single entity (such a counter could be used for counting the number of views for a YouTube video):

class SimpleCounterShard(models.Model):
    """Shards for the counter"""
    count = models.IntegerField(default=0)
    name = models.CharField(primary_key=True,
        max_length=500)

    NUM_SHARDS = 20
    @classmethod
    def get_count(cls):
        """
        Retrieve the value for a given sharded counter.
        """
        total = 0
        for counter in SimpleCounterShard.objects.all():
            total += counter.count
        return total

    @classmethod
    @commit_locked
    def increment(cls):
        """
        Increment the value for a given sharded counter.
        """
        index = random.randint(0,
        SimpleCounterShard.NUM_SHARDS - 1)
        shard_name = 'shard' + str(index)
        counter = SimpleCounterShard.objects.get_or_create(
            pk=shard_name)[0]
        counter.count += 1
        counter.save()

The basic idea is to divide the counter into N sub-counters and to compute the counter's real value (get_count) by summing over all values of the N sub-counters. While trying to increment the counter's real value (increment_count) we select one of the N sub-counters at random. As a result we avoid contention. By increasing the number of shards you can increase the maximum throughput.

There exists an App Engine specific article about sharding counters from Joe Gregorio. If you want to learn more about sharding counters take a look at that article.

I ported the code (including GeneralCounterShard) from that article to native Django such that it can theoretically be used for any database supporting optimistic transactions. The code is available here and you can test the live demo.

Django's advantage

Now we can get to an exciting example of how Django's ORM can get us out of the stone age of non-relational database development practice. Let's see an example using F() objects:

YouTubeVideo.objects.filter(pk=keyname).update(
    views_count=F('views_count')+1)

The database backend can detect such "counting updates" and use sharding for App Engine or other techniques like updates via background tasks for other databases automatically in order to perform such an update. Thus Django gives us the advantage of an additional layer of abstraction. Moreover, we can formulate "counting updates" independently of the database used, so we can switch to a different database without having to change the code.

Keep in mind that sharding is a technique which can be used for much more than just counters. This is our first App Engine app ported to native Django. If you plan to port some other App Engine app to Django you can use the sharding counter port to get an idea of how to do it. And please let us know about it.