JOINs via denormalization for NoSQL coders, Part 2: Materialized views

In part 1 we discussed a workaround for JOINs on non-relational databases using denormalization in cases for which the denormalized properties of the to-one side don't change. In this post we'll show one way to handle JOINs for mutable properties of the to-one side i.e. properties of users.

Let's summarize our current situation:

We have users (the to-one side) and their photos (the to-many side)
Photos contain their users' gender in order to use it in queries which would need JOINs otherwise

It's obvious that a solution for the problem of mutable properties on the to-one side has to keep denormalized properties up to date i.e. each time the user changes his/her gender (or more likely her hair color ;) we have to go through all of the user's photos and update the photos' denormalized gender. It is clear that we get into problems here if a user has thousands of photos to update because such an update would take too long. We need a scalable way to deal with such updates.

Background tasks to the rescue

One way to solve the update-problem is to start a background task each time a user changes his/her gender. It's clear that this solution comes with eventual consistency i.e. changes won't be visible immediately. Nonetheless in most cases that's acceptable and normally background tasks are freaking fast.

Using background tasks in order to update all photos of a given user isn't as simple as it seems. The devil is in the details. Let's assume that a background task tries to update a photo (i.e. some denormalized property of the user) while a user is editing some property of the same photo at the same time i.e. the photo's title for example. In such a scenario it can happen that changes of the user will be overwritten by the background task (or vice versa). To see this take the example of a user changing a photo's title:

http://lh3.ggpht.com/_8v0Ka-uUQOY/TKBsCOeS-bI/AAAAAAAAAMI/wnyIEaobdEw/BGTask-User-conflict.jpg

Background task gets a photo out of the database in order to update the denormalized gender (task holds version A of the photo)
submit_view gets the same photo out of the database in order to update its title (view holds version A of the photo too)
Background task saves the photo (version B of the photo is saved i.e. denormalized gender updated)
submit_view finished denormalization updates and saves the photo (version C of the photo is saved i.e. photo's title updated)

The problem here is that version C doesn't contain the background task's changes i.e. updates to the denormalized gender (contained in version B) because the submit_view fetched the photo (version A) before the background task saved its changes to the user's denormalized gender. Thus the submit_view never knows about the background task's changes and overwrites them.

One way out of this problem is to use transactions. However this would force us to use transactions in background tasks as well as for all saves to photos in order to avoid such situations. This can slow down our high-traffic web site and forces us to remember to use transactions whenever we want to update a photo. Additionally Django's transactions aren't optimistic so we have to remember to use QuerySet.update(). Also only few non-relational databases support optimistic transactions or atomic UPDATE operations.

Materialized views

Another way to solve this problem is to introduce a third model containing a one-to-one relation to the Photo model. The basic idea behind this is to separate information used for querying (i.e. properties of photos used in queries) and the entities actually containing the information we care about (i.e. the photos itself).

class PhotoUserView(models.model):
    # denormalize all data of the photo
    denormalized_photo_title = models.CharField(max_length=100)
    denormalized_photo_popularity = models.CharField(max_length=100)
    denormalized_photo_user = models.ForeignKey(User)
    ....
    # copy the user's gender into denormalized_gender
    denormalized_user_gender = models.CharField(max_length=10)
    # one-to-one relation used as primary key
    photo = models.OneToOneField(Photo, primary_key=True)

As you can see all fields of the Photo model are being denormalized into the PhotoUserView as well as the gender of the user. This doubles the amount of storage used because we have to store an additional entity for each photo , but storage is cheap so this doesn't represent any disadvantage.

http://lh4.ggpht.com/_8v0Ka-uUQOY/TKBnm9dYB2I/AAAAAAAAALk/yDJ3R5QRH84/Materialized%20view.png

Of course PhotoUserView points to User too because it contains the denormalized foreign key denormalized_user from the photo model. The figure doesn't contain this link because it doesn't help in understanding the concept of materialized views.

Given this model we can use the following efficient query to get photos for which we would've needed JOINs before

photo_pks = PhotoUserView.objects.filter(
    denormalized_user_gender='female',
    denormalized_photo_popularity='high'
    ).values('pk')
female_user_photos = Photo.objects.filter(pk__in=photo_pks)

The trick here is that we use the one-to-one field as the primary key for entities of PhotoUserView so we only need to get their primary keys in order to fetch photos we are interested in. This technique is similar to the relation index (see Building Scalable, Complex Apps on App Engine, same technique used in nonrel-search too). Additionally on App Engine and most other NoSQL databases the pk__in filter can efficiently batch-get all users. Queries which wouldn't need JOINs can still be done on the Photo model directly

popular_photos = Photo.objects.filter(popularity='high')

The important advantage of materialized views is that we don't have conflicts between users editing properties of their photos and background tasks updating denormalized properties anymore. Let's take a closer look at why that's the case: if a user changes his/her gender, the corresponding background task has to update the denormalized gender of all entities on the PhotoUserView model now and not on the Photo model. As a result changes by users editing photos at the same time won't get into conflicts with background tasks updating PhotoUserView entities anymore.

However changes to photos have to start their own background tasks too now in order to keep all denormalized properties in PhotoUserView of the Photo model up to date.

So basically all comes down to the following situations:

User edits properties of a photo => We have to start a background task to update all denormalized properties of the corresponding PhotoUserView entity
User changes its gender => We have to start background tasks to update the denormalized gender of all PhotoUserView entities who point to that specific user

As a result materialized views solve the problem of having to use transactions whenever we want to update a photo.

Now you might object that background tasks will eat too much bandwidth and that map/reduce would be more efficient. However, unless you use CouchDB views, map/reduce isn't incremental. What we've built here is a materialized view which updates only the affected entities. In contrast, map/reduce implementations like in MongoDB rebuild the whole index and that requires a lot more resources if you have a large and popular website. As an optimization, if your database supports transactions or atomic UPDATE operations you can get rid of materialized views, but then you have to be disciplined about using transactions/QuerySet.update() absolutely everywhere in your code. This becomes a problem if you want to reuse existing Django apps. Most of them use Model.save() which isn't safe. Materialized views don't have these disadvantages.

In the next post we'll explain in detail when to start background tasks in order to avoid update conflicts and how to handle evil situations like database crashes.

Blog

Background tasks to the rescue

Materialized views