JOINs via denormalization for NoSQL coders, Part 2: Materialized views
In part 1 we discussed a workaround for JOINs on non-relational databases using denormalization in cases for which the denormalized properties of the to-one side don't change. In this post we'll show one way to handle JOINs for mutable properties of the to-one side i.e. properties of users.
Let's summarize our current situation:
- We have users (the to-one side) and their photos (the to-many side)
- Photos contain their users' gender in order to use it in queries which would need JOINs otherwise
It's obvious that a solution for the problem of mutable properties on the to-one side has to keep denormalized properties up to date i.e. each time the user changes his/her gender (or more likely her hair color ;) we have to go through all of the user's photos and update the photos' denormalized gender. It is clear that we get into problems here if a user has thousands of photos to update because such an update would take too long. We need a scalable way to deal with such updates.
Background tasks to the rescue
One way to solve the update-problem is to start a background task each time a user changes his/her gender. It's clear that this solution comes with eventual consistency i.e. changes won't be visible immediately. Nonetheless in most cases that's acceptable and normally background tasks are freaking fast.
Using background tasks in order to update all photos of a given user isn't as simple as it seems. The devil is in the details. Let's assume that a background task tries to update a photo (i.e. some denormalized property of the user) while a user is editing some property of the same photo at the same time i.e. the photo's title for example. In such a scenario it can happen that changes of the user will be overwritten by the background task (or vice versa). To see this take the example of a user changing a photo's title:
- Background task gets a photo out of the database in order to update the denormalized gender (task holds version A of the photo)
- submit_view gets the same photo out of the database in order to update its title (view holds version A of the photo too)
- Background task saves the photo (version B of the photo is saved i.e. denormalized gender updated)
- submit_view finished denormalization updates and saves the photo (version C of the photo is saved i.e. photo's title updated)
The problem here is that version C doesn't contain the background task's changes i.e. updates to the denormalized gender (contained in version B) because the submit_view fetched the photo (version A) before the background task saved its changes to the user's denormalized gender. Thus the submit_view never knows about the background task's changes and overwrites them.
One way out of this problem is to use transactions. However this would force us to use transactions in background tasks as well as for all saves to photos in order to avoid such situations. This can slow down our high-traffic web site and forces us to remember to use transactions whenever we want to update a photo. Additionally Django's transactions aren't optimistic so we have to remember to use QuerySet.update()
. Also only few non-relational databases support optimistic transactions or atomic UPDATE operations.
Materialized views
Another way to solve this problem is to introduce a third model containing a one-to-one relation to the Photo
model. The basic idea behind this is to separate information used for querying (i.e. properties of photos used in queries) and the entities actually containing the information we care about (i.e. the photos itself).
class PhotoUserView(models.model):
# denormalize all data of the photo
denormalized_photo_title = models.CharField(max_length=100)
denormalized_photo_popularity = models.CharField(max_length=100)
denormalized_photo_user = models.ForeignKey(User)
....
# copy the user's gender into denormalized_gender
denormalized_user_gender = models.CharField(max_length=10)
# one-to-one relation used as primary key
photo = models.OneToOneField(Photo, primary_key=True)
As you can see all fields of the Photo
model are being denormalized into the PhotoUserView
as well as the gender of the user. This doubles the amount of storage used because we have to store an additional entity for each photo , but storage is cheap so this doesn't represent any disadvantage.
Of course PhotoUserView
points to User
too because it contains the denormalized foreign key denormalized_user
from the photo model. The figure doesn't contain this link because it doesn't help in understanding the concept of materialized views.
Given this model we can use the following efficient query to get photos for which we would've needed JOINs before
photo_pks = PhotoUserView.objects.filter(
denormalized_user_gender='female',
denormalized_photo_popularity='high'
).values('pk')
female_user_photos = Photo.objects.filter(pk__in=photo_pks)
The trick here is that we use the one-to-one field as the primary key for entities of PhotoUserView
so we only need to get their primary keys in order to fetch photos we are interested in. This technique is similar to the relation index (see Building Scalable, Complex Apps on App Engine, same technique used in nonrel-search too). Additionally on App Engine and most other NoSQL databases the pk__in
filter can efficiently batch-get all users. Queries which wouldn't need JOINs can still be done on the Photo
model directly
popular_photos = Photo.objects.filter(popularity='high')
The important advantage of materialized views is that we don't have conflicts between users editing properties of their photos and background tasks updating denormalized properties anymore. Let's take a closer look at why that's the case: if a user changes his/her gender, the corresponding background task has to update the denormalized gender of all entities on the PhotoUserView
model now and not on the Photo
model. As a result changes by users editing photos at the same time won't get into conflicts with background tasks updating PhotoUserView
entities anymore.
However changes to photos have to start their own background tasks too now in order to keep all denormalized properties in PhotoUserView
of the Photo
model up to date.
So basically all comes down to the following situations:
- User edits properties of a photo => We have to start a background task to update all denormalized properties of the corresponding
PhotoUserView
entity - User changes its gender => We have to start background tasks to update the denormalized gender of all
PhotoUserView
entities who point to that specific user
As a result materialized views solve the problem of having to use transactions whenever we want to update a photo.
Now you might object that background tasks will eat too much bandwidth and that map/reduce would be more efficient. However, unless you use CouchDB views, map/reduce isn't incremental. What we've built here is a materialized view which updates only the affected entities. In contrast, map/reduce implementations like in MongoDB rebuild the whole index and that requires a lot more resources if you have a large and popular website. As an optimization, if your database supports transactions or atomic UPDATE operations you can get rid of materialized views, but then you have to be disciplined about using transactions/QuerySet.update()
absolutely everywhere in your code. This becomes a problem if you want to reuse existing Django apps. Most of them use Model.save()
which isn't safe. Materialized views don't have these disadvantages.
In the next post we'll explain in detail when to start background tasks in order to avoid update conflicts and how to handle evil situations like database crashes.