Lab 2

Let's kick it up a notch with some Django Debug Toolbar!

Lab 2.1

Change to the correct branch:

git checkout lab-2.1

Report

The site seems to be running slower lately. Please make the site fast again!

To reproduce:

Browse to the posts page.
Browse to a post's page.
Browse to the posts admin page.
Why are these slow?

Note, given that we're dealing with SQLite locally, the "slowness" is largely imaginary, so please play along. Additionally, the Post detail view has caching. The cache can be cleared by opening a Django shell (python -m manage shell) and running:

from django.core.cache import cache
cache.clear()

Facts

Let's consider what we know:

The pages render correctly and there are no errors.
The pages rendered "fast" earlier, but over time as the data set has grown, they have slowed down.

Investigation

What queries are running?
- Use the Django Debug Toolbar's SQL Panel
- In a properly working case, the count should be relatively low, we're only rendering one type of data on the page. How many do we see?
Should a query be using an index?
- Fields that are likely to qualify for indexes are those that are used in filtering and ordering.
- Use the SQL Panel's "Explain" button to view the database's breakdown of the query. Look for portions that use SCAN without mention of an index. This means they are iterating over the entire table comparing all the values.

Hints

Review each of the pages individually. Starting with the posts page, we'll see that the SQL panel of the toolbar indicates there are over 100 queries being run on this page. Clicking on the SQL panel shows a duplicated SQL query such as:

SELECT ••• FROM "newsletter_category" INNER JOIN "newsletter_post_categories" ON ("newsletter_category"."id" = "newsletter_post_categories"."category_id") WHERE "newsletter_post_categories"."post_id" = 2654 ORDER BY "newsletter_category"."title" ASC 100 similar queries.

This suggests we would need to use .select_related() or .prefetch_related(). Which depends on the data model. Since these queries are fetching the category, and we know that Post.categories is a ManyToManyField, we'll need to use .prefetch_related(). Since the categories need to be specifically ordered, you'll need to use a Prefetch instance rather than just a string.

Moving onto an individual post page, the SQL panel only shows 1 query. Clicking into that panel, we'll see some really low value, like 5ms. This in pretty much every definition is fast. However, let's trust the reporter that this is actually slow. The next step is to understand what about this query is taking so much time. Clicking the Explain button at the end of the row for the query will show the SQL query that was executed as well as the result of the Explain statement. The Detail column shows SCAN newsletter_post which means SQLite is looking at every row individual of the table to find the post with the given slug. This is very inefficient. Adding an index will solve this. This can either be done with db_index on the model field or setting the Meta.indexes on the model.

Checking the posts admin page, we can see the SQL panel is showing over 100 queries again. This means we need to use .prefetch_related() again. A likely spot to use it would be in the def get_queryset(self, request, *args, **kwargs) method.

Conclusion

Admittedly, this is a hard area to know the "fix" for. Performance optimization is a never ending, relentless battle.

Post listing admin

The admin page suffers from a N+1 problem and needs to make use of prefetch_related since it renders each category of the post on the page. This can be chained on the QuerySet by overriding ModelAdmin.get_queryset. The need for prefetch_related is evident from the 100 duplicated queries that are fetching data from the table newsletter_post_categories. That table is the intermediate table used with a models.ManyToManyField. There is a slight wrinkle in that the categories are being rendered in order of the categories' titles. To push that to the database, you must use a Prefetch object that specifies:

def get_queryset(self, request, *args, **kwargs):
    return super().get_queryset(request, *args, **kwargs).prefetch_related(
        Prefetch(
            'categories',
            queryset=Category.objects.all().order_by('title')
        )
    )

This means PostAdmin.categories_list would change to:

@admin.decorators.display(description="Categories")
def categories_list(self, obj):
    return ", ".join(category.title for category in obj.categories.all())

There should also be an indexes on created. Arguments could be made to add indexes on the fields that are search-able and is_published. I would stop at created because that's used on the default loading. Any other cases would need to be proven as common occurrences.

Post listing

The listing of posts also suffers from an N+1 problem, though it could use the more simple approach of prefetch_related("categories"). This is because the template posts/includes/list_item.html doesn't order the categories in {% for category in post.categories.all %}.

There are two facets to this view that could benefit from an index. The ordering based on Coalesce('publish_at', 'created') and the filter on is_published=True. This scenario could benefit from a multi-column index. The typical maxim is to apply that index on the most generic columns first, to the more specific. In this case, it would first be is_published, then the coalesced datetime fields:

This would look like:

class Post(TimestampedModel):
    # ...
    class Meta:
        # ...
        indexes = [
            models.Index(
                F("is_published"), Coalesce("publish_at", "created").desc(),
                name='published_posts_idx'
            )
        ]

Interestingly enough (I'm not a DBA), the above only covers the count portion of the view. The pagination that selects the data does not hit this index and requires a different index.

class Post(TimestampedModel):
    # ...
    class Meta:
        # ...
        indexes = [
            models.Index(
                Coalesce("publish_at", "created").desc(),
                name='recent_first_idx'
            )
        ]

As a peer, I would probably resort to only using recent_first_idx and waiting until the site slowed down and required a more finely tuned index. However, there is an argument to be made that both are necessary.

Post detail

This view generates a fixed number of queries, but does a scan on the entire table for the slug field, SCAN TABLE newsletter_post. The solution here is to either use CharField(..., db_index=True) or switch to SlugField(...) which automatically creates an index under the hood.

Further consideration

Please keep in mind to avoid pre-optimizations. This exercise exists to help you go to the furthest extent possible in those cases where we need to squeeze all the speedy goodness out of the application.

That said, consider the application(s) you work on. What are the most frequently used parts? What do you have in place to catch slowness? Could you benefit from using assertNumQueries(...)?

Lab 2.2

Please skip this lab. The Django Debug Toolbar's Profile panel isn't compatible at this time

Change to the correct branch:

git checkout lab-2.2

Report

The analytics view specifically is running slow. Can you take a look at it?

To reproduce:

Browse to the analytics page.
Why has this become slow?

Note, if you're running a nice machine, this slowness may be imaginary. So please humor me and pretend it's slow.

Facts

Let's consider what we know:

The page renders correctly and there are no errors.
The page rendered "fast" when it was first introduced, but has since slowed down.

Investigation

Do we have a fixed number of queries running?
Are the queries running in a reasonable amount of time?
Is there a section of code that's running slowly?
- Enable the Profiling panel and refresh the page.
  - You may need to increase the value of PROFILER_THRESHOLD_RATIO in the settings.py file to find the culprit in project/newsletter/views.py.
- If the slowness isn't obviously from the SQL panel, we need to see where time is being spent in the application.
- Look for long durations or a high number of iterations.
- Can the code be refactored to run more efficiently?

Conclusion

The solution here is to rework the entire Analytics view to push the calculations into the database. The view is doing simple filter and aggregate work so it makes little sense to pull the entire model instance out of the database for each row. Additionally, it's creating a new datetime instance in determine_buckets and performing three logical comparisons for every Post and Subscription. Not to mention analyze_categorized_model does another three logical comparisons for each instance.

The Profiling panel highlights the lines of code that exist in our project to help draw our eyes to where the problem is most likely caused.

Further consideration

Similar to the previous lab, knowing when to use a profiler can be difficult. Rather than applying it everywhere, it makes the most sense to start with either knowingly slow areas or critical areas. Take a moment to reflect on your own coding journey and any times that a profiler would have been helpful.

Lab 2.3

Change to the correct branch:

git checkout lab-2.3

Report

I think the analytics view is broken. The values don't match what I'd expect to see, can you look into them?

To reproduce:

Browse to the analytics page.
It doesn't work.

Facts

Let's consider what we know:

The page is rendered correctly, but the data may be wrong.
It's unknown if this was ever working correctly, but it certainly is wrong now.

Investigation

What values should we expect to see on the analytics page?

For this you need to inspect the database somehow, here are a few options.

Open up a repl with python -m manage shell and query the data.

from datetime import timedelta
from django.utils import timezone
from project.newsletter.models import Post, Subscription
print(
    "Posts",
    Post.objects.filter(
        created__gte=timezone.now() - timedelta(days=180)
    ).count()
)
print(
    "Subscriptions",
    Subscription.objects.filter(
        created__gte=timezone.now() - timedelta(days=180)
    ).count()
)

Use the SQLite browser by opening the file directly (you must have SQLite already installed).

SELECT COUNT(*) FROM newsletter_post WHERE created >= date('now','-180 day');
SELECT COUNT(*) FROM newsletter_subscription WHERE created >= date('now','-180 day');

Open a SQLite shell with python -m manage dbshell (you must have SQLite already installed).

SELECT COUNT(*) FROM newsletter_post WHERE created >= date('now','-180 day');
SELECT COUNT(*) FROM newsletter_subscription WHERE created >= date('now','-180 day');

What is different in the query that causes the Post count to return correctly, but not the Subscription count?
- Use the SQL Panel to inspect the query that is counting the objects.
- Click on the "+" button on the left to expand the query.
- It may be easier to read clicking on the "Select" button.
- You can also improve the readability by copying and pasting the query into an online formatter such as sqlformat.org
Does removing categories__isnull=False from the Subscription QuerySet resolve the issue?
Does adding categories__isnull=False to the Post QuerySet cause an unexpected result?

Hints

If you've gone through all the steps of the investigation, you should have determined that the inclusion of the join, LEFT OUTER JOIN "newsletter_subscription_categories" causes duplicates. This is because the joined table may have multiple matches for any subscription row causing the COUNT function to find more than one, leading to an inflated count.

This can be fixed by using an appropriate GROUP BY clause in the SQL. To use this in the ORM, we can either inspect the code or use the docs to see what options Django ORM'sCount provides us. You can use the docs or inspect the code (right-click on Count and choose "Go To Definition") in your IDE if you're using PyCharm or VSCode.

We can see that Count subclasses Aggregate which has distinct as a param. This matches the documentation as well.

Conclusion

The solution here is to use distinct=True in the call to Count. Traversing many-to-many relationships or reverse foreign key relationships can be tricky. You don't want to use distinct() and distinct=True everywhere because if it's unnecessary, you're needlessly slowing down your application.

This problem highlights one of the underlying issues with using an ORM. It abstracts away the underlying SQL and makes it easy to misunderstand what the database is doing.

Further consideration

This bug is very insidious, it's easy to miss in development and a code review. It's possible to write a test that misses it if the data is setup such that there is only one category for each subscription. Finally, a developer may not be familiar with the data to know when a value "looks wrong" resulting in the bug being found downstream by the actual users. Can you think of some practices that would help avoid this bug?

Lab 2.4

Change to the correct branch:

git checkout lab-2.4

Report

Thank you for adding caching to the site recently, but I think you broke something. A post that shouldn't be public is available to the public now.

To reproduce:

Log into your staff account and browse to the published posts.
Use an incognito window to also view the published posts.
In the incognito window, click to read a post.
In the staff authenticated window, click "Set to private" for the post opened in the incognito window.
In the incognito window, refresh the page.
This should 404, but it's still available to the public.

Facts

Let's consider what we know:

The page was properly requiring authenticated users for private posts before caching was added.
We only set the cached response for view_post when a Post is public.
We are clearing the cache for a Post when it's updated via a signal handler in receivers.py.

Investigation

Why does the post still return despite being private?
- Use the Django Debug Toolbar's Cache Panel to inspect what cache operations are occurring.
Does the page return a 404 when the cache isn't set?
- Open a Django shell python -m manage shell:
```
from django.core.cache import cache
cache.clear()
```
- Refresh the page to see if it 404's.
Does updating a post to be private via the update view result in the cache being busted?

Hints

You can edit a post via the "Edit" link near the top of the detail page. Otherwise the URL is http://127.0.0.1:8000/post/<post_slug>/update/
Save the post.
- The response is a redirect to avoid multiple posts.
- However, this means the toolbar is presenting you the data for the 301 redirect response, not your POST request.
Click on the History Panel
Find the POST request to the .../update/ URL and click "Switch".
Click on the Cache Panel and view the operations.

Does "Set to private" / "Set to public" delete the cache instance?
1. Browse to the post listing page.
2. Click "Set to private" or "Set to public"
3. Click the History Panel.
4. Find the POST to the .../toggle_privacy/ URL and click "Switch".
5. Inspect the Cache Panel for operations.

At this point it's clear there's a difference in functionality between how toggle_post_privacy and update_post save the data changes. toggle_post_privacy updates the instance using a QuerySet.update() call while update_post uses PostForm.save() which under the hood is doing something like Model.save(). If we read the documentation for update we'll see that there's a note that the post_save signal isn't emitted. This explains why toggle_post_privacy doesn't delete the cached value because the on_post_save signal receiver in receivers.py is only listening on post_save.

Conclusion

The root cause here is that Post.objects.filter(...).update() does not trigger the post_save signal. This means the cache key is not deleted leading to posts continuing to be publically available.

There are a couple of solutions. One is to clear the cache in toggle_post_privacy. Another would be to avoid using .update(), but only change the public field via:

post = get_objects_or_404(...)
post.is_public != post.is_public
# Include updated to keep our updated timestamp fresh.
post.save(update_fields=['is_public', 'updated'])

The above will only change the fields we intended to change. It does mean a second database query (1 to fetch, 1 to update), but that's pretty minor.

A third option is to stop using on post_save for cache invalidation and to handle all that logic manually within functions in the operations.py module. This approach has more philosophical implications that you'd need to sort out. Such as, what do you do about ModelForm instances since they mutate the database?

Further consideration

You probably already know the typical questions I am going to ask. So I won't ask them. After using the Django Debug Toolbar throughout this lab, what would make your life better as a developer? Could you find some time to submit an issue and open a PR?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lab 2

Lab 2.1

Report

Facts

Investigation

Conclusion

Post listing admin

Post listing

Post detail

Further consideration

Lab 2.2

Report

Facts

Investigation

Conclusion

Further consideration

Lab 2.3

Report

Facts

Investigation

Conclusion

Further consideration

Lab 2.4

Report

Facts

Investigation

Conclusion

Further consideration

FilesExpand file tree

lab2.md

Latest commit

History

lab2.md

File metadata and controls

Lab 2

Lab 2.1

Report

Facts

Investigation

Conclusion

Post listing admin

Post listing

Post detail

Further consideration

Lab 2.2

Report

Facts

Investigation

Conclusion

Further consideration

Lab 2.3

Report

Facts

Investigation

Conclusion

Further consideration

Lab 2.4

Report

Facts

Investigation

Conclusion

Further consideration