Database optimization techniques you can actually use

MySQL, PHP, Web development | March 25th, 2011 by Eran Galperin

I just saw an article on Smashing Magazine titled "Speeding up your website's database". I love Smashing's contribution to the webdev community, but their articles are getting longer and more basic at the same time.

I understand the need for simplicity given Smashing Magazine's wide audience, but I wish they'd give something more than the absolute basics you could find on almost any other site out there. I also didn't like some of the methods mentioned there for profiling (or the code itself), so here is my starter guide to optimizing database performance.

When do we optimize the database?

As noted in that article, page load speed is important. It affects the user experience, as well as your Google ranking when it gets too slow. There are many variables to account for when trying to improve page load: page download weight (including assets such as images, JavaScript and CSS), network latency, browser cache and server headers, and server load (requests per second, memory and CPU usage), among others. Yahoo has a very nice guide of client-side performance tips. For the purposes of this article we're going to suspect the database as the culprit, but you should first look at the complete picture before deciding what to optimize.

Aside from page load speed, a busy database can affect the rest of the server as well, meaning pages that don't use the database, or that run very fast queries, could start to slow down too.

Profile first, optimize last

The basic rule of optimization is to never assume - always verify, using actual data. The process of collecting performance metrics and determining performance issues is called profiling. We want to know whether database performance is responsible for a significant part of our page load time.

Referring again to the Smashing Magazine article: the author suggests a profiling method that is basically correct, but the implementation leaves a lot to be desired. We won't go into why using globals and outputting inside functions is bad practice - the author even mentions that this could seriously mess up the layout of the site or the sessions, yet makes no attempt to offer a better solution.

We want to time how long queries take to run. There are plenty of timing solutions out in the open - such as PEAR's Benchmark package - so there is simply no need to build your own unless you want the exercise. The concept is simple: store microtime() values before and after the query, and the difference gives the query's running time with good accuracy.
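A minimal hand-rolled sketch of the idea, for illustration only (the $timings array keyed by query text is my own convention, not a library feature):

    // Time a single query by hand - a sketch, not a reusable profiler
    $timings = array();

    $query = 'SELECT id, name FROM categories';
    $start = microtime(true);                    // true returns a float instead of a string
    $rows  = $db->fetchAll($query);
    $timings[$query] = microtime(true) - $start; // elapsed seconds, keyed by the query text

Doing this by hand for every query would be tedious, though.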

If you are using a database abstraction class (and you should), incorporating a timer to profile every query should be a piece of cake - there's no need to hunt down every query and modify the code around it as suggested in the SM article. Wrap the query method of your abstraction class with the timer and use the queries as the keys in the timing array. We used the Zend Framework for all of our previous projects and for our current startup, and it comes with built-in support for profiling, which makes it a breeze to get started.

Example code using Zend_Db_Profiler

    $db = Zend_Db::factory('PDO_MYSQL', $config); // Set up the database object
    $db->getProfiler()->setEnabled(true);         // Turn on the profiler

    // Queries are performed on the page
    // ...

    // Where we want to show the results
    $profiles = $db->getProfiler()->getQueryProfiles(); // An array of all the query profiles

If you are using Firebug (which you should, if you are running Firefox), you can install the FirePHP plugin for completely unobtrusive output. The Zend Framework comes with a FirePHP profiler that can send query profiles directly into your Firebug console.

    $profiler = new Zend_Db_Profiler_Firebug('All DB Queries');
    $profiler->setEnabled(true);
    $db->setProfiler($profiler);

    // Queries are performed on the page
    // ...
    // No need to output - query profiles will appear in your Firebug console

Pretty convenient. We can go over the site page by page without disturbing content or messing up sessions, and even run this in a production environment, provided we load the profiler only for specific users.

It's very important to profile using a relevant dataset. If you profile on a development machine whose data is very different from your production machine's (with probably much smaller tables), you will get very different results. You should set up a test machine that resembles your live dataset as closely as possible (as I've shown in an older article about profiling).

Another important note: avoid looking at cached results. MySQL will cache certain queries, so verify your profiling results by re-running the queries while bypassing the cache, using SQL_NO_CACHE and other means. Compare the first run with subsequent runs of the same query to be sure you are seeing non-cached results.
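For example, the keyword goes right after SELECT (the query itself is just an illustration):

    -- Bypass the MySQL query cache for this run, so we measure real execution time
    SELECT SQL_NO_CACHE id, name FROM categories;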

Aside from profiling queries in real time, we can also profile queries that are used by daemons and cron jobs and log the results to a file. MySQL has a built-in feature that logs slow queries for us while the database daemon is running. As of MySQL 5.1.21 we can get microsecond timing on queries (previously only one-second resolution was supported), so we can get very good measurements with the slow query log.

The slow query log should be used for monitoring and checked periodically for possible problems. The rate at which the log fills up also gives an indication of how quickly your database is slowing down over time, and how much time you have left before you need to optimize.
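For reference, enabling the log might look like this in my.cnf - the variable names are from the MySQL 5.1 manual, and the file path is only an example:

    # In the [mysqld] section of my.cnf (MySQL 5.1 syntax)
    slow_query_log                = 1
    slow_query_log_file           = /var/log/mysql/mysql-slow.log
    long_query_time               = 0.5   # fractional seconds supported as of 5.1.21
    log_queries_not_using_indexes = 1     # also log queries that do full scans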

Optimizing performance

Suppose we found some problematic queries on slow pages. There are four basic ways to optimize query performance:

  • Rewrite the queries
  • Change indexing strategy
  • Change schema
  • Use an external cache

Examining query execution plans (EXPLAIN)

Before trying to optimize a slow query, we need to understand what makes it slow. For this purpose MySQL has a query examination tool called EXPLAIN. Add the keyword EXPLAIN at the beginning of your query to get its execution plan - the plan literally 'explains' how the database intends to execute the query. The MySQL manual has a full reference guide to the different values that appear in the plan, and you can see a full walkthrough of using EXPLAIN to optimize a query in this slideshow on SlideShare (as well as in the profiling article I linked to earlier).
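For instance (the query is just an illustration; the comment lists the standard plan columns):

    EXPLAIN SELECT id, name FROM users WHERE created > NOW() - INTERVAL 4 WEEK;
    -- Returns one row per table in the query, with columns such as:
    -- select_type, table, type (the access type), possible_keys,
    -- key (the index actually chosen), rows (estimated rows examined) and Extra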

This is a very useful tool, but like all other tools it should be used while being aware of its limitations.

Common optimizations

1. Looping queries

The most basic performance issues often will not be the fault of the database itself. One of the most common mistakes is to query in a loop without need. Most looped SELECT queries can be rewritten as a JOIN -

    $query = 'SELECT id, name FROM categories';
    $rows = $db->fetchAll($query);
    foreach ($rows as $row) {
        $query = 'SELECT id, name FROM sub_categories WHERE category_id=' . (int) $row['id'];
        $subCategories = $db->fetchAll($query);
        //...
    }

(I'm using Zend Framework syntax since we already assumed we are using a database abstraction class)

This could be rewritten as a join -

    $query = 'SELECT categories.id, categories.name,
               sub_categories.id AS subcat_id, sub_categories.name AS subcat_name
               FROM categories
               LEFT JOIN sub_categories ON categories.id = sub_categories.category_id';
    $rows = $db->fetchAll($query);
    foreach ($rows as $row) {
        // Requires a bit of additional logic to format the results, but we have one query instead of many
    }

Inserting and updating rows in a loop can have major overhead as well. Those queries are generally slower than simple SELECT queries (since indexes often need to be updated), and they affect the performance of other queries by taking table / row locks while the data is written (depending on the table engine). I wrote an article almost two years ago on multiple row operations that covers how to rewrite looped INSERT / UPDATE queries, and includes some benchmarks showing how much it improves performance.
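The gist of that technique - instead of one INSERT per loop iteration, batch the values into a single statement (table and values here are made up for illustration):

    -- Run once per row inside a loop - one round-trip and one lock cycle each time:
    INSERT INTO categories (name) VALUES ('Fruit');
    INSERT INTO categories (name) VALUES ('Vegetables');

    -- The same rows in a single multi-row INSERT - one round-trip, far less overhead:
    INSERT INTO categories (name) VALUES ('Fruit'), ('Vegetables'), ('Dairy');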

2. Picking only needed columns

It is common to see a wildcard used to pick all columns ('SELECT * FROM ...') - this, however, is not efficient. Depending on the number of participating columns and their types (especially large types such as the TEXT variants), we could be selecting much more data from the database than we actually need. The query will take longer to return, since it needs to transfer more data (from the hard disk if it doesn't hit the cache), and it will take up more memory doing so.

Picking only the needed columns is good general practice, and avoids those problems.
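For example, with a hypothetical posts table that carries a large TEXT body:

    -- Drags the full TEXT body along for every row:
    SELECT * FROM posts WHERE published = 1;

    -- Transfers only what a listing page actually needs:
    SELECT id, title, created FROM posts WHERE published = 1;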

3. Filtering rows correctly and using indexes

Our main goal is to select the smallest number of rows we need, and to do so in the fastest way possible. We want to filter rows using indexes, and in general we want to avoid full table scans unless absolutely necessary (aside from edge cases where a scan actually improves performance). The MySQL manual has some great information on optimizing the WHERE clause, and I'll dive into a bit more detail -

Filtering conditions include the WHERE, ON (for joins) and HAVING clauses. As much as possible, we want those clauses to hit indexes - unless we are selecting a very large portion of the rows, an index lookup is much faster than a full table scan. Those clauses should be used along with the LIMIT clause, if relevant, to reduce the number of rows and amount of data returned by the query. The LIMIT clause itself can enable some important optimizations if used correctly.

Since our goal is to hit indexes with our WHERE clause, an important rule is to avoid using calculations there. When the filtering condition has to be calculated for each row, the WHERE clause cannot use an index.

Example - fetching users created in the last 4 weeks:

    SELECT id, name FROM users WHERE DATE_ADD(created, INTERVAL 4 WEEK) > NOW()

Since the value of the `created` column changes from row to row, the left-hand side of the condition has to be calculated for every row. The condition can be rewritten so that the calculation involves only constants, and is therefore performed just once:

    SELECT id, name FROM users WHERE created > NOW() - INTERVAL 4 WEEK

This query can use an index on the `created` column and should perform much better.

Another less common, but very problematic, calculation is a correlated subquery in the WHERE clause.

Selecting the lowest priced fruit from several fruit types:

    SELECT type, variety, price
    FROM fruits
    WHERE price = (
        SELECT MIN(price) FROM fruits AS f WHERE f.type = fruits.type
    )

(Example taken from the excellent Xaprb article on the topic - you should read it.) This query could be rewritten as a JOIN, moving the subquery out of the WHERE clause -

    SELECT f.type, f.variety, f.price
    FROM (
        SELECT type, MIN(price) AS minprice
        FROM fruits
        GROUP BY type
    ) AS minfruits
    INNER JOIN fruits AS f ON f.type = minfruits.type AND f.price = minfruits.minprice

Ideally, MySQL would optimize both of those the same way, but since that is usually not the case, rewriting correlated subqueries as joins is preferred.

4. Indexing correctly

Whether indexing optimizations are needed depends on the EXPLAIN results we mentioned previously. If the execution plan indicates that an index is not being used, or that a non-selective index has been picked, we need to understand why and change our indexing strategy accordingly (or use index hints).

MySQL can generally use only one index per table alias in a query, so we need to plan our indexes to maximize their effectiveness. Using more indexes than necessary can have adverse effects - it slows down INSERT and UPDATE queries while taking up more memory. Some indexes can even slow down read performance, depending on their selectivity.

If we use ordering clauses such as ORDER BY and GROUP BY, our indexes should often be composite indexes (indexes covering more than one column), so that both the filtering and the ordering can use an index.
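A sketch with hypothetical table and column names - one composite index serves both the filter and the ordering:

    -- The query filters on category_id and orders by created:
    SELECT id, title FROM posts WHERE category_id = 5 ORDER BY created DESC;

    -- A composite index lets MySQL both find the rows and read them
    -- in sorted order, avoiding a separate filesort:
    ALTER TABLE posts ADD INDEX idx_category_created (category_id, created);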

5. Picking the right engine for your data

MySQL has a pluggable engine design, which allows you to use different engine types to store your data, each with its own advantages and drawbacks. The two main engines are MyISAM and InnoDB, and the differences between them affect much more than just performance - InnoDB is an ACID-compliant transactional engine, while MyISAM sacrifices some integrity and consistency features for simplicity. Having said that, MyISAM is not necessarily faster - only in some cases.

I use InnoDB for all of my tables unless I need a full-text index (a MyISAM-only feature), since my tables usually have a lot of write activity. InnoDB uses row-level locks, which really helps performance for such workloads, while MyISAM uses table locks for write operations. InnoDB also treats indexes differently than MyISAM - it uses clustered indexes, so picking the right primary key is critical.
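As a sketch of what that means in practice (a hypothetical table; the point is the short, monotonically increasing primary key that InnoDB clusters rows on):

    -- InnoDB stores rows physically ordered by the primary key, and every
    -- secondary index carries a copy of it - so keep the key short and increasing
    CREATE TABLE messages (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id  INT UNSIGNED NOT NULL,
        body     TEXT,
        created  DATETIME NOT NULL,
        PRIMARY KEY (id),
        KEY idx_user (user_id)
    ) ENGINE=InnoDB;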

Caching

If our optimizations do not yield sufficient performance benefits (whether for technical reasons or due to our own skill level), caching is a viable strategy for reducing database load. You should always try to optimize the database itself first, since caching adds another layer of complexity to the application.

MySQL has an internal query cache that caches the results of frequently run queries if they meet certain requirements. If our queries are cached by MySQL (this can be verified by running the queries several times), there is no need to cache them ourselves - we just need to be aware that a MySQL service restart can cause a noticeable slowdown while the cache is being primed again. You can read more about the dangers of the internal cache on the excellent MySQL Performance Blog (a must-read for any serious MySQL user).

There are many caching strategies, and that is a topic for another post. Common options include caching to disk (files) or caching to memory (using solutions such as memcache or APC). Another form of caching is to the database itself - de-normalizing the schema to store the results of expensive-to-run queries.
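A minimal cache-aside sketch using the pecl/memcache extension - the key name, TTL and query are arbitrary choices for illustration:

    // Check the cache first; fall back to the database and prime the cache on a miss
    $memcache = new Memcache();
    $memcache->connect('localhost', 11211);          // example server and port

    $key  = 'categories_all';                        // arbitrary cache key
    $rows = $memcache->get($key);
    if ($rows === false) {                           // miss (note: a stored false also looks like a miss)
        $rows = $db->fetchAll('SELECT id, name FROM categories');
        $memcache->set($key, $rows, 0, 300);         // cache for 300 seconds
    }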

Server tuning and beyond

Everything covered here is just the tip of the iceberg - it gets rapidly more advanced as you dig deeper, including tuning MySQL server variables (and you should - at least the basics), tuning the server itself (hardware / software), and using related tools such as Sphinx and Lucene to offload some of the work. I tried to give a good starting point, with as many references as possible, for getting your database into shape.

I linked to several excellent resources in this article, such as the MySQL manual, the MySQL Performance Blog and Xaprb (the last two of Percona fame - world-class experts on MySQL). I suggest you start visiting those regularly, as they offer excellent advice.
