Graphs and charts are a useful visual way to view historical data - they make it easier to detect trends and get a big-picture view of data. All we need is timestamped data - table rows that are stamped with a specific date/time format, that can be used to group rows into time periods.
Time stamped data
In order to aggregate table data by time periods / ranges, we need a date/time column in the table we want to analyze. Appropriate types for such a column include
DATE, but we can also use string / numeric types for grouping data together if they contain some sort of date/time information - though those will be much less flexible than native date/time types.
I usually opt to go with the
TIMESTAMP format, for a couple of reasons -
TIMESTAMPdata is stored internally as UTC timezone - this makes such data portable between servers without needing to adjust for timezones.
TIMESTAMPtype have auto-initialization and updating features - using the
ON UPDATE CURRENT_TIMESTAMPvalues. As of MySQL ver. 5.6.5 you can use those properties with
DATETIMEas well, but since most servers and linux distributions use lower (considered more stable) versions, it's usually not an option.
TIMESTAMP format is limited by its date range (1970-01-01 to 2038-01-19), so you might need to use DATETIME if you're using dates far in the future or past.
Grouping time-stamped rows into time periods
Assuming we have time-stamped data in our table, we can now start aggregating it into specific time periods. The most commonly used time periods for grouping are:
Lets take an example table schema and see how we group the data for each period.
users - (int) `id` - (varchar) `name` - (smallint) `role` - (timestamp) `created`
A very simple users table. Our time-stamped column is obviously `created`, and we'll be using it to group our data.
Our first task is to extract daily new users data from the table. We'll do this by extracting the date portion of the timestamp using the DATE() function.
SELECT DATE(`users`.`created`) AS `date`, COUNT(`users`.`id`) AS `count` FROM `users` WHERE `users`.`created` BETWEEN '2012-01-01 00:00:00' AND '2012-01-31 23:59:59' GROUP BY `date` ORDER BY `date`
This returns data that looks like the following:
Note that I limited the query to a time range of 1 month (January 1st to January 31st). I used this particular format for a couple of reasons:
- It allows MySQL to use an index for filtering the result. Had I used the
DATE()function in the WHERE clause (
DATE(`users`.`created`) BETWEEN ...), an index could not have been used resulting in degraded performance depending on the number of rows in the table.
- In order to include all the data in the edge dates, I specified the time as well - from the beginning of the first day to the end of the last day.
This time we want to group data by week, while still preserving the year in case our range spans more than one year in dates (for example December - January). There are a couple of ways to do this -
DATE_FORMAT(`users`.`created`,'%X-%V')- returns data in the format 2012-23 (the second part is the week number)
YEARWEEK(`users`.`created`)- returns data in the format 201223 (last two digits are the week number).
I recommend using
YEARWEEK() as it is faster than
DATE_FORMAT() which makes a difference if you have a large table. Both methods can accommodate weeks starting on Sunday or Monday - check the MySQL documentation for the details.
This time around we do not have any native functions that extract both the month and year (though we could extract those separately and concatenate). We use formatting functions instead:
DATE_FORMAT(`users`.`created`,'%Y-%m')- returns data in the format 2012-02 (year-month)
EXTRACT(YEAR_MONTH FROM `users`.`created`)- returns data in the format 201202 (last 2 digits are the month)
DATE_FORMAT() is the slower of the two, so I'd recommend using
EXTRACT() for better performance.
For yearly data we do have a native function -
YEAR() - as well as other formatting methods we used earlier. Our options are:
EXTRACT(YEAR FORM `users`.`created`)
All three return the same format - a 4 digit year.
EXTRACT() have roughly the same performance, while
DATE_FORMAT() is the slowest by a significant margin.
Aggregate functions and advanced usage
So far we've dealt with the very simple case of counting rows from a single table. Real world use-cases often involve multiple tables and extracting other types of aggregated data. MySQL gives us many functions to extract aggregate data - including
AVG() and others.
There are a few common requirements and pitfalls that are not completely obvious in implementation that bear mentioning.
Counting distinct values
When we use the
COUNT() function, we typically want to get the number of different rows for each period. This works well for single table queries, but when we start joining tables in our queries we might get surprising results. The reason for that is that one-to-many relationships cause rows to appear multiple times in the result, and thus be counted multiple times.
In that case, we want to use the
DISTINCT operator before the column we are counting -
This is also true if we want to get distinct values when counting a non-unique column. If we wanted to get the number of different roles in our example schema from the previous section, we would use
COUNT(DISTINCT `users`.`role`). Using a
COUNT() on `role` without the DISTINCT operator would have given the number of rows instead.
Counting occurrences of specific values
COUNT() function gives us a simple solution to count either distinct values or the number of rows. Sometimes though, we would like to count the occurrences of specific values. Suppose the `role` column in the example users table from the previous section can contain the following values:
- 1 - Regular user
- 2 - Author
- 3 - Moderator
- 4 - Administrator
We want to know how many users have specific roles. MySQL does not provide a cookie cutter function for this scenario, so we need to combine some logic statements with the functions we do have. We'll use the
SUM() function on a CASE statement for each specific role:
SELECT DATE(`users`.`created`) AS `date`, SUM(CASE WHEN `users`.`role` = 1 THEN 1 ELSE 0 END) AS `Regular users`, SUM(CASE WHEN `users`.`role` = 2 THEN 1 ELSE 0 END) AS `Authors`, SUM(CASE WHEN `users`.`role` = 3 THEN 1 ELSE 0 END) AS `Moderators`, SUM(CASE WHEN `users`.`role` = 4 THEN 1 ELSE 0 END) AS `Administrator` FROM `users` WHERE `users`.`created` BETWEEN '2012-01-01 00:00:00' AND '2012-01-31 23:59:59' GROUP BY `date` ORDER BY `date`
This will generate results that look the following:
Joining with accurate data aggregation
I've mentioned the
DISTINCT operator as a method to
COUNT() distinct values when joining multiple tables. This does not work for other aggregate functions, such as
AVG(), since the same value may occur more than once - however, we still would like to avoid the same row appearing more than once and ruining the accuracy of our aggregate data.
For this purpose we can use derived subqueries, which are subqueries appearing in the FROM clause of the query. Those are not the same as correlated subqueries, which appear in the WHERE clause, and are often a source for low query performance. I've previously shown this technique in detail in a previous article on fetching specific rows from a group, even though the purpose was a bit different.
Formatting and displaying data in graphs and charts
I will go over those parts in detail in future articles, and in the meantime you can check a component I developed that combines all three aspects into a Google Analytics style charting package for those who rather skip the grunt work and just generate graphs right away.
If you have any questions about this article, I'd love to hear about it in the comments!