Generating graphs from MySQL table data

Graphs and charts are a useful visual way to view historical data – they make it easier to detect trends and get a big-picture view of data. All we need is timestamped data – table rows that are stamped with a specific date/time format, that can be used to group rows into time periods.

Time stamped data

In order to aggregate table data by time periods / ranges, we need a date/time column in the table we want to analyze. Appropriate types for such a column include TIMESTAMP, DATETIME and DATE, but we can also use string / numeric types for grouping data together if they contain some sort of date/time information – though those will be much less flexible than native date/time types.

I usually opt to go with the TIMESTAMP format, for a couple of reasons –

  • TIMESTAMP data is stored internally as UTC timezone – this makes such data portable between servers without needing to adjust for timezones.
  • TIMESTAMP type have auto-initialization and updating features – using the CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP values. As of MySQL ver. 5.6.5 you can use those properties with DATETIME as well, but since most servers and linux distributions use lower (considered more stable) versions, it’s usually not an option.

The TIMESTAMP format is limited by its date range (1970-01-01 to 2038-01-19), so you might need to use DATETIME if you’re using dates far in the future or past.

Grouping time-stamped rows into time periods

Assuming we have time-stamped data in our table, we can now start aggregating it into specific time periods. The most commonly used time periods for grouping are:

  • Day
  • Week
  • Month
  • Year

Lets take an example table schema and see how we group the data for each period.

- (int) `id`
- (varchar) `name`
- (smallint) `role`
- (timestamp) `created`

A very simple users table. Our time-stamped column is obviously `created`, and we’ll be using it to group our data.

Daily data

Our first task is to extract daily new users data from the table. We’ll do this by extracting the date portion of the timestamp using the DATE() function.

    DATE(`users`.`created`) AS `date`,
    COUNT(`users`.`id`) AS `count`
FROM `users`
WHERE `users`.`created` BETWEEN '2012-01-01 00:00:00' AND '2012-01-31 23:59:59'
GROUP BY `date`
ORDER BY `date`

This returns data that looks like the following:

date count
2012-01-01 12
2012-01-02 12
2012-01-03 36
2012-01-04 34
2012-01-05 36
2012-01-06 29

Note that I limited the query to a time range of 1 month (January 1st to January 31st). I used this particular format for a couple of reasons:

  • It allows MySQL to use an index for filtering the result. Had I used the DATE() function in the WHERE clause ( DATE(`users`.`created`) BETWEEN ... ), an index could not have been used resulting in degraded performance depending on the number of rows in the table.
  • In order to include all the data in the edge dates, I specified the time as well – from the beginning of the first day to the end of the last day.

Weekly data

This time we want to group data by week, while still preserving the year in case our range spans more than one year in dates (for example December – January). There are a couple of ways to do this –

  • DATE_FORMAT(`users`.`created`,'%X-%V') – returns data in the format 2012-23 (the second part is the week number)
  • YEARWEEK(`users`.`created`) – returns data in the format 201223 (last two digits are the week number).

I recommend using YEARWEEK() as it is faster than DATE_FORMAT() which makes a difference if you have a large table. Both methods can accommodate weeks  starting on Sunday or Monday – check the MySQL documentation for the details.

Monthly data

This time around we do not have any native functions that extract both the month and year (though we could extract those separately and concatenate). We use formatting functions instead:

  • DATE_FORMAT(`users`.`created`,'%Y-%m') – returns data in the format 2012-02 (year-month)
  • EXTRACT(YEAR_MONTH FROM `users`.`created`) – returns data in the format 201202 (last 2 digits are the month)

Again, DATE_FORMAT() is the slower of the two, so I’d recommend using EXTRACT() for better performance.

Yearly data

For yearly data we do have a native function – YEAR() – as well as other formatting methods we used earlier. Our options are:

  • YEAR(`users`.`created`)
  • EXTRACT(YEAR FORM `users`.`created`)
  • DATE_FORMAT(`users`.`created`)

All three return the same format – a 4 digit year. YEAR() and EXTRACT() have roughly the same performance, while DATE_FORMAT() is the slowest by a significant margin.

Aggregate functions and advanced usage

So far we’ve dealt with the very simple case of counting rows from a single table. Real world use-cases often involve multiple tables and extracting other types of aggregated data. MySQL gives us many functions to extract aggregate data – including COUNT(), SUM(), AVG() and others.

There are a few common requirements and pitfalls that are not completely obvious in implementation that bear mentioning.

Counting distinct values

When we use the COUNT() function, we typically want to get the number of different rows for each period. This works well for single table queries, but when we start joining tables in our queries we might get surprising results. The reason for that is that one-to-many relationships cause rows to appear multiple times in the result, and thus be counted multiple times.

In that case, we want to use the DISTINCT operator before the column we are counting – COUNT(DISTINCT `users`.`created`).

This is also true if we want to get distinct values when counting a non-unique column. If we wanted to get the number of different roles in our example schema from the previous section, we would use COUNT(DISTINCT `users`.`role`). Using a COUNT() on `role` without the DISTINCT operator would have given the number of rows instead.

Counting occurrences of specific values

The COUNT() function gives us a simple solution to count either distinct values or the number of rows. Sometimes though, we would like to count the occurrences of specific values. Suppose the `role` column in the example users table from the previous section can contain the following values:

  • 1 – Regular user
  • 2 – Author
  • 3 – Moderator
  • 4 – Administrator

We want to know how many users have specific roles. MySQL does not provide a cookie cutter function for this scenario, so we need to combine some logic statements with the functions we do have. We’ll use the SUM() function on a CASE statement for each specific role:

    DATE(`users`.`created`) AS `date`,
    SUM(CASE WHEN `users`.`role` = 1 THEN 1 ELSE 0 END) AS `Regular users`,
    SUM(CASE WHEN `users`.`role` = 2 THEN 1 ELSE 0 END) AS `Authors`,
    SUM(CASE WHEN `users`.`role` = 3 THEN 1 ELSE 0 END) AS `Moderators`,
    SUM(CASE WHEN `users`.`role` = 4 THEN 1 ELSE 0 END) AS `Administrator`
FROM `users`
WHERE `users`.`created` BETWEEN '2012-01-01 00:00:00' AND '2012-01-31 23:59:59'
GROUP BY `date`
ORDER BY `date`

This will generate results that look the following:

date Regular users Authors Moderators Administrator
2012-01-01 102 12 2 1
2012-01-02 87 8 1 0
2012-01-03 104 4 3 1
2012-01-04 150 19 5 2

Joining with accurate data aggregation

I’ve mentioned the DISTINCT operator as a method to COUNT() distinct values when joining multiple tables. This does not work for other aggregate functions, such as SUM() and AVG(), since the same value may occur more than once – however, we still would like to avoid the same row appearing more than once and ruining the accuracy of our aggregate data.

For this purpose we can use derived subqueries, which are subqueries appearing in the FROM clause of the query. Those are not the same as  correlated subqueries, which appear in the WHERE clause, and are often a source for low query performance. I’ve previously shown this technique in detail in a previous article on fetching specific rows from a group, even though the purpose was a bit different.

Formatting and displaying data in graphs and charts

So far we’ve discussed fetching the data for usage in our graphs, but that’s only the beginning. Now we need to format and process this data using a server-side language (I use PHP), and prepare it for outputting using a graphing / charting solution in Javascript or Flash. My favorite charting solution is a neat Javascript library called flot, which is very capable and customizable.

I will go over those parts in detail in future articles, and in the meantime you can check a component I developed that combines all three aspects into a Google Analytics style charting package for those who rather skip the grunt work and just generate graphs right away.

If you have any questions about this article, I’d love to hear about it in the comments!

To know when the next article is published, please subscribe to new articles using your Email below or follow me on Twitter.

Subscribe to Blog via Email

Enter your email address to receive notification about new posts.