Introduction
When considering the architecture of your database, you must be wary of allowing your tables to contain more columns than necessary to accomplish the majority of queries associated with that table. You might have heard of normalization, which is the process of organizing your tables in a way that minimizes both redundancy and data manipulation anomalies (insert, delete, update). While this process usually results in splitting tables, the reasons we perform a vertical partition differ from those of partitioning. Vertical Partitioning in our context will be defined as breaking apart the columns of high cardinality (number of columns) tables into distinct smaller tables based on frequency of usage of each column or set of columns. While some definitions require that the tables be in Boyce Codd Normal Form [1, p.584], in practice, this might not always be optimal. This depends on your own situation and what works best for you.
A Deeper Look At the Internals: Disk Storage and Memory Access
In many cases, you will not always be able to utilize a key of the table to perform a query. This is when a full table scan is performed. A full table scan is when you access all of the data sequentially to operate. In some cases, this is actually better than using a key. When a row is matched using a key a seek is performed to retrieve that row, if a large amount of rows are matched then you may end up performing more seeks than you a would with a full table scan. [2][3]
Memory Blocks
For a full table scan, when the database wants to find matching rows it must first pull all the rows sequentially into memory and perform its work. The DBMS can only pull in finite amount of data into memory to work on it. These chunks of finite data that the DBMS reads in known as a Memory Block. For instance, InnoDB uses a 16Kb page size [4] which means that it will pull records into memory 16kb at a time to perform work on it. Think of a page of memory as a “window” into a segment of memory on your system that is used to pull data and work with it. We can now determine how many records can be pulled into each block of memory by determining (1) the record length (size of an individual row) (2) the block size. [1, p.503]

Let’s look at a simple example:
- Suppose a record length (determined by the sum of the sizes of each column in the row) is 1.5kb.
- If we are using InnoDB the page size is 16kb.
- Table size is 1500kb
The number of records per block is: floor(block size / record length) = floor(16 / 1.5) = 10 records per block. This is known as the Blocking Factor. [1, p.503]
To read the whole table into memory we determine this by: ceil(table size / blocking factor) = ceil(1500 / 10) = 150 block reads. [1, p.503]
The time a block read takes depends on the seek time of the disk. However, we can intuitively determine that the less number of blocks needed to read into memory, the less the number of seeks to perform and thus faster performance. This also applies when scanning with keys, so for this context we will not differentiate between using or not using a key, that is reserved for a different discussion. Also, whenever you run a query a certain number of times the contents of that query get “warmed up” into the OS’s cache so consequent executions of the query will not cause a disk seek [5]. However, this does not mean that memory block reads are not performed anymore, the number of block reads remain constant, however their speed is greatly improved from accessing the RAM or other caches on the system instead of going to the hard drive. This means that while the block read is faster [6], it will still slow down as your data scales.
This is the hand-waiving explanation as to why reducing the number of columns to that only those needed most for the highest number of queries is important. Now let’s look at some real numbers to see the difference.
A Practical Example of Vertical Partitioning
For these query timing tests we will have the query cache turned off and will not be using keys to optimize any queries. Also all queries are “warmed up” as that was the best way for me to get consistent measurements.
Let’s suppose we are running a social networking site and we have a “user” table that contains many attributes of that user, as portrayed here (I am using MySQL Workbench to model the data):

This table is not, in fact, overly-large. However, there is room for improvement. As you can imagine, many queries will be executed that do not need to consider all of these columns, and as described earlier means more block processing and the extra I/O will slow us down. Because I wanted the data to model very closely what a real-world data set might look like (instead of random data) I wrote a script that creates random suitable values for each column. Here is an example of a row in the user table in CSV format:
1, Jeff, Dara, Dara1265430074695@example.com, c2e3103d28633677e173839977ce47c2, 2010-01-08 01:44:18, 0, NULL, NULL, NULL, NULL, NULL, 0, 0, 1, NULL, NULL
2, Noah, Kit, Kit1265430074768@example.com, 6c6a9ac3d5ea78d67f0a9c2364a4bf61, 2010-01-06 10:22:14, 0, Pittsburgh, NY, 60336, 5554206162, Kit1265430074326@example.com, 1, 0, 0, Poway, 2009-12-29 17:53:30
Because some of the fields were nullable I made 1 in 5 records use NULL for those records to better simulate a variety in the data input. Also the city/state names were completely randomized, so the table ended up having incorrect city/state pairings but for the purposes of this test that wasn’t really important.
Here are the queries that we will be running against the database:
Query’s Goal: A user attempts to log in, we check the user-entered password/email with our existing records. As a side note: In a real environment, for security purposes, you should never save your passwords in plain text, therefore, we are saving all of our passwords as the MD5 of their password.
SELECT `id` , `first_name` , `last_name` FROM `user` WHERE `email` = 'Dara1265430074695@example.com' AND `password` = 'c2e3103d28633677e173839977ce47c2';
Query’s Goal: We want to send out an email blast to all of our Tucson users that haven’t logged in since last year while honoring their email preferences, perhaps we want to remind them of our application and to get back on.
SELECT `id` , `first_name` , `last_name` , `email` FROM `user` WHERE `city` = 'Tucson' AND `receive_digest` = 1 AND YEAR( `last_login` ) = 2009;
Query’s Goal: We want to graph a trend of our performance history, monitoring the number of new signups we are getting on a monthly basis. Hopefully the numbers are looking good for our performance review.
SELECT COUNT( 1 ) AS count, MONTH( `date_registered` ) , YEAR( `last_login` ) FROM `user` GROUP BY MONTH( `date_registered` ) , YEAR( `date_registered` )
Most of these queries only use a subset of the columns stored in the table. I created 3 separate versions of this same table with 1000, 10,000 and 100,000 records each with populated data. Before I show the times for these queries I want to first introduce the “split” version of this table, then give a side by side comparison of the performance difference. Here is the proposed design change:

The large user table has been split into different tables in order to not only accommodate our common queries but in a structure that makes good logical sense. The rest of the development will also appreciate the setup. Keep in mind we are not normalizing our table as mentioned earlier, we did not address our functional dependencies, the “user_address” table is still in second normal form as was the original user table. However, it is acceptable in some situations to intentionally leave the table in a lower-order form for performance reasons, this is called denormalization.
Now that we have seen the proposed design change, let’s see now how this changed out performance:



Conclusion
Having too many columns can bloat your record size, which in turn results in more memory blocks being read in and out of memory causing higher I/O. This can hurt performance. One way to combat this is to split your tables into smaller more independent tables with smaller cardinalities than the original. This should now allow for a better Blocking Factor (as defined above) which means less I/O and faster performance. This process of breaking apart the table like this is a called a Vertical Partition. There is a catch, however, as you might have noticed on the second chart, the performance actually dropped after breaking apart the tables. This happened because of all the expensive JOIN operations that were required to get the desired information. Keep in mind that if you are to break apart your tables, make sure you try to minimize the number of JOIN operations you will need to end up doing to get the same data set back that you used to. In our example we can justify the performance hit by claiming that the overall benefit of having the split table is better than having them together, and that the query used to join all those tables doesn’t occur frequently. This catch reminds us that there are no black and white solutions when it comes to database design and optimization, you always have to do what is best for your situation.
REFERENCES
[1] – Elmasri and Navathe. Fundamentals of Database Systems. Boston: 2007.
[2] – MySQL Documentation – 7.4.4. How MySQL Uses Indexes – http://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
[3] – IBM solidDB Sql Guide, Full table scan – http://publib.boulder.ibm.com/infocenter/soliddb/v6r3/index.jsp?topic=/com.ibm.swg.im.soliddb.sql.doc/doc/full.table.scan.html
[4] – InnoDB Plugin 1.0 for MySQL 5.1 (Early Adopter Release) User’s Guide – http://www.innodb.com/doc/innodb_plugin-1.0/
[5] – MySQL Documentation – 7.4.6. The InnoDB Buffer Pool – http://dev.mysql.com/doc/refman/5.0/en/innodb-buffer-pool.html
[6] – MySQL Documentation – A Practical Look at the MySQL Query Cache – http://dev.mysql.com/tech-resources/articles/mysql-query-cache.html
