Choosing InnoDB primary key columns
In the chapter introduction we promised to shed some light on how to choose your InnoDB
primary key columns sensibly. Be advised that choosing good primary key columns is not
an exact science—there are multiple aspects that influence this decision. Depending on
your needs and preconditions you will want to prioritize them differently from one table to
the next. Consider the following as general advice rather than hard rules that must be
obeyed unconditionally.
Getting ready
In order to make reasonable decisions on primary key columns, it is important to have a
very clear understanding of what the data looks like in the table at hand. If you already have
existing data that is to be stored in an InnoDB table—for example in MyISAM format—it can be
helpful to compare it with the criteria below.
If you are planning a new schema, you might have to guess about some characteristics of the
future data. As is often the case, the quality of your choices is directly proportional to how
good those guesses are.
This recipe is less a strict set of step-by-step instructions to be followed from top to bottom
than a list of properties a good primary key should have, even though
you might decide some of them do not apply to your actual environment. As a rule of thumb,
however, a column that fulfills all or most of the attributes described below is most probably
a sensible choice for a primary key. See the How it works… section for details on the
individual items.
How to do it…
- Identify unique attributes: This is an absolute (technical) requirement for primary keys in general. Any data attribute that is not strictly guaranteed to be free of duplicates cannot be used alone as a primary key.
- Identify immutable attributes: While not absolutely necessary, a good primary key is never changed once it has been assigned. For all intents and purposes, you should avoid columns that have even a small chance of being changed for existing records.
- Use reasonably short keys: This is the "softest" criterion of all. In general, longer keys have negative impacts on overall database performance—the longer the worse. Also consider a prefix primary key. See Using prefix primary keys earlier in this chapter for more information.
- Prefer single-column keys: Even though nothing prevents you from choosing a composite primary key (a combination of columns that together form the uniqueness), this can easily become a hassle to work with, especially when handling foreign keys.
- Consider the clustered index nature of the primary key: As InnoDB's primary key is also clustered, you should take this special nature into account as well. It can speed up read access a lot if you often have to query for key ranges, because disk seek times will be minimized. A consolidated sketch follows this list.
How it works…
In the following sections, we will try to shed some light on what each step of the recipe is
concerned with in a little more detail.
Uniqueness
An absolute requirement for primary keys is their uniqueness. Every record in your table
must have a distinct value for its primary key. Otherwise, neither MySQL nor any other
database product for that matter could be sure about whether it was operating on exactly
the right rows when executing your queries.
Usually, most entities you might want to store in a relational database have some sort of
unique characteristic that might make a suitable primary key. If they do not, you can always
assign a so-called surrogate key for each record. Often this is some sort of unique numeric
value, either generated by an application working on top of the database or MySQL itself using
an AUTO_INCREMENT column.
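For illustration, a minimal surrogate-key setup might look like this (the table and column names are hypothetical); MySQL generates the key values itself, and an application can retrieve the value just assigned with LAST_INSERT_ID():

CREATE TABLE people (
  person_id INT UNSIGNED NOT NULL AUTO_INCREMENT, -- surrogate key, generated by MySQL
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (person_id)
) ENGINE=InnoDB;
INSERT INTO people (name) VALUES ('Alice Example');
SELECT LAST_INSERT_ID(); -- the key value MySQL just assigned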
Immutability
Primary key columns should generally be (virtually) immutable, that is, under no circumstances
should you have to modify their values, once they are inserted into the database.
In our books example, the ISBN cannot be changed once a book has been published.
The same would apply for a car’s chassis number.
Technically, of course, they can be changed after their creation. However, this will be very
difficult to perform in practice, once the original value has been used to establish foreign key
relationships between tables. In these cases, you will often have to revert to complicated and
even unsafe methods (risking data inconsistencies) to perform the changes.
Moreover, as the primary key is stored as a clustered key in InnoDB, changing its value will
require the whole record—including all columns—to be moved to its new location on disk,
causing additional disk I/O.
Note that sometimes columns that may at first seem constant over time really are not. For
example, consider a person’s social security number. It is designed to be unique and can
never change or be reassigned to a different human being. Consequently, it would seem
like a good choice for the primary key in a table of people.
But consider that in most cases data will be entered into the database manually—be it
through forms, text file imports, or other means. In some form or another, someone will have
typed it in on a keyboard.
Manual input is by definition an error-prone process. So you might end up with a person's
record that has two digits transposed in their primary key social security number without
immediately knowing it. Gradually, this wrong value will spread through your database—it will
be used in foreign key relationships, forming complex data structures. When you later find out
about the error—for example, because another person who really owns that number needs to
be inserted—then you are facing a real problem.
Unless you are absolutely and positively sure a value can never change once it has been
assigned to a record, you should consider adding a new column to your table and use a
surrogate key, for example, an auto-incrementing number.
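A sketch of such a change, assuming a hypothetical people table currently keyed on its ssn column (any foreign keys still referencing ssn would have to be dropped or adjusted first):

ALTER TABLE people
  DROP PRIMARY KEY,
  ADD COLUMN person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ADD UNIQUE KEY uq_ssn (ssn); -- keep enforcing uniqueness of the natural value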
Key length
There are several reasons for keeping keys as short as possible. InnoDB uses a single
large heap of memory—the buffer pool—for its caching purposes. It is used for both row
and index data, which are stored as memory cached copies of individual pages straight from
the tablespace data files. The shorter each key value is, the more of them fit into a single
data page (the default size is 16 KB). For an index with 16 bytes per index value, a single
page will contain about a thousand index entries. For an index with only 8 bytes per entry,
twice as many values can be cached in the same amount of space. So to utilize the effects
of memory-based caching, smaller indexes are better.
For the data record as a whole there might not be much of a difference between 8 or 16 bytes
compared with the overall record length. But remember (or refer to the chapter introduction
if you don’t) that the primary key length is added to each secondary index’s length again. For
example, a secondary index on an 8 byte field will actually be 16 bytes long if the primary key
also has 8 bytes per entry. A 16 KB data page would provide space for roughly 1,000 index
entries in this scenario. If the primary key is 16 bytes long, it would only be sufficient for about
680 entries, reducing the effectiveness of cache memory.
Single column keys
Depending on the data you intend to store in an InnoDB table, you might consider using a
composite primary key. This means that no single column's value alone uniquely identifies a
record; only the combination of several independent columns guarantees uniqueness.
From a technical point of view, this is perfectly feasible and might even be a good choice from
a semantic point of view.
However, you should very carefully weigh the alternatives because composite keys can quickly
become a burden. The more secondary tables define foreign key relationships with a table
using a composite primary key, the more complicated your queries become, because every
join has to repeat a condition for each of the composite key's columns. For more complex
queries with multiple joins, this quickly becomes very hard to maintain and therefore carries
a great risk of errors that might be hard to find.
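To see the difference, compare a join on a hypothetical two-column composite key with the same join using a surrogate key:

-- Composite key (order_no, line_no): the join must repeat every key column
SELECT s.* FROM shipments s
JOIN order_lines ol ON s.order_no = ol.order_no AND s.line_no = ol.line_no;

-- Surrogate key: a single join condition suffices
SELECT s.* FROM shipments s
JOIN order_lines ol ON s.order_line_id = ol.order_line_id;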
In addition, you have to consider the increased size of each key entry, because the sizes of
all the participating columns add up.
As general advice, you should definitely consider using a surrogate key when you cannot find
any candidate that fulfills the other criteria just discussed.
Clustered index
As data is physically stored on disk in the order of the clustered key, similar key values end up
in neighboring locations. This makes clustered indexes very efficient for queries that retrieve
ranges of records by this key. If, for example, the clustered key is a timestamp of some sort,
retrieving all records within a contiguous timespan is likely to require relatively little physical
disk I/O because ideally all requested result rows are stored in the same data page, therefore
only needing a single read operation (which might even be cached). Even if multiple pages
had to be read, this will only require a sequential read operation, which leverages linear disk
read performance best.
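As an example, assuming a hypothetical access_log table whose primary key begins with a timestamp column, a range query like the following profits directly from the clustered layout:

-- Matching rows sit in the same or adjacent data pages,
-- so mostly sequential reads are required.
SELECT * FROM access_log
WHERE log_time BETWEEN '2009-06-01 00:00:00' AND '2009-06-01 23:59:59';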
Unfortunately, InnoDB does not allow a non-primary key to be clustered—other DBMS do—so
you have to weigh the alternatives and maybe live with a compromise when deciding on the
primary key for your InnoDB tables.
Speeding up searches for (sub)domains
In a column storing e-mail addresses, searching for all addresses of a given domain is a
non-trivial task performance-wise. Given the following table structure, the only way to find all
addresses @gmail.com is to use a LIKE query with a leading wildcard:
SELECT * FROM clients WHERE email LIKE '%@gmail.com';
Of course, storing the address and domain parts in separate columns would solve this
particular problem. But as soon as you were asked for a quick way to find all clients with an
e-mail address from a British provider, you would be out of luck again, resorting to:
SELECT * FROM clients WHERE maildomain LIKE '%.co.uk';
Both queries would cause a full table scan because no index can support the wildcard at the
beginning of the search term.
In this recipe, you will be given a simple but effective approach to enable the use of indexes
for both of the problems just presented. Notice that you will need to make minor adjustments
to the queries sent against the database. This might involve some code adjustments in
your application.
Getting ready
To implement the ideas presented here, you will need to connect to the database server
with your administrative account for the schema modifications. Apart from the database
modifications, application program code changes will be necessary as well and you should
contact an application developer.
How to do it…
- Identify which column is currently used to store domain-related data. In the example, we will be using the maildomain column of the clients table.
- Update this column, reversing the contents of the field like this:
  UPDATE clients SET maildomain=REVERSE(maildomain);
- If not already set up, add an index to the column:
  ALTER TABLE clients ADD INDEX IDXR_MAILDOMAIN(maildomain);
- Change all queries in your application as follows:
  Before:
  SELECT name, maildomain FROM clients WHERE maildomain LIKE '%.co.uk';
  After:
  SELECT name, REVERSE(maildomain) AS maildomain FROM clients WHERE maildomain LIKE REVERSE('%.co.uk');
The point here is to reverse the search condition as well as the column in the column
list. SELECT statements using the star placeholder instead of column names need to
be rewritten to reverse the maildomain column.
How it works…
Indexes are designed to speed up queries by sorting the relevant column contents, which
makes finding records with a given search prefix easy.
Searching for all people whose name starts with an “S”, for example, is supported by this
technique. The more characters you provide, the more specific the search gets, again ideally
supported by an index.
Domain names are a different story, however, because those belonging together do not share
a common prefix but suffix. There is no immediate way of telling MySQL to create an index
supporting this kind of data.
The first idea that comes to mind to work around this would be to use a query along the
lines of:
SELECT * FROM clients
WHERE REVERSE(maildomain) LIKE 'ku.oc.%';
Unfortunately, MySQL—in contrast to other DBMS—can neither use indexes in conjunction with
functions like REVERSE() nor create an index based on a function in the first place. Instead,
it resorts to full-table scans to find the results as soon as it encounters a function call applied
to a column in a query’s WHERE clause. In this case, the REVERSE() function is applied to the
maildomain column.
With a minor adjustment to the way data is stored, this limitation can be alleviated, however:
store the data backwards in the first place!
When inserting new data into the table, we reverse it first:
INSERT INTO clients (maildomain, …)
VALUES (REVERSE('example.co.uk'), …);
When retrieving data later, we just need to reapply the same function to get back at the
original data:
SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE REVERSE('%.co.uk');
As now the query condition does not contain a function call on a column anymore, MySQL is
happy to use an index on the maildomain column to speed up the search.
It might seem odd at first that now even with two calls to the REVERSE() function this query
can in fact use an index.
The key point is that MySQL does not have to apply the function to any
column data, but only to the constant search term (the '%.co.uk' string)
and later—once the rows have already been fetched—to the retrieved,
reversed column content of maildomain. Neither of these interferes with
index use.
The query is really executed in two phases. In the first phase, MySQL will have a look at the
condition and check if it can replace any function calls with constants. So, when we write:
SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE REVERSE('%.co.uk');
after the first phase, internally the query looks like this:
SELECT REVERSE(maildomain) FROM clients
WHERE maildomain LIKE 'ku.oc.%';
In this query, there is no function call left in the condition. So the index on the maildomain
column can be used, speeding up the execution as desired.
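You can verify this yourself with EXPLAIN. With the index from this recipe in place, the key column of the output should name IDXR_MAILDOMAIN and the access type should be range:

EXPLAIN SELECT name, REVERSE(maildomain) AS maildomain
FROM clients
WHERE maildomain LIKE REVERSE('%.co.uk');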
There’s more…
If your application typically issues queries that need to retrieve contiguous ranges of
domains—as in the preceding example—you might consider using the reversed domain name
as primary (and therefore clustered) key.
The advantage would be that the related records would be stored closely together on disk, in
the same or adjacent data pages.
However, updating an existing table on its primary key column can be both
very time consuming, as all data rows need to be physically rearranged, and
sometimes complicated to do when foreign key constraints are in place.
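A sketch of this idea, assuming a hypothetical table in which each domain occurs only once:

CREATE TABLE maildomains (
  maildomain VARCHAR(255) NOT NULL, -- stored reversed, for example 'ku.oc.elpmaxe'
  owner VARCHAR(100),
  PRIMARY KEY (maildomain)          -- clustered: related domains are stored adjacently
) ENGINE=InnoDB;
INSERT INTO maildomains (maildomain, owner)
VALUES (REVERSE('example.co.uk'), 'Example Ltd');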
See also
- Choosing InnoDB primary key columns
Finding duplicate indexes
Over time, database schemata are subject to changes such as index additions and deletions.
It is not uncommon to end up with multiple indexes that are equivalent in terms of query
execution but might be defined with different names or even different columns.
This duplication of indexes has negative consequences for your database:
- Increased size: The more indexes there are, the bigger the database.
- Lower performance: Each index has to be updated on modifications of the respective table, wasting precious I/O and CPU resources.
- Increased schema complexity: Schema maintenance and understanding of the tables and relationships gets more complicated.
For those reasons, you should be concerned about superfluous indexes.
In this recipe, we will present a way to quickly find out which indexes can be dropped from a
table as they are functionally equivalent (if not necessarily formally identical) to another one.
Getting ready
In order to run the program presented here, you will need a Java Runtime Environment (JRE or
just Java) installed. You can download it for free from http://www.java.com.
Download the Index Analyzer for MySQL from the book’s website.
You will also need login credentials with administrative privileges for the server and the
database you want to analyze.
How to do it…
- Launch the downloaded application by double-clicking its icon. The connection window will appear.
- Enter the connection data for your MySQL server and specify the database to check. If you like, you can store these settings for later use.
- Hit the Connect button. The analysis will begin. Stand by—this might take a minute or two, depending on the number of tables, columns, and indexes in that database.
- When the analysis is complete, review the proposed changes the tool suggests. Apart from the tree-like display, you can use the Generate SQL button to copy ALTER TABLE statements to either the clipboard or a file in order to apply the suggested changes to the database.
Make sure you do not just blindly execute the proposed statements
against your database!
You must always carefully review anything that an automated tool
suggests you do to your data. No program can replace your professional
judgment about whether or not an index is obsolete or required for
some specific reason beyond the computer’s understanding.
How it works…
The Index Analyzer for MySQL tool connects to your database and retrieves information
about the indexes defined in the database you specified. It then checks for indexes that are
redundant compared with one or more of the others. It will detect the following situations:
- Two indexes are completely identical.
- One index is a prefix of a second, longer one. As MySQL can use the second one for the same queries (ignoring the superfluous columns), the shorter index is redundant.
- An index on an InnoDB table is defined so that it ends with the primary key column(s). As MySQL internally appends the primary key columns anyway, they should be removed from the explicit definition.

The tree display's root node is the database you selected, followed by the tables
with redundant indexes. For each table, one or more detail nodes describe the analysis
results in terms of which index is made obsolete by which other.
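To make the detected situations concrete, consider the following hypothetical table. The index idx_a is redundant because it is a prefix of idx_ab, and idx_a_id is redundant because InnoDB implicitly appends the primary key to every secondary index anyway:

CREATE TABLE example (
  id INT NOT NULL,
  a  INT,
  b  INT,
  PRIMARY KEY (id),
  KEY idx_a (a),        -- redundant: prefix of idx_ab
  KEY idx_ab (a, b),
  KEY idx_a_id (a, id)  -- redundant: id is appended to idx_a automatically
) ENGINE=InnoDB;

ALTER TABLE example DROP INDEX idx_a, DROP INDEX idx_a_id;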
Each node also contains a rough estimate of how much space could be saved by dropping the
redundant index. Note that this is just a ballpark figure to give you an idea. In the example
earlier, the actual savings are lower according to the statistics MySQL offers via the SHOW
TABLE STATUS command.
All tables were defined with an extra index on the emp_no column, which was made obsolete
by the primary key. Note that the difference between estimated and actual savings can be
quite significant. This is because MySQL's estimates are based on multiples of the data page size—16
KB—while the Index Analyzer application uses average column lengths.
There’s more…
Apart from the Index Analyzer for MySQL available from this book’s website, there are other
tools available for the same purpose as well. If you do not want to, or cannot, install a Java
Runtime Environment, you might be more content with Maatkit’s mk-duplicate-key-checker. It
is a free command-line tool based on Perl and can be used on a variety of platforms as well.
You can get it from http://www.maatkit.org including the full documentation.
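A typical invocation might look like the following; this is only a sketch, so check the Maatkit documentation for the exact option names supported by your version:

mk-duplicate-key-checker h=localhost,u=root --databases employees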