Creating your MySQL Database
MySQL, launched in 1995, has become the most popular open source database system. The
popularity of MySQL and phpMyAdmin has allowed many non-IT specialists to build dynamic
websites with a MySQL backend. This book is a short but complete guide showing beginners how
to design good data structures for MySQL. It teaches how to plan the data structure and how to
implement it physically using MySQL’s model.
What This Book Covers
Chapter 1 introduces the concept of MySQL, and discusses MySQL’s growing popularity and its
impact as a powerful tool. This chapter gives us a brief overview of the relational models and
Codd’s rules, which are required for designing purposes. A brief introduction to our case study —
“car dealer” is provided at the end.
Chapter 2 shows how to deal with the raw data information that comes from the users or other
sources, and the techniques that can help us build a comprehensive data collection. Also, this
chapter covers the exact limits of the analyzed system, how one should gather documents, and
interview activities for our case study.
Chapter 3 emphasises on transforming the data elements gathered in the collection process into a
cohesive set of column names. The concept of data naming is also discussed in this chapter.
Chapter 4 provides the technique of grouping column names into tables. Rules for table layout, the
concepts such as primary key, unique key, data redundancy, and data dependency are covered in
this chapter.
Chapter 5 presents various techniques for improving our data structure in terms of security,
performance, and documentation. The final data structure for the car dealer’s case study is
provided at the end.
Chapter 6 covers a supplemental case study about an airline system. This case study involves
various steps such as gathering documents, preparing preliminary list of data elements, preparing a
list of tables, sample values, and queries for the airline system..
Data Naming
In this chapter, we focus on transforming the data elements gathered in the collection
process into a cohesive set of column names. Although this chapter has sections for
the various steps we should accomplish for efficient data naming, there is no specific
order in which to apply those steps. In fact, the whole process is broken down into
steps to shed some light on each one in turn, but the actual naming process applies
all those steps at the same time. Moreover, the division between the naming and
grouping processes is somewhat artificial – you’ll see that some decisions about
naming infl uence the grouping phase, which is the subject of the next chapter.
Data Cleaning
Having gathered information elements from various sources, some cleaning work is
appropriate to improve the significance of these elements. The way each interviewee
named elements might be inconsistent; moreover, the significance of a term can vary
from person to person. Thus, a synonym detection process is in order.
Since we took note of sample values, now it is time to cross-reference our list of
elements with those sample values. Here is a practical example, using the car’s
id number.
When the decision is made to order a car – a Mitsou 2007 – the office clerk opens
a new file and assigns a sequential number dubbed car_id number to the file, for
instance, 725. At this point, no confirmation has been received from any car supplier,
so the clerk does not know the future car’s serial number – a unique number stamped
on the engine and other critical parts of the vehicle.
This car’s id number is referred to as the car_number by the office clerk. The store
assistants who register car movements use the name stock_number. But using this
car number or the stock number is not meaningful for financing and insurance
purposes; the car’s serial number is used instead for that purpose.
At this point, a consensus must be reached by convincing users about the importance
of standard terms. It must become clear to everyone that the term car_number is not
precise enough to be used, so it will be replaced by car_internal_number in the
data elements list, probably also in any user interface (UI) or report.
It can be argued that car_internal_number should be replaced by something more
appropriate; the important point here is we merged two synonyms: car_number and
stock_number, and established the difference between two elements that looked
similar but were not, eliminating a source of confusion.
Therefore we end up with the following elements:
- Car_serial_number
- Car_internal_number (former car id number and stock number)
Eventually, when dealing with data grouping, another decision will have to be taken:
to which number, serial or internal, do we associate the car’s physical key number.
Subdividing Data Elements
In this section, we try to find out if some elements should be broken into more simple
ones. The reason for doing so is that, if an element is composed of many parts,
applications will have to break it for sorting and selection purposes. Thus it’s better
to break the elements right now at the source. Recomposing it will be easier at the
application level.
Breaking the elements provides more clarity at the UI level. Therefore, at this
level we will avoid (as much as possible) the well-known last-name/first-name
inversion problem.
As an example for this problem, let’s take the buyer’s name. During the interview, we
noticed that the name is expressed in various ways on the forms:
Form | How the name is expressed |
---|---|
Delivery certificate | Mr Joe Smith |
Sales contract | Smith, Joe |
We notice that
- There is a salutation element, Mr
- The element name is too imprecise; we really have a first name and a last name
- On the sales contract, the comma after our last name should really be
excluded from the element, as it’s only a formatting character
As a result, we determine that we should sub-divide the name into the following
elements:
- Salutation
- First name
- Last name
Sometimes it’s useful to sub-divide an element, sometimes it’s not. Let’s consider
the date elements. We could sub-divide each one into year, month, and day (three
integers) but by doing so, we would lose the date calculation possibilities that
MySQL offers. Among those are, finding the week day from a date, or determining
the date that falls thirty days after a certain date. So for the date (and time), a single
column can handle it all, although at the UI level, separate entry fields should be
displayed for year, month, and day. This is to avoid any possibility of mix-up and
also because we cannot expect users to know about what MySQL accepts as a valid
date. There is a certain latitude in the range of valid values but we can take it for
granted that users have unlimited creativity, regarding how to enter invalid values.
If a single field is present on the UI, clear directions should be provided to help
with filling this field correctly.
Data Elements Containing Formatting Characters
The last case we’ll examine is the phone number. In many parts of the world, the
phone number follows a specific pattern and also uses formatting characters for
legibility. In North America, we have a regional code, an exchange number, and
phone number, for example, 418-111-2222; an extension could possibly be appended
to the phone number. However, in practice only the regional code and extension
are separated from the rest into data elements of their own. Moreover, people often
enter formatting characters like (418) 111-2222 and expect those to be output back.
So, a standard output format must be chosen, and then the correct number of
sub-elements will have to be set into the model to be able to recreate the
expected output.
Data that are Results
Even though it might seem natural to have a distinct element for the total_price of
the car, in practice this is not justified. The reason is that the total price is a computed
result. Having the total price printed on a sales contract constitutes an output. Thus,
we eliminate this information in the list of column names. For the same reason, we
could omit the tax column because it can be computed.
By removing the total price column, we could encounter a pitfall. We have to be sure
that we can reconstruct this total price from other sub-total elements, now and in the
future. This might not be possible for a number of reasons:
- The total price includes an amount located in another table, and this table
will change over time (for example, the tax rate). To avoid this problem, see
the recommendations in the Scalability over Time section in Chapter 4. - This total price contains an arbitrary value, due to some exceptional cases,
for example, where there is a special sale, and the rebate was not planned
in the system, or when the lucky buyer is the brother-in-law of the general
manager! In this case, a decision can be made: adding a new column
other_rebate.
Data as a Column’s or Table’s Name
Now is the time to uncover what is perhaps the least known of the data naming
problems: data hidden in a column’s or even a table’s name.
We had one example of this in Chapter 1. Remember the qty_2006_1 column name.
Although this is a commonly seen mistake, it’s a mistake nonetheless. We clearly
have two ideas here, the quantity and the date. Of course, to be able to use just two
columns, some work will have to be done regarding the keys – this is covered in
Chapter 4. For now, we should just use elements like quantity and date in our
elements list, avoiding representing data in a column’s name.
To find those problematic cases in our model, a possible method is to look for
numbers. Column names like address1, address2 or phone1, phone2 should
look suspicious.
Now, have a look in Chapter 2 at the data elements we got from our store assistant.
Can you find a case of data being hidden in a column name?
If you have done this exercise, you might have found many past participles hidden
into the column names, like ordered, arrived, and washed. These describe the events
that happen to a car. We could try to anticipate all possible events but it might prove
impossible. Who knows when a new column car_provided_with_big_ribbon will
be needed? Such events, if treated as distinct column names, must be addressed by
- A change in the data structure
- A change in the code (UI and reports)
To stay fl exible and avoid the wide-table syndrome, we need two tables: car_event
and event.
Here are the structure and sample values for those tables:
CREATE TABLE `event` (
`code` int(11) NOT NULL,
`description` char(40) NOT NULL,
PRIMARY KEY ('code')
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `event` VALUES (1, 'washed');
The usage of backticks here ('event'), although not standard
SQL, is a MySQL extension used to enclose and protect
identifiers. In this specific case, it could help us with
MySQL 5.1 in which the event keyword is scheduled to
become part of the language for some another purpose
(CREATE EVENT). At the time of writing, beta version
MySQL 5.1.11 accepts CREATE TABLE event, but it might
not always be true.
The following image shows sample values entered into the event table from within
the Insert sub-page of phpMyAdmin:
CREATE TABLE `car_event` (
`internal_number` int(11) NOT NULL,
`moment` datetime NOT NULL,
`event_code` int(11) NOT NULL,
PRIMARY KEY ('internal_number')
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `car_event` VALUES (412, '2006-05-20 09:58:38', 1);
Again, sample values are entered via phpMyAdmin:
Data can also hide in a table name. Let’s consider the car and truck tables. They
should probably be merged into a vehicle table, since the vehicle’s category – truck,
car, and other values like minivan is really an attribute of a particular vehicle.
We could also find another case for this table name problem: a table named
vehicle_1996.
Planning for Changes
When designing a data structure, we have to think about how to manage its growth
and the possible implications of the chosen technique.
Let’s say an unplanned car characteristic – the weight – has to be supported. The
normal way of solving this is to find the proper table and add a column. Indeed, this
is the best solution; however, someone has to alter the table’s structure, and probably
the UI too.
The free fields technique, also called second-level data or EAV (Entity-Attribute-
Value) technique is sometimes used in this case. To summarize this technique, we
use a column whose value is a column name by itself.
Even if this technique is shown here, I do not recommend
using it, for the reasons explained in the Pitfalls of the Free
Fields Technique section below.
The difference between this technique and our car_event table is that, for
car_event, the various attributes can all be related to a common subject, which is
the event. On the contrary, free fields can store any kind of dissimilar data. This
might also be a way to store data specific to a single instance or row of a table.
In the following example, we use the car_free_field table to store unplanned
information about the car whose internal_number is 412. The weight and special
paint had not been planned, so the UI gave the user the chance to specify which
information they want to keep, and the corresponding value. We see here a
screenshot from phpMyAdmin but most probably, another UI would be presented
to the user – for example the salesperson who might not be trained to play at the
database level.
CREATE TABLE `car_free_field` (
`internal_number` int(11) NOT NULL,
`free_name` varchar(30) NOT NULL,
`free_value` varchar(30) NOT NULL,
PRIMARY KEY ('internal_number','free_name')
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `car_free_field` VALUES (412, 'weight', '2000');
INSERT INTO `car_free_field` VALUES (412, 'special paint needed',
'gold');
Pitfalls of the Free Fields Technique
Even if it’s tempting to use this kind of table for added fl exibility and to avoid user
interface maintenance, there are a number of reasons why we should avoid using it.
- It becomes impossible to link this “column” (for example the special paint
needed) to a lookup table containing the possible colors, with a foreign
key constraint. - The free_value field itself must be defined with a generic field type like
VARCHAR whose size must be wide enough to accommodate all values for all
possible corresponding free_name values. - It prevents easy validation (for a weight, we need a numeric value).
- Coding the SQL queries on these free fields becomes more complex – i.e.
SELECT internal_number from car_free_field where
free_name = ‘weight’ and free_value > 2000.
Naming Recommendations
Here we touch a subject that can become sensitive. Establishing a naming convention
is not easily done, because it can interfere with the psychology of the designers.
Designer’s Creativity
Programmers and designers usually think of themselves as imaginative, creative
people; UI design and data model are the areas in which they want to express
those qualities. Since naming is writing, they want to put a personal stamp to the
column and table names. This is why working as a team for data structure design
necessitates a good dose of humility and achieves good results only if everyone is a
good team player.
Also, when looking at the work of others in this area, there is a great temptation to
improve the data elements names. Some discipline in the standardization has to be
applied and all the team members have to collaborate.
Abbreviations
Probably because older database systems had severe restrictions about the
representation of variables and data elements in general, the practice of abbreviating
has been taught for many years and is followed by many data structure designers
and programmers. I used programming languages that accepted only two characters
for variable names – we had to extensively comment the correspondence between
those cropped variables and their meaning.
Nowadays, I see no valid reasons for systematically abbreviating all column and
table names; after all, who will understand the meaning of your T1 table or your
B7 field?
Clarity versus Length: an Art
A consistent style of abbreviations should be used. In general, only the most
meaningful words of a sentence should be put into a name, dropping prepositions,
and other small words. As an example, let’s take the postal code. We could express
this element with different column names:
- the_postal_code
- pstl_code
- pstlcd
- postal_code
I recommend the last one for its simplicity.
Suffixing
Carefully chosen suffixes can add clarity to column names. As an example,
for the date of first payment element, I would suggest first_payment_date. In fact,
the last word of a column name is often used to describe the type of content – like
customer_no, color_code, interest_amount.
The Plural Form
Another point of controversy for table names: should we use the plural form cars
table? It can be argued that the answer is yes because this table contains many cars
– in other words, it is a set. Nonetheless, I tend not to use the plural form for the
simple reason that it adds nothing in terms of information. I know that a table is a
set, so using the plural form would be redundant. It can be said also that each row
describes one car.
If we consider the subject on the angle of queries, we can draw different
conclusions depending on the query. A query referring to the car table –
select car.color_code from car where car.id = 34 is more elegant if the
plural form is not used, because the main idea here is that we retrieve one car
whose id equals 34. Some other queries might make more sense with a plural, like
select count(*) from cars.
As a conclusion for this section, the debate is not over, but the most important point
is to choose a form and be consistent throughout the whole system.
Naming Consistency
We should ensure that a data element that is present in more than one table is
represented everywhere by the same column name. In MySQL, a column name does
not exist by itself; it is always inside a table. This is why, unfortunately, we cannot
pick up consistent column names from, say, a pool of standardized column names
and associate it with the tables. Instead, during each table’s creation we indicate
the exact column names we want and their attributes. So, let’s avoid using different
names – internal_number and internal_num when they refer to the same reality.
An exception for this: if the column’s name refers to a key in another table – the
state column – and we have more than one column referring to it like
state_of_birth, `state_of_residence`.
MySQL’s Possibilities versus Portability
MySQL permits the use of many more characters for identifiers – database, table,
and column names than its competitors. The blank space is accepted as are accented
characters. The simple trade-off is that we need to enclose such special names
with back quotes like ‘state of residence’. This procures a great liberty in the
expression of data elements, especially for non-English designers, but introduces a
state of non-portability because those identifiers are not accepted in standard SQL.
Even some SQL implementations only accept uppercase characters for identifiers.
I recommend being very prudent before deciding to include such characters.
Even when staying faithful to MySQL, there has been a portability issue between
versions earlier than 4.1 when upgrading to 4.1. In 4.1.x, MySQL started to represent
identifiers internally in UTF-8 code, so a renaming operation had to be done to
ensure that no accented characters in the database, table, column and constraint
names were present before the upgrade. This tedious operation is not very practical
in a 24/7 system availability context.
Table Name into a Column Name
Another style I often see: one would systematically add the table name as a prefix
to every column name. Thus the car table would be comprised of the columns:
car_id_number, car_serial_number. I think this is redundant and it shows its
inelegance when examining the queries we build:
select car_id_number from car
is not too bad, but when joining tables we get a query such as
select car.car_id_number,
buyer.buyer_name
from car, buyer
Since at the application level, the majority of queries we code are multi-tables like
the one used above, the clumsiness of using a table name even abbreviated as part of
column names becomes readily apparent. Of course, the same exception we saw in
the Naming Consistency section applies: a column – foreign key – referring to a lookup
table normally includes this table’s name as part of the column’s name. For example,
in the car_event table, we have event_code which refers to the code column in
table event.
Summary
To get a clear and understandable data structure, proper data elements naming is
important. We examined many techniques to apply in order to build consistent table
and column names.