Designing the ACME data warehouse
We have chosen to use a dimensional model for our data warehouse, so we’ll define
a cube with dimensions to represent our information. Let’s lay out a basic structure
of information we want each to contain. We’ll begin with the dimensions, since they
are going to provide the context for the measure(s) we will want to store in our cube.
Identifying the dimensions
To know what dimensions to design for, we need to know what business process
we’re going to be supporting with our data warehouse. Is management concerned
with daily inventory? How about daily sales volume? This information will guide us
in selecting the correct parts of the business to model with our dimensions.
We are going to support the sales managers in managing the daily sales of the ACME
Toys and Gizmos Company, and they have already given us an example of the kind
of question they want answered from their data warehouse, as we saw earlier. We
used that to illustrate the cube concept and to show a star schema representation
of it, so the information shows us the dimensions we need. Since management is
concerned with daily sales, we need some kind of date/time dimension that will
provide us the context for the sales data indicating what day the sale transaction
took place.
We can pretty much be guaranteed that we will need a
time/date type dimension for any data warehouse we
design, since one of the main features of data warehouses
is to provide time-series type analytical query capabilities
(as we talked about earlier).
Are we going to need both the time and the date in this dimension, or will just the
date be sufficient? We can get an answer to this question by also looking back at
our business process, which showed that management is concerned with daily sales
volume. Also, the implementation of the time dimension in OWB does not include
the time of day since it would have to include 24 hours of time values for each day
represented in the dimension due to the way it implements the dimension. In the
future if time is needed, there are options for creating a separate dimension just
for modelling time of day values. For our initial design we’ll call our time related
dimension a Date dimension just for added clarity.
Another dimension we have included is to model the product information. Each
sale transaction is for a particular product, and management has indicated they
are concerned about seeing how well each product is selling. So we will include a
dimension that we shall call Product. At a minimum we need the product name, a
description of the product, and the cost of the product as attributes of our product
dimension—so we’ll include those in our logical model.
So far we have a Date dimension to represent our time series and a Product
dimension to represent the items that are sold. We could stop there. Management
would then be able to query for sales data for each day for each product sold by
ACME Toys and Gizmos, but they wouldn’t be able to tell where the sale took place.
Another key piece of information the management would like to be able to retrieve
is how well the stores are doing compared to each other for daily sales. Unless we
include some kind of a location dimension, they will not be able to tell that. That
is why we have included a third dimension called Store. It is used to maintain the
information about the store that processed the sales transaction. For attributes of
the store dimension, we can include the store name and address at a minimum to
identify each store.
These dimensions should be enough to satisfy the management’s need for querying
information for this particular business process—the daily sales. We could certainly
include a large number of other dimensions, but we’ll stop here to keep this simple
for our first data warehouse. We can now consider designing the cube and what
information to include in it.
Designing the cube
In the case of the ACME Toys and Gizmos Company, we have seen that the main
measure the management is concerned about is daily sales. There are other numbers
we could consider such as inventory numbers: How much of each item is on hand?
However, the inventory is not directly related to daily sales and wouldn’t make
sense here. We can model an inventory system in a data warehouse that would be
separate from the sales portion. But for our purpose, we’re going to model the sales.
Therefore, our main measure is going to be the dollar amount of sales for each item.
A very important topic to consider at this point is what will be the grain of the
measure—the sales data—that we’re going to store in our cube? The grain (or
granularity) is the level that the sales number refers to. Since we’re using sales as
the measure, we’ll store a sales number; and from our dimensions, we can see that
it will be for a given date in a given store and for a given product. Will that number
be the total of all the sales for that product for that day? Yes, so it satisfies our design
criteria of providing daily sales volume for each product. That is the smallest and
lowest level of sales data we want to store. This is what we mean by the grain or
granularity of the data.
Levels/hierarchies
A dimensional model is naturally able to handle this concept of the
different levels of data by being able to model a hierarchy within a
dimension. The time/date dimension is an easy example of using of
various levels. Add up the daily totals to get the totals for the month,
and add up 12 monthly totals to get the yearly sales. The time/date
dimension just needs to store a value to indicate the day, month, and
year to be able to provide a view of the data at each of those levels.
Combining various levels together then defines a hierarchy. By storing
data at the lowest level, we make available the data for summing at
higher levels. Likewise, from a higher level, the data is then available
to drill down to view at a lower level. If we were to arbitrarily decide
to store the data at a higher level, we would lose that fl exibility.
We'll discuss this further in the next chapter when we build our time
dimension in the Warehouse Builder.
In this case, we have a source system—the POS Transactional system—that
maintains the dollar amount of sales for each line item in each sales transaction
that takes place. This can provide us the level of detail we will want to capture
and maintain in our cube, since we can definitely capture sales for each product at
each store for each day. We have found out that the POS Transactional system also
maintains the count of the number of a particular item sold in the transaction. This is
an additional measure we will consider storing in our cube also, since we can see that
it is at the same grain as the total sales. The count of items would still pertain to that
single transaction just like the sales amount, and can be captured for each product,
store, and even date.
The only other pieces of information our cube is going to contain are pointers to
the dimensions. In the relational model, the fact table would contain columns for
the dollar amount, the quantity, the unit cost, and then foreign keys for each of the
dimension tables.
There are times when it's valid in dimensional design to include more
descriptive information right in the cube, rather than create a dimension
for it. There may be some particularly descriptive piece of information
that stands all by itself, which is not associated with anything else or
whose additional descriptive information has already been included in
other dimensions. In that case, it wouldn't make sense to create a whole
dimension just for it; so it is included directly in the fact table or cube.
This is referred to as a degenerate dimension. It is explained in more
detail in the Kimball book on dimensional modelling we talked about
earlier. There are many other aspects to dimensional design that we don't
have the space to cover here, but are covered in the Kimball book in
more detail. It would be a good idea for you to read this book or a similar
one to get a better understanding of the detailed dimensional modelling
concepts such as this.
Our design is drawn out in a star schema configuration showing the cube, which is
surrounded by the dimensions with the individual items of information (attributes)
we’ll want to store for each. It looks like the following:
OK, we now have a design for our data warehouse. It’s time to see how OWB can
support us in entering that design and generating its physical implementation in
the database.
Data warehouse design in OWB
The Warehouse Builder contains a number of objects, which we can use in designing
our data warehouse, that are either relational or dimensional. OWB currently
supports designing a target schema only in an Oracle database, and so we will find
the objects all under the Oracle node in the Project Explorer. Let’s launch Design
Center now and have a look at it. But before we can see any objects, we have to have
an Oracle module defined to contain the objects. If you’ve been following along and
working through the examples in this book, so far you should have one module
already defined for the ACME web site orders database—ACME_WS_ORDERS. We
created this in the last chapter when we imported our metadata from that source.
If that is the case, our Project Explorer window will look similar to the following:
Creating a target user and module
We need a different module to create our target objects in. So before going any
further, let’s create a new module in the Project Explorer for our target to hold our
data warehouse design objects. However, before we can do that, we should have
a target schema defined in the database that will hold our target objects when we
deploy them.
So far we have discussed many different components such as the
repository, workspaces, the design center, and so on. So, it can be
confusing to know exactly where our main data warehouse is going
to be located. The target schema is going to be the main location for the
data warehouse. When we talk about our "data warehouse" after we
have it all constructed and implemented, the target schema is what we
will be referring to. Amid all these different components we discussed
that compose the Warehouse Builder, the target is where the actual data
warehouse will be built. Our design will be implemented there, and
the code will be deployed to that schema by OWB to load the target
structure with data from the sources.
Every target module must be mapped to a target user schema. Back in Chapter 1,
when we ran the Repository Assistant to create the repository and workspace, we
created the acmeowb user as the repository owner and mentioned that this user can
be a deployment target for our data warehouse. However, it does not have to be the
target user. It’s a good idea to create a separate user schema to become the target
so that user roles in our database can be kept separate. Using the OWB repository
owner schema would mean our target data warehouse would have to be on the same
database server as our repository. In large installations, that will most likely not be
the case. So for maximum fl exibility, we’re going to create a separate user schema.
In our case, that user will be created in the same database as the repository; but it
can be moved to another database easily if we expand and add more servers.
Create a target user
There are a couple of ways we can go about creating our target user—create the
user directly in the database and then add to OWB, or use OWB to physically create
the user. If we have to create a new user, and if it’s on the same database as our
repository and workspaces, it’s a good idea to use OWB to create the user, especially
if we are not that familiar with the SQL command to create a user. However, if our
target schema were to be in another database on another server, we would have to
create the user there. It’s a simple matter of adding that user to OWB as a target,
which we’ll see in a moment. Let’s begin in the Design Center under the Global
Explorer. We talked about that Global Explorer back in our introduction to the
Design Center in Chapter 2. There we said it was for various objects that pertained
to the workspace as a whole.
One of those object types is a Users object that exists under the Security node as
shown here:
Right-click on the Users node and select New… to launch the Create User dialog
box as shown here:
With this dialog box, we are creating a workspace user. We create a workspace user
by selecting a database user that already exists or create a new one in the database. If
we already had a target user created in the database, this is where we would select it.
We’re going to click on the Create DB User… button to create a new database user.
We need to enter the system username and password as we need a user with DBA
privileges in the database to be able to create a database user. We then enter a
username and password for our new user. As we like to keep things basic, we’ll call
our new user ACME_DWH, for the ACME data warehouse. We can also specify the
default and temporary tablespace for our new user, which we’ll leave at the defaults.
The dialog will appear like the following when completely filled in:
The new user will be created when you click on the OK button, and will appear in
the righthand window of the Create User dialog already selected for us. Click on the
OK button and the user will be registered with the workspace, and we’ll see the new
username if we expand the Users node under Security in the Global Explorer. We
can continue with creating our target module now that we have a user defined in the
database to map to.
Notice that the previous dialog boxes don't have any way to specify the
database location information. This is because it creates the user on the
local database we were connected to when we logged into the Design
Center, which is the location of our repository and workspaces. Due to
this, this method can only be used to create the user if it is on the local
database. In the next section where we create our target module, we'll
get to specify the location and that dialog box will allow us to specify
a remote database if needed.
Create a target module
We’ll follow the same steps as we did in the last chapter where we created the
ACME_WS_ORDERS module. Right-click on the Oracle object under Databases and
select New… from the pop-up menu to launch the Create Module Wizard and step
through the process. We’ll name this module ACME_DWH for ACME Data Warehouse.
An important difference between creating this module and the module
we created in the last chapter is that we need to designate this one as a
target module. On the Step 1 of 2 screen, be sure to select Warehouse
Target as the type rather than Data Source.
Create the location in the next step, which is similar to how we created it in the last
chapter, but be sure to specify the database that is our target database instead of the
source we used last time. If we’re creating our own test system, the source location
may very well be the same as our target. But in real-world situations, it will likely be
in a different database on a different server. If we had created a target user schema
on a different database, this is the point at which we would be able to enter the
connection information for that user in order to associate our target module with
that user and make it a target.
For reference, the location screen should look like the following for defining the
location of the target module:
We have specified the User Name to be the user we just created for this very
purpose in the previous section.
In this dialog box, we can see the location information (Host, Port
and Service Name) that we can use to specify a user in another
database if needed. If our user were not in this database, we would
just enter his or her appropriate host and port for the location and
the service name of that remote database.
Now that we have our target database schema and a target module defined, which
is associated with a location pointing to that target schema, we will now have
two Oracle modules under our Oracle object in Project Explorer. We can continue
our discussion of the design objects available to us in the Warehouse Builder for
designing our database. First, let’s make sure we save our work so far by using the
Ctrl+S key combination or by selecting Design | Save All from the main menu.
OWB design objects
Looking at our Project Explorer window with our target Oracle module expanded,
we can see a number of objects that are available to us as shown here:
There are objects that are relational such as Tables, Views, Materialized Views,
and Sequences. Also, there are dimensional objects such as Cubes and Dimensions.
We just discussed relational objects versus dimensional objects. We have decided to
model our database dimensionally and this will dictate the objects we create. From the
standpoint of providing the best model of our business rules and representing what
users want to see, the dimensional method is the way to go as we already discussed.
Most data warehouse implementations we encounter will use a dimensional design.
It just makes more sense for matching the business rules the users are familiar with
and providing the types of information the user community will want to extract from
the database.
We are thinking dimensionally in our design, but what about the underlying
physical implementation? We discussed the difference between the relational and
multidimensional physical implementation of a database, and now it’s time to see
how we will handle that here. The Warehouse Builder can help us tremendously
with that because it has the ability to design the objects logically using cubes and
dimensions in a dimensional design. It also has the ability to implement them
physically in the underlying database as either a relational structure or a dimensional
structure simply by checking a box.
In general, which option should be chosen? The relational
implementation is best suited to large amounts of data that tend to
change more frequently. For this reason, the relational implementation
is usually chosen for the main data warehouse schema by most
implementers of a data warehouse. It is much better suited to handling
the large volumes of data that are imported frequently into the data
warehouse. The multidimensional implementation is better suited to
applications where heavy analytic processing is required, and so is a
good candidate for the data marts that will be presented to users.
To be able to implement the design physically as a dimensional implementation with
cubes and dimensions, we need a database that is designed specifically to support
OLAP as we discussed previously. If that is not available, then the decision is made
for us. In our case, when we installed the Oracle database in Chapter 1, we installed
the Enterprise Edition with default options, and that includes the OLAP feature
in the database, so we have a choice to make. Since we’re installing our main data
warehouse target schema, we’ll choose the relational implementation.
For a relational implementation, the Warehouse Builder actually provides us two
options for implementing the database: a pure relational option and the relational
OLAP option. If we were to have the OLAP feature installed in our database, we
could choose to still have the cubes and dimensions implemented physically in
a relational format. We could have it store metadata in the database in the OLAP
catalog, and so multidimensional features such as aggregations would be available
to us. We could take advantage of the relational implementation of the database for
handling large volumes of data, and still implement a query or reporting tool such
as Oracle Discoverer to access the data that made use of the OLAP features. The pure
relational option just depends on whether we choose to deploy only the data objects
and not the OLAP metadata. In reality, most people choose either the pure relational
or the multidimensional. If they want both, they implement separate data marts. In
fact, the default when creating dimensional objects and selecting relational for the
implementation is to only deploy data objects.
Just to be clear, does all this mean that if we haven’t paid for the OLAP feature for
our database, we can only design our data warehouse using the relational objects;
and therefore must our decision to design dimensionally change? The answer to
that would be an emphatic no, since we just mentioned how OWB will let us design
dimensional objects, cubes and dimensions, and then implement them physically in
the database as relational objects. The benefit is that the same dimensional design
can be implemented at a later time in an OLAP database just by changing a single
setting. There are features of the Warehouse Builder for handling dimensional
features automatically for us, such as levels, surrogate keys, and slowly changing
dimensions (all of which we’ll talk about later) that designing dimensionally
provides us. We would have to implement these manually if we designed our own
tables. Most people who use the Warehouse Builder will use it in that way, so we’ll
definitely want to make use of that feature to maximize the usefulness of the tools to
us. This provides us with fl exibility and it is the way we are going to proceed with
our design. We’ll design dimensionally using a cube and dimensions, and then can
implement it either relationally or dimensionally when we’re ready.