Oracle Warehouse Builder 11g Getting Started
Competing in today’s world requires a greater emphasis on strategy, long-range
planning, and decision making, and this is why businesses are building data warehouses.
Data warehouses are becoming more and more common as businesses have realized
the need to mine the information that is stored in electronic form. Data warehouses
provide valuable insight into the operation of a business and how best to improve it.
Organizations need to monitor their business processes, define policy, and, at a more strategic level, define the visions and goals that will move the company forward in the future. If
you are new to data warehousing in general, and to Extract, Transform, and Load
(ETL) in particular, and need a way to get started, the Oracle Warehouse Builder is a
great application to use to build your warehouse. The Oracle Warehouse Builder
(OWB) is a tool provided by Oracle that can be used at every stage of the implementation
of a data warehouse right from the initial design and creation of the table structure to ETL
and data-quality auditing.
We will build a basic data warehouse using Oracle Warehouse Builder. It supports all phases of implementing a data warehouse: defining the source and target structures, creating the mappings that move data from source to target, specifying the transformations needed on the data, and building and deploying the code that implements those mappings to load the data. You are free to use any or all of these features in your own implementation.
What This Book Covers
This book is an introduction to the Oracle Warehouse Builder (OWB). It is an introductory, hands-on book, so we cover the features of Oracle Warehouse Builder that we will need to build our first data warehouse.
The chapters follow, in order, the steps required to build a data warehouse. So if you are building your first data warehouse, it is a good idea to read
through each chapter sequentially to gain maximum benefit from the book. Those who
have already built a data warehouse and just need a refresher on some basics can skip
around to whatever topic they need at that moment.
We’ll use a fictional toy company, ACME Toys and Gizmos, to illustrate the concepts
that will be presented throughout the book. This will provide some context to the
information presented to help you apply the concepts to your own organization. We’ll
actually be constructing a simple data warehouse for the ACME Toys and Gizmos
company. At the end of the book, we’ll have all the code, scripts, and saved metadata that were used, so we can rebuild the data warehouse for practice, or use it as a model for building another data warehouse.
Chapter 1: An Introduction to Oracle Warehouse Builder starts off with a high-level look
at the architecture of OWB and the steps for installing it. It covers the schemas created in
the database that are required by OWB, and touches upon some installation topics to
provide some further clarification that is not necessarily found in the Oracle
documentation. Most installation tasks can be found in the Oracle README files and
installation documents, and so they won’t be covered in depth in this book.
Chapter 2: Defining and Importing Source Data Structures covers the initial task in building a data warehouse from scratch, that is, determining what the source of the data will be. OWB needs to know what the source data structures look like and where they are located in order to properly pull data from them. This
chapter also covers how to define the source data structures using the Data Object Editor
and how to import source structure information. It talks about three common sources of
data—flat files, Oracle Databases, and Microsoft SQL Server databases—while
discussing how to configure Oracle and OWB to connect to these sources.
Chapter 3: Designing the Target Structure explains the design of the data warehouse
target. It covers some options for defining a data warehouse target structure using
relational objects (star schemas and snowflake schemas) and dimensional objects
(cubes and dimensions). Some of the pros and cons of using these objects are also covered. It then introduces designing in the Warehouse Builder, starting with the creation of a target user and module.
Chapter 4: Creating the Target Structure in OWB implements the design of the target
using the Warehouse Builder. It has step-by-step explanations for creating cubes and
dimensions using the wizards provided by OWB.
Chapter 5: Extract, Transform, and Load Basics introduces the ETL process by
explaining what it is and how to implement it in OWB. It discusses whether to use a
staging table or not, and describes mappings and some of the main operators in OWB that
can be used in mappings. It introduces the Warehouse Builder Mapping Editor, which is
the interface for designing mappings.
Chapter 6: ETL: Putting it Together is about creating a new mapping using the Mapping
Editor. A staging table is created with the Data Object Editor, and a mapping is created to
map data directly from the source tables into the staging table. This chapter explains how
to add and edit operators, and how to connect them together. It also discusses operator
properties and how to modify them.
Chapter 7: ETL: Transformations and Other Operators expands on the concept of
building a mapping by creating additional mappings to map data from the staging table
into the cube and dimensions. Additional operators are introduced for doing transformations
of the data as it is loaded from source to target.
Chapter 8: Validating, Generating, Deploying, and Executing Objects covers in detail the validation of mappings, the generation of code for mappings and objects,
and deploying the code to the target database. This chapter introduces the Control
Center Service, which is the interface with the target database for controlling this
process, and explains how to start and stop it. The mappings are then executed to actually
load data from source to target. It also introduces the Control Center Manager, which is
the user interface for interacting with the Control Center Service for deploying and
executing objects.
Chapter 9: Extra Features covers some extra features provided in the Warehouse
Builder that can be very useful for more advanced implementations as mappings get
more numerous and complex. The metadata change management features of OWB are
discussed for controlling changes to mappings and objects. This includes the recycle bin,
cutting/copying and pasting objects to make copies or backups, the snapshot feature, and
the metadata loader facility for exporting metadata to a file. Keeping objects
synchronized as changes are made is discussed, and so is the auto-binding of tables to
dimensional objects. Lastly, some additional online references are provided for further
study and reference.
Designing the Target Structure
We now have all of our source structures defined in the Warehouse Builder. But before
we can do anything with them, we need to design what our target data warehouse
structure is going to look like. When we have that figured out, we can start mapping
data from the source to the target. So, let’s design our target structure. First, we’re
going to take a look at some design topics related to a data warehouse that are
different from what we would use if we were designing a regular relational database.
We’ll then discuss what our design will look like, and after that we’ll be ready to
move right into creating that design using the Warehouse Builder in the next chapter.
Data warehouse design
When it comes to the design of a data warehouse, there is basically one option that makes the most sense for how we will structure our database, and that is the dimensional model. This is a way of looking at the data from a business perspective that makes it simple, understandable, and easy to query for the business end user. It doesn’t take a database administrator to retrieve data from it.
When looking at the source databases in the last chapter, we saw a normalized
method of modelling a database. A normalized model removes redundancies in data
by storing information in discrete tables, and then referencing those tables when
needed. This has an advantage for a transactional system because information needs
to be entered at only one place in the database, without duplicating any information
already entered. For example, in the ACME Toys and Gizmos transactional database,
each time a transaction is recorded for the sale of an item at a register, a record
needs to be added only to the transactions table. The details that identify the register, the item, and the employee who processed the transaction do not need to be entered, because that information is already stored in separate tables. The main transaction record just needs to be entered with references to all that other information.
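A minimal sketch of what such normalized tables might look like is shown below; the table and column names are hypothetical and are not taken from the actual ACME schemas:

    -- Hypothetical normalized POS tables: each sale is a single row in
    -- pos_transactions that references the register, item, and employee
    -- by foreign key instead of repeating their details.
    CREATE TABLE registers (
        register_id    NUMBER PRIMARY KEY,
        store_name     VARCHAR2(50),
        location       VARCHAR2(50)
    );

    CREATE TABLE items (
        item_id        NUMBER PRIMARY KEY,
        item_name      VARCHAR2(50),
        department     VARCHAR2(50)
    );

    CREATE TABLE employees (
        employee_id    NUMBER PRIMARY KEY,
        employee_name  VARCHAR2(50)
    );

    CREATE TABLE pos_transactions (
        transaction_id NUMBER PRIMARY KEY,
        register_id    NUMBER REFERENCES registers (register_id),
        item_id        NUMBER REFERENCES items (item_id),
        employee_id    NUMBER REFERENCES employees (employee_id),
        sale_date      DATE,
        sale_amount    NUMBER(10,2)
    );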
This works extremely well for a transactional type of system concerned with daily
operational processing where the focus is on getting data into the system. However,
it does not work well for a data warehouse whose focus is on getting data out of the
system. Users do not want to navigate through the spider web of tables that compose
a normalized database model to extract the information they need. Therefore,
dimensional models were introduced to provide the end user with a flattened structure
of easily queried tables that he or she can understand from a business perspective.
Dimensional design
A dimensional model takes the business rules of our organization and represents
them in the database in a more understandable way. A business manager looking at
sales data is naturally going to think more along the lines of “how many gizmos did
I sell last month in all stores in the south and how does that compare to how many
I sold in the same month last year?” Managers just want to know what the result
is, and don’t want to worry about how many tables need to be joined in a complex
query to get that result. In the last chapter, we saw how many tables would have to
be joined together in such a query just to be able to answer a question like the one
above. A dimensional model removes that complexity and represents the data in a way that end users can relate to more easily from a business perspective.
Users can intuitively think of the data for the above question as a cube, and the edges
(or dimensions) of the cube labeled as stores, products, and time frame. So let’s take a
look at this concept of a cube with dimensions, and how we can use that to represent
our data.
Cube and dimensions
The dimensions become the business characteristics about the sales, for example:
- A time dimension—users can look back in time and check various time periods
- A store dimension—information can be retrieved by store and location
- A product dimension—various products for sale can be broken out
Think of the dimensions as the edges of a cube, and the intersection of the
dimensions as the measure we are interested in for that particular combination of
time, store, and product. A picture is worth a thousand words, so let’s look at what
we’re talking about in the following image:
Notice what this cube looks like. How about a Rubik’s Cube? We’re doing a data
warehouse for a toy store company, so we ought to know what a Rubik’s cube is! If
you have one, maybe you should go get it now because that will exactly model what
we’re talking about. Think of the width of the cube, or a row going across, as the
product dimension. Every piece of information or measure in the same row refers to
the same product, so there are as many rows in the cube as there are products. Think
of the height of the cube, or a column going up and down, as the store dimension.
Every piece of information in a column represents one single store, so there are as
many columns as there are stores. Finally, think of the depth of the cube as the time
dimension, so any piece of information in the rows and columns at the same depth
represent the same point in time. The intersection of each of these three dimensions
locates a single individual cube in the big cube, and that represents the measure
amount we’re interested in. In this case, it’s dollar sales for a single product in a
single store at a single point in time.
But one might wonder if we are restricted to just three dimensions with this model.
After all, a cube has only three dimensions—length, width, and depth. Well, the
answer is no. We can have many more dimensions than just three. In our ACME
example, we might want to know the sales each employee has accomplished for the
day. This would mean we would need a fourth dimension for employees. But what
about our visualization above using a cube? How is this fourth dimension going to
be modelled? And no, the answer is not that we’re entering the Twilight Zone here
with that “dimension not only of sight and sound but of mind…” We can think of
additional dimensions as being cubes within a cube. If we think of an individual
intersection of the three dimensions of the cube as being another cube, we can see
that we’ve just opened up another three dimensions to use—the three for that inner
cube. The Rubik’s Cube example used above is good because it is literally a cube of
cubes and illustrates exactly what we’re talking about.
We do not need to model additional cubes. The concept of cubes within cubes was
just to provide a way to visualize further dimensions. We just model our main cube,
add as many dimensions as we need to describe the measures, and leave it for the
implementation to handle.
This is a very intuitive way for users to look at the design of the data warehouse.
When it’s implemented in a database, it becomes easy for users to query the
information from it.
Implementation of a dimensional model in a database
We have seen how a dimensional model is preferred over a normalized model
for designing a data warehouse. Now before we finalize our model for the ACME
Toys and Gizmos data warehouse, let’s look at the implementation of the model
to see how it gets physically represented in the database. There are two options: a
relational implementation and a multidimensional implementation. The relational implementation, which is the most common for a data warehouse structure, uses tables and foreign keys in the database. The multidimensional
implementation requires a special feature in a database that allows defining cubes
directly as objects in the database. Let’s discuss a few more details of these two
implementations. But we will look at the relational implementation in greater detail
as that is the one we’re going to use throughout the remainder of the book for our
data warehouse project.
Relational implementation (star schema)
Back in Chapter 2, we saw how ACME’s POS Transactional database and Order
Entry databases were structured when we did our initial analysis. The diagrams
presented showed all the tables interconnected, and we discussed the use of foreign
keys in a table to refer to a row in another table. That is fundamentally a relational
database. The term relational is used because the tables in it relate to each other in
some way. We can’t have a POS transaction without the corresponding register it
was processed on, so those two relate to each other when represented in the database
as tables.
For a relational data warehouse design, the relational characteristics are retained
between tables. But a design principle is followed to keep the number of levels of
foreign key relationships to a minimum. It’s much faster and easier to understand if
we don’t have to include multiple levels of referenced tables. For this reason, a data
warehouse dimensional design that is represented relationally in the database will
have one main table to hold the primary facts, or measures we want to store, such as
count of items sold or dollar amount of sales. The descriptive information that places those measures in context is contained in separate tables, which the main table references using foreign keys. The important principle here is that the tables referenced by the main table contain all the information they need, and do not go down any more levels to reference further tables.
The ER diagram of such an implementation would be shaped somewhat like a star,
and thus the term star schema is used to refer to this kind of implementation.
The main table in the middle is referred to as the fact table because it holds the
facts, or measures that we are interested in about our organization. This represents
the cube that we discussed earlier. The tables surrounding the fact table are known
as dimension tables. These are the dimensions of our cube. These tables contain
descriptive information, which places the facts in a context that makes them
understandable. We can’t have a dollar amount of sales that means much to us unless
we know what item it was for, or what store made the sale, or any of a number of
other pieces of descriptive information that we might want to know about it.
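To make this concrete, here is a minimal SQL sketch of such a star schema. The table and column names are purely illustrative (they are not the objects OWB will generate for us later), but they show one fact table holding the measures and three dimension tables reached through a single level of foreign keys:

    -- Hypothetical star schema: the fact table holds the measures and
    -- references each dimension table through one level of foreign keys.
    CREATE TABLE date_dim (
        date_key      NUMBER PRIMARY KEY,
        calendar_date DATE,
        month_name    VARCHAR2(20),
        year_num      NUMBER(4)
    );

    CREATE TABLE store_dim (
        store_key     NUMBER PRIMARY KEY,
        store_name    VARCHAR2(50),
        region        VARCHAR2(30)
    );

    CREATE TABLE product_dim (
        product_key   NUMBER PRIMARY KEY,
        product_name  VARCHAR2(50),
        department    VARCHAR2(50)
    );

    CREATE TABLE sales_fact (
        date_key      NUMBER REFERENCES date_dim (date_key),
        store_key     NUMBER REFERENCES store_dim (store_key),
        product_key   NUMBER REFERENCES product_dim (product_key),
        quantity_sold NUMBER,        -- measure: count of items sold
        sales_amount  NUMBER(12,2)   -- measure: dollar amount of sales
    );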
It is the job of data warehouse design to determine what pieces of information need
to be included. We’ll then design dimension tables to hold the information. Using the
dimensions we referred to above in our cube discussion as our dimension tables, we
have the following diagram that illustrates a star schema:
Of course our star only has three points, but with a much larger data warehouse of
many more dimensions, it would be even more star-like. Keep in mind the principle
that we want to follow here of not using any more than one level of foreign key
referencing. As a result, we are going to end up with a de-normalized database
structure. We discussed normalization back in Chapter 2, which involved the use
of foreign key references to information in other tables to lessen the duplication
and improve data accuracy. For a data warehouse, however, query time and simplicity are of paramount importance and outweigh the cost of duplicating data. As for data accuracy, it’s a read-only database, so we can take care of that up front when we load
the data. For these reasons, we will want to include all the information we need right
in the dimension tables, rather than create further levels of foreign key references.
This is the opposite of normalization, and thus the term de-normalized is used.
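To see what this buys us at query time, here is a hedged sketch of the kind of question the business manager asked earlier, expressed against the hypothetical star schema above. Every piece of descriptive data is exactly one join away from the fact table (the product name, region, month, and years in the predicates are made up for illustration):

    -- How many gizmos did we sell in all stores in the south in a given
    -- month, compared with the same month the previous year?
    SELECT d.year_num,
           SUM(f.quantity_sold) AS units_sold,
           SUM(f.sales_amount)  AS dollars_sold
    FROM   sales_fact  f
    JOIN   product_dim p ON p.product_key = f.product_key
    JOIN   store_dim   s ON s.store_key   = f.store_key
    JOIN   date_dim    d ON d.date_key    = f.date_key
    WHERE  p.product_name = 'Gizmo'
    AND    s.region       = 'South'
    AND    d.month_name   = 'January'
    AND    d.year_num     IN (2007, 2008)
    GROUP BY d.year_num;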
Let’s look at an example of this for ACME Toys and Gizmos to get a better
idea of what we’re talking about with this concept of de-normalization. Every
product in our stores is associated with a department. If we have a dimension for
product information, one of the pieces of information about the product would
be the department it is in. In a normalized database, we would consider creating
a department table to store department descriptions with one row for each
department, and would use a short key code in the product table to refer to the department record.
However, in our data warehouse, we would include that department information,
description and all, right in the product dimension. This will result in the same
information being duplicated for each product in the department. What that buys
us is a simpler structure that is easier to query and more efficient for retrieving
information from, which is key to data warehouse usability. The extra space we
consume in repeating the information is more than paid for in the improvement in
speed and ease of querying the information. That will result in greater acceptance of the data warehouse by the user community, who will find it more intuitive and their data easier to retrieve.
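As a rough sketch of the difference (again with made-up table and column names), the normalized design keeps the department in its own table and points to it with a key code, while the de-normalized product dimension simply carries the department columns on every row:

    -- Normalized (transactional) design: department details live in
    -- their own table; the product row carries only a key code.
    CREATE TABLE departments (
        dept_code        NUMBER PRIMARY KEY,
        dept_name        VARCHAR2(50),
        dept_description VARCHAR2(200)
    );

    CREATE TABLE products (
        product_id   NUMBER PRIMARY KEY,
        product_name VARCHAR2(50),
        dept_code    NUMBER REFERENCES departments (dept_code)
    );

    -- De-normalized (data warehouse) design: the department description
    -- is repeated on every product row, so no second join is needed.
    CREATE TABLE product_dim (
        product_key      NUMBER PRIMARY KEY,
        product_name     VARCHAR2(50),
        dept_name        VARCHAR2(50),
        dept_description VARCHAR2(200)
    );

A query for a product’s department now touches only the product dimension itself.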
In general, we will want to de-normalize our data warehouse implementation, but there is the possibility that we might want to include another level—basically a dimension table referenced by another dimension table. In most cases, we will neither need nor want to do this, and instances should be kept to an absolute minimum; but there are some cases where it might make sense.
This is a variation of the star schema referred to as a snowflake schema because
with this type of implementation, dimension tables are partially normalized to pull
common data out into secondary dimension tables. The resulting schema diagram
looks somewhat like a snowflake. The secondary dimension tables are the tips of the snowflake hanging off the main dimension tables in a star schema.
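A minimal sketch of that variation (hypothetical names again) pulls the department columns back out of the product dimension into a secondary dimension table hanging off it:

    -- Snowflake variation: the product dimension is partially normalized,
    -- with the common department data moved to a secondary dimension table.
    CREATE TABLE department_dim (
        dept_key         NUMBER PRIMARY KEY,
        dept_name        VARCHAR2(50),
        dept_description VARCHAR2(200)
    );

    CREATE TABLE product_dim (
        product_key  NUMBER PRIMARY KEY,
        product_name VARCHAR2(50),
        dept_key     NUMBER REFERENCES department_dim (dept_key)
    );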
In reality, we’d want at most one or two secondary dimension tables, but this serves to illustrate the point. A snowflake dimension table is not recommended in most cases because of ease-of-use and performance considerations, and should be used only in very limited circumstances. The Kimball book on Dimensional Modelling, referred to at the beginning of Chapter 2, discusses some limited circumstances where it might be acceptable to implement a snowflake design, but it is highly discouraged for most cases.
Let’s now talk a little bit about the multidimensional implementation of a
dimensional model in the database, and then we’ll design our cube and dimensions
specifically for the ACME Toys and Gizmos Company data warehouse.
Multidimensional implementation (OLAP)
A multidimensional implementation, or OLAP (online analytical processing), requires a database with special features that allow it to store cubes as
actual objects in the database, and not just tables that are used to represent a cube
and dimensions. It also provides advanced calculation and analytic content built
into the database to facilitate advanced analytic querying. Oracle’s Essbase product
is one such database and was originally developed by Hyperion. Oracle recently
acquired Hyperion, and is now promoting Essbase as a tool for custom analytics
and enterprise performance management applications. Oracle Database Enterprise Edition also has a separately licensed feature called OLAP that embeds a full-featured OLAP server directly in the database. This is an option organizations can leverage to make use of their existing database.
These kinds of analytic databases are well suited to providing the end user with
increased capability to perform highly optimized analytical queries of information.
Therefore, they are quite frequently utilized to build a highly specialized data
mart, or a subset of the data warehouse, for a particular user community. The data
mart then draws its data to load from the main data warehouse, which would be a
relational dimensional star schema. A data warehouse implementation may contain
any number of these smaller subset data marts.
We’ll be designing dimensionally and implementing relationally, so let’s now
design our actual dimensions that we’ll need for our ACME Toys and Gizmos data
warehouse, and talk about some issues with the fact data (or cube) that we’ll need.
This will make the concepts we just discussed more concrete, and will form the basis
for the work we do in the rest of the book as we implement this design. We’ll then
close out this chapter with a discussion on designing in the Warehouse Builder,
where we’ll see how it can support either of these implementations.
We have seen the word dimension used in describing both a relational
implementation and a multidimensional implementation. It is even in the
name of the second implementation method we discussed, so why does
the relational method use it also? In the relational case, the word is used
more as an adjective to describe the type of table taken from the name of
the model being implemented; whereas in the multidimensional model
it's more a noun, referring to the dimension itself that actually gets
created in the database. In both cases, the type of information conveyed
is the same—descriptive information about the facts or measures—so its
use in both cases is really not contradictory. There is a strong correlation
between the fact table of the relational model and the cube of the
dimensional model, and between the dimension tables of the relational
model and the dimensions of the dimensional model.