Q & A: What’s a Data Warehouse

Q: Can you describe the term “Data Warehouse”?

A: Well, to understand the data warehouse you have to understand the problem it tries to solve.  Imagine that in your company you have collected disparate, repeating, and inaccurate data.  Your information is basically a mess.  To go into that operational data and start changing it would be complex, mainly because it would disrupt your day-to-day business.  (Some companies have actually stopped developing new applications and reserved a whole year just to do data cleanup. That had determined that the ROI was worth it!)

Suppose you see that employee Mary Wilson now has the name “Mary Smith”(perhaps due to marriage or divorce) and so you change it to be more accurate.  Well what you perhaps didn’t realize is that Marry Smith is referenced in many databases, files and programs.  By changing her marital status in one database you have created accuracy and also have created confusion and contradiction.

For there are applications and programs that display Mary Smith as “Mary Smith for her current name and have a blank in her maiden name.  These programs have no way of knowing that the new name of Wilson should be put in the current name, while “Smith” should go in the maiden name field.  So the result is that in correcting your main database, you’ve made that application less accurate.  This in turn makes reports that look across the whole employee spectrum look wrong or confusing.

For this reason, the data warehouse was invented.  The idea is that you write programs to read your bad data from all databases, correct or clean any anomalies, and put the correct data in this new data warehouse.  So now the data warehouse has the correct version and current status of all employees.  People can stop looking at the old systems information and just look at the data warehouse for the true story.

Q: So does this mean that we can throw away the old databases?

A: No, we leave the old databases and files the way they are.  The reason is that those old databases are still connected to update systems.  And we don’t want to change them because it would take too long and could mess up the whole process of accepting a new employee.  It could disrupt many people in the company.  This way we simply keep those current update systems running, and the old database keeps storing new and old data.  Then let’s say that every evening we run the data warehouse updater updater program looks at one file that has Mary Smith as “Mary Smith” and other that has Mary Smith as “Mary Wilson.”  It looks at the date of the two records and notices that the “Wilson” record has a later date than the “Smith” record.  So it tells the data warehouse to make Primary Name as “Mary Wilson” and “Previous Name as “Mary Wilson.  And then we assure all reports and screen outputs pull information from the data warehouse.

Q: Can a Data Warehouse do other things beside correcting data?

A: Yes it can do many things.  Besides providing cleaner data we can allow reporting and querying to work faster.   Two examples.  Suppose when you access data, you use you run calculations to determining financial interest or the speed of a rocket launch in various conditions.   Furthermore, when you run those calculations, let’s imagine that you have to combine data from two, three or four different databases or computers.  For this reason, you find that you calculations a long time to run.

Well, when we build the warehouse few can pre-calculate the results of these calculations and store them in the warehouse. Then we won’t have the processing overhead. 

Here’s second example.  Suppose you have millions of records representing all book sales over a year, but your only interest is in the sales per month—who bought how much of a given book category each month and year.  Then we could save space and enhance speed by pre-summarizing data to the month level in the data warehouse, and again reduce processing time.

Finally, a data warehouse allows us to do what’s called “slice and dice.”  Suppose we want to analyze sales.  For any sale there is a product, location, customer, vendor, and date.  (Each of those five things is called a “dimension.”  Now in any spreadsheet report you can only have two dimensions, height and width.  Data warehouse query tools allows you, with the click of a button, to show sales by customer within region, product within region, product by customer, product by month, customer by month, etc. 

Furthermore, such tools allow you to do “drilldown.”  Thus, you notice sales for a given product were higher in 2004 than in 2005.  So then you drill down on “2005” and view sales of that one product for each month in 2005.  Suppose that June and July were the lowest month.  So then you analyze those two months by customer to see if some specific customer caused the sudden change.  All of those totals are pre-calculated and allow you to make important decision rather quickly.

Q: Finally, I’ve heard that data warehouses can be un-normalized.  Why is that allowed

A: Yes, that’s a rather common thing to do.   (Note, the viewer can go to the “data model” topic in this website to learn what “normalized” means).   If you’ve seen that page, you’ll remember the diagram that looks like this.

 

There are primarily two reasons why the data warehouse exist without normalization.  First, the data warehouse is updated not by human beings (who can introduce errors) but rather by a computer program.  Normalization is done to keep the database “tight” with all relationships inplace to insure accurate data.  But since the datawarehouse is already cleaned, no normalization is needed.

 

Secondly, by being un-normalized, the data warehouse runs much faster when you query it.  This is because it doesn’t have to join all of the data from all of those external files; it’s already there in the main file.  For example, If we want to print the product customer, and amount for each order, we don’t have to read the order table, then read the product table to find the product name, then read the customer table to get the customer name.  All the data is right there in the Order table.