
OLAP Demo

In this video we'll be doing a live demonstration of OLAP. We'll create a star schema similar to the one we used in our previous examples: sales data with a fact table and several dimension tables. Then we'll do several different types of OLAP queries. We'll show a full star join over the star schema. We'll show some examples of drill-down and roll-up. We'll also show what's known as slicing and dicing, which we haven't explained yet; we'll explain it when we get to those queries. And finally, we'll show the WITH CUBE and WITH ROLLUP clauses that have been added to the SQL standard for OLAP queries.

For those we'll be using MySQL; actually, for the whole demo we'll be using MySQL. MySQL supports WITH ROLLUP. Neither of the other systems we've been using, SQLite or PostgreSQL, supports WITH ROLLUP yet. And MySQL does not yet support WITH CUBE, but we'll see that we can simulate the behavior of WITH CUBE using MySQL's WITH ROLLUP. So let's move to the demo.

Here we are in the demo. We have a fact table that contains information about stores, items, and customers. We don't see much here except their ID values, and we have sixty tuples loaded up in our fact table. Our dimension tables are the customer table, with four customers, each with a gender and an age; the item table, with five items, each with a category (just t-shirts and jackets) and a color (just blue and red); and finally our stores: six stores in various cities in Santa Clara County and San Mateo County in California, and King County in Washington.
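To make the queries that follow concrete, here's a minimal sketch of this star schema; the table and attribute names (Sales, Store, Item, Customer, storeID, and so on) are assumptions, since the actual script isn't shown on screen:

    -- Dimension tables (attributes inferred from the demo)
    CREATE TABLE Store(storeID INT PRIMARY KEY,
                       city VARCHAR(20), county VARCHAR(20), state VARCHAR(20));
    CREATE TABLE Item(itemID INT PRIMARY KEY,
                      category VARCHAR(20), color VARCHAR(20));
    CREATE TABLE Customer(custID INT PRIMARY KEY,
                          custName VARCHAR(20),  -- name attribute assumed from a later query
                          gender CHAR(1), age INT);
    -- Fact table: one tuple per sale, with foreign keys into the dimensions
    CREATE TABLE Sales(storeID INT, itemID INT, custID INT,
                       price DECIMAL(6,2));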

Our first query is simply the complete star join. We're joining all four tables, using the foreign key references in the fact table to the keys in the dimension tables. Now, since these are foreign key references, you can think of this star join as simply extending the tuples in the sales table with further information about the store, item, and customer mentioned in each one.
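As a sketch, using the assumed names above, the full star join looks like this:

    SELECT *
    FROM Sales F, Store S, Item I, Customer C
    WHERE F.storeID = S.storeID   -- foreign key into Store
      AND F.itemID  = I.itemID    -- foreign key into Item
      AND F.custID  = C.custID;   -- foreign key into Customer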

So, here we go. And we should expect, again, sixty tuples in the result, one for each tuple in the fact table. We can see that we have the information from the fact table, but then we also have more information that we've joined in from the dimension tables.

Now, it's not typical to do the complete star join; usually we would at least constrain the star join in some way. So, our next query will do selections and projections on the join. We're going to limit ourselves to sales where the state of the sale is California, the category is t-shirt, and the age of the customer is less than 22; the last condition is actually over the fact table, saying the price of the purchase was less than 25. And we'll return the city, color, customer name, and price of the item.
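A sketch of the constrained join (custName is an assumed attribute name):

    SELECT S.city, I.color, C.custName, F.price
    FROM Sales F, Store S, Item I, Customer C
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID AND F.custID = C.custID
      AND S.state = 'California'
      AND I.category = 'T-shirt'
      AND C.age < 22
      AND F.price < 25;   -- this last condition is on the fact table itself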

We run the query, and we see now that we have just 5 sales that meet our criteria.

So, we've seen the join with constraints on the join, but what's even more common in OLAP applications is to do grouping and aggregation in order to analyze the data. We're going to start by looking at grouping and aggregation involving the dimension attributes in the group by, with the dependent attribute being aggregated. This is a very common form of query. So here we'll say we want to analyze our data based on combinations of stores and customers. We'll group by store ID and customer ID, and then we'll sum up to see the total sales for each combination of store and customer.
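A sketch of that grouping query:

    SELECT storeID, custID, SUM(price)
    FROM Sales
    GROUP BY storeID, custID;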

So, here's our result. Now, maybe it's not that meaningful to us right now, just looking at the store IDs and customer IDs, but for an analyst it might be meaningful. We'll see in a moment grouping and aggregation on other attributes that look a little more interesting.

So now I'll demonstrate the concept of drilling down. You might remember that drilling down says we're looking at data that's aggregated in some fashion, and we decide we want more detail. We get more detail by adding more attributes to the group by. Let's say in this case I want to break out not only by store and customer, but also by item. So, I'll add item ID to the group by, and also to the select, and when I run this query, I see that I get more results, and I now have more detail in the analysis of the data.
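The drilled-down query, as a sketch:

    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales
    GROUP BY storeID, itemID, custID;   -- itemID added for more detail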

Now, I don't have 60 tuples in my result, even though I'm grouping by all three dimension attributes, because my fact table does have more than one tuple for some combinations of store, item, and customer.

Next, I'm going to introduce a query that shows the concept called slicing. Slicing, of course, evokes the data cube. We talked about the data cube in the previous video, and we specifically used this example to demonstrate a three-dimensional data cube. The idea of a slicing query is a query that analyzes a slice of the cube, and it does that by constraining one of the dimensions.

So what I'm going to do is add to this query a constraint that says: let's only consider sales that are from the state of Washington. When I add that, we'll continue to do the grouping and aggregation, but only on the slice of the cube representing Washington stores, that is, the sales that are from Washington stores.

So I made a few changes to the query. The first thing we see is that I added the store relation to the from clause: in order to constrain the state of the sale, I have to join with the store dimension table so that I can access the value of state, which is not present in the sales table. I also had to add variables, and I added the join condition for the sales table to join with the dimension table, and then I added the constraint that the state is Washington. The last small change is to add a variable to the store ID, so that I don't get an ambiguous-reference error.
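A sketch of the slicing query:

    SELECT F.storeID, itemID, custID, SUM(price)
    FROM Sales F, Store S
    WHERE F.storeID = S.storeID    -- join with the store dimension table
      AND S.state = 'Washington'   -- the slice: one dimension constrained
    GROUP BY F.storeID, itemID, custID;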

So now let's go ahead and run the query, and we see that we do get a subset of what we had before. That subset is the slice of the data cube, if you want to think of it that way, representing the sales in Washington State.

Now, in addition to slicing, there is also a notion of dicing. Again, if you imagine the data cube, what a dice does is slice in two dimensions, and it gives you a chunk of the cube. So I'm going to add a condition on the item being sold: the item must be red. Then I'll be looking at the chunk of the data cube that identifies red items purchased in the state of Washington.

So here's the query. I've added a third relation to the from clause, the item dimension table, so that I can constrain the color to be red. I've added a join condition joining the fact table to the item dimension table, and I've added the condition that the color is red. And finally, I again had to add variable names to make sure I didn't get ambiguous attribute references.
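A sketch of the dicing query:

    SELECT F.storeID, F.itemID, custID, SUM(price)
    FROM Sales F, Store S, Item I
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID
      AND S.state = 'Washington'   -- first constrained dimension
      AND I.color = 'Red'          -- second constrained dimension: the dice
    GROUP BY F.storeID, F.itemID, custID;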

So we'll run this query, and we see that we get an even smaller portion of the data cube: a dice, where we have constrained two dimensions.

So we've seen drilling down and slicing and dicing; now let's take a look at rolling up. Rolling up says that we're looking at our data analysis and we decide that it's too detailed. We want less detail, in other words, more aggregation. To do that, we take attributes out of the group by clause. So let's say this is way, way too much detail, and we just want our sales broken down by item. We take all of the attributes out of our group by clause except for item.
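The essence of the rolled-up query, as a sketch (whether the earlier slicing and dicing constraints are kept isn't shown on screen):

    SELECT itemID, SUM(price)
    FROM Sales
    GROUP BY itemID;   -- only item remains in the group by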

Here we go. When we run the query, we'll see much more summarized data. And here it is, broken down now just by individual items.

So far, our grouping and aggregation has been on the dimension attributes, specifically the ID attributes identifying the tuples in the dimension tables. That may be meaningful for analyses, but for this demonstration it's more fun to look at attributes that actually mean something to us. So, let's try a few queries that group instead based on attributes in the dimension tables. We'll still be using the fact table, and we'll be joining it with dimension tables. Here we're going to start by grouping on the state and the category of the sales, again summing up the total sales for the different groups.
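A sketch of that query over the dimension attributes:

    SELECT state, category, SUM(price)
    FROM Sales F, Store S, Item I
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID
    GROUP BY state, category;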

So, here we see that we have four combinations of state and category, California or Washington with jackets or t-shirts, and then we have the total sales in each of those. We can see that in both states the sales for jackets account for considerably more than the sales for t-shirts, but we also notice that there are a lot more t-shirt sales in California than there are in Washington, and I guess that's not surprising given the weather.

Now let's demonstrate drill-down on this query. Let's say we want a little more information; maybe we want a breakdown by county in addition to state. To do that, we add county to the group by clause and the select clause, and when we run the query we see we do now have more detail.
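The same query drilled down by county, as a sketch:

    SELECT state, county, category, SUM(price)
    FROM Sales F, Store S, Item I
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID
    GROUP BY state, county, category;   -- county added to select and group by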

We can see, for example, that we had more jacket sales in Santa Clara County than in San Mateo County, although the t-shirt sales were approximately equal; actually, a few more t-shirts in San Mateo. This is a little surprising, because Santa Clara is generally warmer than San Mateo, but it's fabricated data.

Now let's say we want to drill down even further, and break our data down by gender as well as the other attributes. In order to drill down based on gender, I first needed to add the customer table to our from clause; prior to this, we weren't doing any analysis that involved any attributes of the customer table. So I need to add the join condition here. And then, to do the drill-down, I add the gender attribute to the group by and to the select.
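A sketch of the query with the customer join added:

    SELECT state, county, category, gender, SUM(price)
    FROM Sales F, Store S, Item I, Customer C
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID
      AND F.custID = C.custID   -- new join condition, needed for gender
    GROUP BY state, county, category, gender;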

We run the query, and what do we find? Well, we can see, for example, that not too many females in San Mateo County are buying t-shirts; most of those t-shirt sales are accounted for by males. The other thing we notice is that we don't have every single combination of county, category, and gender. Specifically, we can see that no males in King County bought any t-shirts. So our result contains only those combinations of dimension values that actually appear in the fact table, or, more precisely, in the join of the fact table with the dimension tables.

Now let's suppose that after I've gotten to this level of analysis, I decide that what might be most interesting is the data broken down just by combinations of state and gender. That would be a roll-up operation, and remember, for roll-up I take attributes out of the group by clause. So I take out the county and category, and I'm just interested in state/gender combinations. By the way, at this point, if I wanted to, I could remove the item table, because I'm no longer constraining or grouping based on items. But I'll just leave it in for now; it's not going to make a difference in our result. Of course, for efficiency I might prefer to take it out.
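The rolled-up query, as a sketch:

    SELECT state, gender, SUM(price)
    FROM Sales F, Store S, Item I, Customer C
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID AND F.custID = C.custID
    GROUP BY state, gender;   -- the Item join is now unnecessary but harmless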

Anyway, let's run the query, and now we see the rolled-up data. When we look at it, I guess our main conclusion here would be that in our database, the Californian males are the big spenders.

So far, everything we've seen has actually just used the regular SQL constructs that we already knew about, although we were demonstrating them over a star schema and showing the types of queries that one tends to run over star schemas in OLAP applications. Now we're going to move to the specific constructs that have been added to SQL for OLAP. As you may remember, the two constructs are called WITH CUBE and WITH ROLLUP, and they are written in the group by clause.

So our first query is the absolute basic cube query. We start with our sales fact table, we group on all three of its dimensions, and then we add WITH CUBE.
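As a sketch, in the WITH CUBE notation being described:

    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales
    GROUP BY storeID, itemID, custID WITH CUBE;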

What we're going to get is, basically, the data in the data cube that we saw in our previous video that introduced the cube.

Let's go ahead and run the query. Disappointingly, we get an error message that this version of MySQL doesn't yet support cube. Actually, no version of MySQL supports cube at this point in time, but we're hopeful that a future one will.

In fact, of the three open-source systems we've been using for our demos, only MySQL supports WITH ROLLUP, so we'll continue with MySQL. And in fact, you can get the same result that you would get if WITH CUBE were supported by using WITH ROLLUP, although it's a little bit contorted. I'll do that now, to demonstrate what you would get if you wrote WITH CUBE.

So here's the query. I'm not going to explain it; if you're particularly interested in it, you can download our script and examine it yourself.
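The downloadable script isn't reproduced here, but one standard way to emulate a three-dimensional cube with MySQL's WITH ROLLUP, sketched under the assumption that the dimension columns contain no nulls of their own, is to union rollups over rotated attribute orders, which together produce all eight groupings:

    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales GROUP BY storeID, itemID, custID WITH ROLLUP   -- (s,i,c),(s,i),(s),()
    UNION
    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales GROUP BY itemID, custID, storeID WITH ROLLUP   -- adds (i,c),(i)
    UNION
    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales GROUP BY custID, storeID, itemID WITH ROLLUP;  -- adds (c,s),(c)

UNION (as opposed to UNION ALL) removes the groupings that the three rollups produce in duplicate.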

Again, what I'm most interested in is looking at the results, and here they are. This result is the result of a WITH CUBE on the grouping and aggregation over all three dimensions of our sales table.

Some of our tuples look very normal. The first, second, and third tuples are all giving us the total price for a combination of store, item, and customer, and these are exactly what we got in our original query that didn't have the WITH CUBE operator. But now we see some tuples that have blanks, and as a reminder, blanks in this interface mean null.

So this tuple is store, item, and null, and what it corresponds to is an element on the face of the cube: the face that doesn't constrain the customer dimension. The value it gives us is the value on the face of the cube, which is the total sales for the combination of store 1 and item 1 with any customer, and that's 70.

We can cross-check that by seeing that the first three tuples add up to 70, because those happen to be the three customers that have made purchases of item 1 at store 1. We can similarly see that for store 1 and item 2 the total sales are 150.

If we scroll down a bit, we'll see cases where we have other single null values; I'll keep going until I find one. Here, for example, we have the null value for the store, and that would be the face of the cube that doesn't constrain the store dimension. So that would be the sum for item 1 and customer 1 at any store, for item 1 and customer 2 at any store, and so on.

Then we also have tuples that have two null values; here's an example. This one corresponds to an edge of the cube: the edge that runs along the store dimension, picking out store 5 along that dimension and then giving the total sales for all items and all customers at store 5, and we see that.

Again, we'll see other instances where we're missing two values; scroll down to get here. We've got one where we're missing the store and the item, so this would be on the edge along the customer dimension. This gives us the total sales, at any store and for any item, made by customer 3.

And then finally, if we find it, we'll find the one with all three null values, and that's right here. That represents the corner of the cube: it gives us the total for all stores, all items, and all customers.

So, what we've done here by adding WITH CUBE (and again, that would be the result if this query were supported) is to get the contents of the data cube: the inner parts of the cube, which are what we would get without adding anything to our group by, and what we saw in one of our very first queries of the demo, plus the data on the faces, the edges, and the corner of the cube.

Now let's do a quick cross-check. This data tells us that the corner of the cube, here, the tuple with all null values, has a total price of 33.50. That should be exactly what we get if we sum up our prices without doing any grouping, so let's give that a try.
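The cross-check, as a sketch:

    SELECT SUM(price) FROM Sales;   -- should equal the all-null corner: 33.50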

We simply take our sales table, we sum up the total prices, we run it, and indeed we get 33.50.

So now we've seen how SQL can be used to effectively create the data cube. Sometimes in OLAP applications it's convenient to query the cube directly. So people frequently create the data cube, say, in SQL, using a materialized view, or just by creating a table that contains the result of the data cube query, which is what we're going to do. There are even some systems that use the data cube as their basic native data model.

So let's go ahead and take the result of our cube query and put it in a table. I've just added "create table cube as" before our query, and the other thing I did was add a name to the result of the aggregation attribute, so that we can refer to it in the data cube.

So p will contain the sum of the price for the items inside the data cube, as well as for the faces, edges, and corner.

So, let's go ahead and create that table, and now let's look at querying the cube directly.

The first thing you'll notice in the next few queries is that we've replaced the sales table with the cube table, so we're going to run our queries directly on the cube. We can join the cube with the store and item tables, and the customer table if we need it, just as we joined the sales table, because it contains the same dimension attributes.

What the cube gives us is pre-aggregated data, both for the store, item, and customer combinations, and, in the tuples with null values, data already aggregated for the faces, edges, and corner of the cube, as we'll see in a moment.

What our query is going to do is find the total sales of blue items in the state of California, and it will start by showing the tuples that are used to get that total. We'll join our cube table with the store and item dimension tables so that we can access the state and the color, and then we'll constrain the state and the color.

Notice the last thing we say is that the customer ID is null. That's going to give us the data on the face of the cube that doesn't go along the customer dimension, which means it's pre-aggregated over all customers. That's what we want, since we don't care about the customers, just about the color and the state.
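A sketch of this query over the cube table:

    SELECT C.*
    FROM Cube C, Store S, Item I
    WHERE C.storeID = S.storeID AND C.itemID = I.itemID
      AND S.state = 'California'
      AND I.color = 'Blue'
      AND C.custID IS NULL;   -- the face pre-aggregated over all customers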

So, let's first run this query, and we see we get six tuples. These tuples are all from the portion of the cube query's result that has a null value for customer ID, covering combinations of stores and items. If we checked our dimension tables, we'd see that these stores are in California and these items are blue, and these are our total sales broken down by store and item.

Finally, to get the overall total sales, we just sum up our column p, which, remember, was the sum of price in the data cube. So we replace C.* with SUM(p), we run the query, and we see that our total is 13.65.

Now let me show you something that may seem non-intuitive at first, but if you think about it, or maybe run the queries yourself, you'll start to understand exactly what's happening. I'm going to go back to C.* in the select clause. Incidentally, I didn't mention it before, but C.* gives us in the select clause all the attributes from the cube table, and that shows which tuples are being used for the aggregation. So, before, we had six tuples being used.

Now, I'm going to make one tiny change to the query: instead of finding the customer IDs that are null, I'm going to find the customer IDs that are not null. I run that query, and I see that we have fourteen tuples now.

What we've actually done is move away from the face of the cube and into the cube. I've said: don't give me any of the summarized data, just give me the original values for each combination of store, item, and customer. So I've taken away the summaries, but this should actually still give me the same answer. Let's make sure that's the case.

I do SUM(p), I run it, and I get 13.65. So, what happened here is that I used what's effectively the same data, but broken down by customer, and I added it up. That was a less efficient query, but both of those queries give the correct result: the total sales of blue items in California stores. Again, I encourage you to download these and run them yourself to figure out exactly what's going on.

As a last demonstration of this very same query, we can actually go back to the sales table. Let's say we don't even have our data cube. We're going to use our sales table instead, and it should give us the same result, but even less efficiently than the last query that we did.

You can see here that I've changed the cube table to the sales table and all the C-dots to F-dots. I took away the constraints on being null or not null, since those aren't relevant in the sales table. Otherwise, the query looks pretty much the same.
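As a sketch:

    SELECT F.*
    FROM Sales F, Store S, Item I
    WHERE F.storeID = S.storeID AND F.itemID = I.itemID
      AND S.state = 'California'
      AND I.color = 'Blue';   -- no null test: every fact tuple is detail data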

So, let's run this much and see how many tuples we're operating on now. Now we're operating on 25 tuples; this is the original, completely broken-down data.

Just as a reminder: in the data cube, even in the innermost cells, we're summarizing, that is, summing, for each combination of store, item, and customer. So if we have three tuples with the same store, item, and customer, those will be broken out in the sales table but summarized in the cube, even in the least summarized portion of the cube, if you get what I'm saying.

So now we've got our 25 tuples, and if we sum up the price of those twenty-five tuples, we should get that same result, $13.65, and we do.

So, we saw that we had six tuples in the most efficient query, over the data cube. When we used just the center portion of the cube, we had fourteen (or something; I actually don't remember the exact number), and then 25. All of them gave us the same answer.

Obviously, these numbers are very small, but if you think about huge data sets with millions or tens of millions of rows, then we're talking about dramatic differences in how much data needs to be summarized or aggregated in order to get the answer. So, using the summary data from the cube can be orders of magnitude faster than going to the original fact data.

Now, a variation on WITH CUBE is to give specific attributes with the cube operator itself. What that says is to give summaries only for those dimension attributes and not the others. In other words, we'll only see null values in the result for the store ID and customer ID, and we won't see any null values for the item ID.
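The exact syntax isn't shown on screen; in standard SQL notation, a partial cube over just storeID and custID could be sketched like this:

    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales
    GROUP BY itemID, CUBE (storeID, custID);   -- itemID is never nulled out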

So, in some sense, we're materializing just some of the faces, edges, and corner of the cube, not all of them. If we run this query, we'll get the same error saying MySQL doesn't yet support it, but this is the SQL standard for the cube operator with subsets of attributes.

As before, I've cooked up an equivalent query using MySQL's WITH ROLLUP, and I'm certainly not going to explain it, but you're welcome to download it and examine it. What I'd really like to look at is the result, and here it is.

This looks a lot like our original data cube, but what we notice is that there are no null values in the item column. So, it's a subset of the result of the cube query, and we would use it when we know we're never going to be rolling up on items, that is, when the summaries based on that dimension aren't going to be useful to us.

Now, let's look at WITH ROLLUP, which is supported natively by MySQL. WITH ROLLUP again gives us a subset of the data cube, but it's a very specific one, and it's based on the ordering of the attributes in the group by clause.
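A sketch of the rollup query, which MySQL runs natively:

    SELECT storeID, itemID, custID, SUM(price)
    FROM Sales
    GROUP BY storeID, itemID, custID WITH ROLLUP;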

So, let me just go ahead and run it and explain what we get. We again get null values, and those null values indicate summaries. For example, the tuple with store 1, item 1, and a null customer gives 70: the sum of prices over all customers for store 1 and item 1. And we see store 1 with item 2 as well.

We again see cases with two null values; this one is the summary for store 1 over all items and customers, the total sales. And we also have the triple null somewhere; it's at the bottom this time, with total sales of $33.50.

But what we'll see is that we don't have all combinations of null values. We have a null value for customer ID, or nulls for the combination of customer ID and item ID, or all three nulls. So we only null out the rightmost attribute, or the two rightmost attributes, or all of them; if we had a fourth dimension, we'd see the rightmost, the two rightmost, and the three rightmost.

This doesn't really make any particular sense for this query, but it's useful when we have dimensions that are hierarchical, and I actually introduced a hierarchical dimension in our original schema just for this purpose.

So let's turn to that query. Here's a query that's grouping by state, county, and city. These are three dimension attributes, all from the store dimension table, and they are hierarchical, meaning that we don't have every combination of state, county, and city: cities are in specific counties, and counties are in specific states.

When we look at the results of the query, we see, of course, that San Mateo County always appears with the state of California, and King County always with the state of Washington. We don't have Palo Alto combined with, say, King County and the state of California. So we don't have all combinations of the three; we have a hierarchical structure on them, and it's this structure that makes a roll-up sensible. So, let's add WITH ROLLUP to this query and see what we get.
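A sketch of the hierarchical query with the rollup added:

    SELECT state, county, city, SUM(price)
    FROM Sales F, Store S
    WHERE F.storeID = S.storeID
    GROUP BY state, county, city WITH ROLLUP;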

Here in our result we see our original tuples for each state, county, city combination. Then we see, for example, this tuple here, which takes the state of California and the county of San Mateo and adds up all of the sales there; similarly, we have all of the sales for Santa Clara County.

Then we can see that we have all of the sales for each state, the California sales and the Washington sales. And finally, the triple null is the corner of the cube: it's all of our sales, and, as usual, we get the total of 33.50.

Now, what don't we see here, compared with the data cube? Well, we don't see, for example, a tuple that has California and Palo Alto with the county as null.

Why is that? Well, Palo Alto is always in Santa Clara County. So rolling up the county, that is, asking for California and Palo Alto sales over every county, is exactly the same as asking for California and Palo Alto sales in Santa Clara County.

We also don't see, for example, the state and county both being null, say Seattle as the city with the state and county null. Well, Seattle is always in King County and always in Washington, so we're not aggregating anything there; we'd get the exact same result as the tuple with Seattle, King County, and Washington.

So if we ran WITH CUBE on this hierarchically structured data, we'd actually not get any more information. We'd have a bigger result, but it wouldn't be giving us more information; it would just be less efficient for getting the same data.

So that concludes our OLAP demonstration. We saw a star schema, and we saw plain SQL queries over that schema. We saw the concepts of drilling down and rolling up, and also slicing and dicing.

We introduced the WITH CUBE extension to SQL, which is not yet implemented in MySQL, but we were able to write a query that's equivalent to WITH CUBE. We also saw putting a WITH CUBE query's result into a table and then querying that table directly, which can be much more efficient than running the equivalent query in SQL directly over the fact table. We also saw WITH ROLLUP, which is implemented. We didn't demonstrate putting the result of WITH ROLLUP in a table, but we could certainly do that too.

All of these features are useful primarily in applications that perform analyses over very large data sets that exhibit this dimensional type of structure, but that is actually quite a common structure in analysis applications.
