2008年(31)
分类: Mysql/postgreSQL
2008-04-22 13:23:55
第8章:定期装载
本章包含定期装载。和初始化装载不同,初始化装载你只需要在开始数据仓库运作前执行一次,定期装载需要你安排一个定期装载计划来定期的载入源数据。
在本章,我将向你展示,通过运行一个脚本在我们的dw数据库中完成定期装载之前,如何准备数据。
鉴别数据源和装载类型
部署一个定期装载计划的第一个步骤是为数据仓库中的每个事实表和维度表鉴别哪些源数据是必须的以及可用的。然后,你才决定适合该装载的提取模式和载入类型。表8-1所示的范例文档概括了这些信息。
表8-1:定期装载的数据源和载入类型
Source Data 源数据 |
Data Warehouse Table 数据仓库表 |
Extraction Mode 提取模式 |
Loading Type 载入类型 |
Customer |
customer_dim |
Whole, Pull |
SCD2 on address SCD1 on name |
Product |
product_dim |
Whole, Pull |
SCD2 |
Sales order Transaction |
order_dim |
CDC (daily), Pull |
Unique order number |
sales_order_fact |
CDC (daily), Pull |
Daily sales orders | |
n/a |
date_dim |
n/a |
Pre-populate |
注你已经在第5章“源数据提取”学习了提取模式,以及在第2章“维度历史”学习了SCD2。
另外一个源数据的特性可能影响你设计,即当一个特殊的数据可以用来定期载入时的时间窗问题,这点对于事务型的源数据特别重要,因为这种源数据量通常是巨大的,比如销售订单。
另外,你需要知道每个源数据的细节特征,比如文件类型,数据结构,甚至到个别的字段。
定期装载脚本\
我将用列表8-1所示的脚本来解释如何使定期装载起作用。你可以用它以每天为周期来装载数据仓库。抽取模式和装载类型如下所示:
n customer.csv 和product.txt的固定格式文件分别通过customer_stg 和 product_stg 表为中间表载入到customer_dim和 product_dim 表中。载入是通过mysql的LOAD DATA INFILE工具完成的。
n SCD2被应用到客户地址,产品名称,以及产品组中。SCD1应用到客户名称中。
n 只有开始于当前时间的销售订单记录被载入到order_dim 和 sales_order_fact 表中。
列表8-1:每日的dw定期装载
/*****************************************************************/
/* */
/* dw_regular.sql */
/* */
/*****************************************************************/
USE dw;
/* customer dimension loading */
TRUNCATE customer_stg
;
LOAD DATA INFILE 'customer.csv'
INTO TABLE customer_stg
FIELDS TERMINATED BY ', '
OPTIONALLY ENCLOSED BY ' " '
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
( customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state)
;
/* SCD2 on customer street addresses */
/* first, expire the existing customers */
UPDATE
customer_dim a
, customer_stg b
SET
expiry_date = SUBDATE (CURRENT_DATE, 1)
WHERE
a.customer_number = b.customer_number
AND a.customer_street_address <> b.customer_street_address
AND expiry_date = '9999–12–31'
;
/* then, add a new row for the customer */
INSERT INTO customer_dim
SELECT
NULL
, b.customer_number
, b.customer_name
, b.customer_street_address
, b.customer_zip_code
, b.customer_city
, b.customer_state
, CURRENT_DATE
, '9999–12–31'
FROM
customer_dim a
, customer_stg b
WHERE
a.customer_number = b.customer_number
AND (a.customer_street_address <> b.customer_street_address)
AND EXISTS(
SELECT *
FROM customer_dim x
WHERE
b.customer_number=x.customer_number
AND a.expiry_date=SUBDATE (CURRENT_DATE, 1))
AND NOT EXISTS (
SELECT *
FROM customer_dim y
WHERE
b.customer_number = y. customer_number
AND y.expiry_date = '9999–12–31')
;
/* SCD1 on customer name */
UPDATE customer_dim a, customer_stg b
SET a.customer_name = b.customer_name
WHERE a.customer_number = b.customer_number
AND a.customer name <> b.customer name
;
/* add new customer */
INSERT INTO customer_dim
SELECT
NULL
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, CURRENT_DATE
,'9999–12–31'
FROM customer_stg
WHERE customer_number NOT IN(
SELECT y.customer_number
FROM customer_dim x, customer_stg y
WHERE x.customer_number = y.customer_number)
;
/* product dimension loading */
TRUNCATE product_stg
;
LOAD DATA INFILE 'product.txt'
INTO TABLE product_stg
FIELDS TERMINATED BY ' '
OPTIONALLY ENCLOSED BY ' '
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
( product_code
, product_name
, product_category)
;
/* SCD2 on product name and group */
/* first, expire the existing product */
UPDATE
product_dim a
, product_stg b
SET
expiry_date = SUBDATE (CURRENT_DATE, 1)
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND expiry_date = '9999–12–31'
;
/* then, add a new row for the product */
INSERT INTO product_dim
SELECT
NULL
, b.product_code
, b.product_name
, b.product_category
, CURRENT_DATE
,'9999–12–31'
FROM
product_dim a
, product_stg b
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND EXISTS(
SELECT *
FROM product_dim x
WHERE b.product_code = x.product_code
AND a.expiry_date = SUBDATE (CURRENT_DATE, 1))
AND NOT EXISTS (
SELECT *
FROM product_dim y
WHERE b.product_code = y.product_code
AND y.expiry_date = '9999–12–31')
;
/* add new product */
INSERT INTO product_dim
SELECT
NULL
, product_code
, product_name
, product_category
, CURRENT_DATE
, '9999–12–31'
FROM product_stg
WHERE product_code NOT IN(
SELECT y.product_code
FROM product_dim x, product_stg y
WHERE x.product_code = y.product_code)
;
/* end of product_dim loading */
INSERT INTO order_dim (
order_sk
, order_number
, effective_date
, expiry_date)
SELECT
NULL
, order_number
, order_date
, '9999–12–31'
FROM source.sales_order
WHERE entry_date = CURRENT_DATE
;
INSERT INTO sales_order_fact
SELECT
order_sk
, customer_sk
, product_sk
, date_sk
, order_amount
FROM
source.sales_order a
, order_dim b
, customer_dim c
, product_dim d
, date_dim e
WHERE
a.order_number = b.order_number
AND a.customer_number = c.customer_number
AND a.order_date >= c.effective_date
AND a.order_date <= c.expiry_date
AND a.product_code = d.product_code
AND a.order_date >= d.effective_date
AND a.order_date <= d.expiry_date
AND a.order_date = e.date
AND a.entry_date = CURRENT_DATE
;
/* end of script */