Chinaunix首页 | 论坛 | 博客
  • 博客访问: 226176
  • 博文数量: 31
  • 博客积分: 1427
  • 博客等级: 上尉
  • 技术积分: 409
  • 用 户 组: 普通用户
  • 注册时间: 2006-04-04 15:21
文章分类

全部博文(31)

文章存档

2008年(31)

我的朋友

分类: Mysql/postgreSQL

2008-04-22 13:23:55

8章:定期装载

       本章包含定期装载。和初始化装载不同,初始化装载你只需要在开始数据仓库运作前执行一次,定期装载需要你安排一个定期装载计划来定期的载入源数据。

       在本章,我将向你展示,通过运行一个脚本在我们的dw数据库中完成定期装载之前,如何准备数据。

 

鉴别数据源和装载类型

       部署一个定期装载计划的第一个步骤是为数据仓库中的每个事实表和维度表鉴别哪些源数据是必须的以及可用的。然后,你才决定适合该装载的提取模式和载入类型。表8-1所示的范例文档概括了这些信息。

8-1:定期装载的数据源和载入类型

Source Data

源数据

Data Warehouse Table

数据仓库表

Extraction Mode

提取模式

Loading Type

载入类型

Customer

customer_dim

Whole, Pull

SCD2 on address

SCD1 on name

Product

product_dim

Whole, Pull

SCD2

Sales order Transaction

order_dim

CDC (daily), Pull

Unique order number

sales_order_fact

CDC (daily), Pull

Daily sales orders

n/a

date_dim

n/a

Pre-populate

你已经在第5章“源数据提取”学习了提取模式,以及在第2章“维度历史”学习了SCD2

       另外一个源数据的特性可能影响你设计,即当一个特殊的数据可以用来定期载入时的时间窗问题,这点对于事务型的源数据特别重要,因为这种源数据量通常是巨大的,比如销售订单。

       另外,你需要知道每个源数据的细节特征,比如文件类型,数据结构,甚至到个别的字段。

 

定期装载脚本\

       我将用列表8-1所示的脚本来解释如何使定期装载起作用。你可以用它以每天为周期来装载数据仓库。抽取模式和装载类型如下所示:

n         customer.csv product.txt的固定格式文件分别通过customer_stg product_stg 表为中间表载入到customer_dim product_dim 表中。载入是通过mysqlLOAD DATA INFILE工具完成的。

n         SCD2被应用到客户地址,产品名称,以及产品组中。SCD1应用到客户名称中。

n        只有开始于当前时间的销售订单记录被载入到order_dim sales_order_fact 表中。

列表8-1:每日的dw定期装载

/*****************************************************************/

/*                                                               */

/* dw_regular.sql                                                */

/*                                                               */

/*****************************************************************/

 

USE dw;

 

/* customer dimension loading                                    */

 

TRUNCATE customer_stg

;

 

LOAD DATA INFILE 'customer.csv'

INTO TABLE customer_stg

FIELDS TERMINATED BY ', '

OPTIONALLY ENCLOSED BY ' " '

LINES TERMINATED BY '\r\n'

IGNORE 1 LINES

( customer_number

, customer_name

, customer_street_address

, customer_zip_code

, customer_city

, customer_state)

;

 

/* SCD2 on customer street addresses                             */

 

/* first, expire the existing customers                          */

 

UPDATE

  customer_dim a

, customer_stg b

SET

  expiry_date = SUBDATE (CURRENT_DATE, 1)

WHERE

    a.customer_number = b.customer_number

AND a.customer_street_address <> b.customer_street_address

AND expiry_date = '9999–12–31'

;

 

/* then, add a new row for the customer                          */

 

INSERT INTO customer_dim

SELECT

  NULL

, b.customer_number

, b.customer_name

, b.customer_street_address

, b.customer_zip_code

, b.customer_city

, b.customer_state

, CURRENT_DATE

, '9999–12–31'

FROM

  customer_dim a

, customer_stg b

WHERE

    a.customer_number = b.customer_number

AND (a.customer_street_address <> b.customer_street_address)

AND EXISTS(

SELECT *

FROM customer_dim x

WHERE

    b.customer_number=x.customer_number

AND a.expiry_date=SUBDATE (CURRENT_DATE, 1))

AND NOT EXISTS (

SELECT *

FROM customer_dim y

WHERE

    b.customer_number = y. customer_number

AND y.expiry_date = '9999–12–31')

;

 

/* SCD1 on customer name                                              */

 

UPDATE customer_dim a, customer_stg b

SET a.customer_name = b.customer_name

WHERE a.customer_number = b.customer_number

      AND a.customer name <> b.customer name

;

 

/* add new customer                                                   */

 

INSERT INTO customer_dim

SELECT

  NULL

, customer_number

, customer_name

, customer_street_address

, customer_zip_code

, customer_city

, customer_state

, CURRENT_DATE

,'9999–12–31'

FROM customer_stg

WHERE customer_number NOT IN(

SELECT y.customer_number

FROM customer_dim x, customer_stg y

WHERE x.customer_number = y.customer_number)

;

 

/* product dimension loading                                           */

 

TRUNCATE product_stg

;

 

LOAD DATA INFILE 'product.txt'

INTO TABLE product_stg

FIELDS TERMINATED BY ' '

OPTIONALLY ENCLOSED BY ' '

LINES TERMINATED BY '\r\n'

IGNORE 1 LINES

( product_code

, product_name

, product_category)

;

 

/* SCD2 on product name and group                                     */

 

/* first, expire the existing product                                 */

 

UPDATE

  product_dim a

, product_stg b

SET

  expiry_date = SUBDATE (CURRENT_DATE, 1)

WHERE

    a.product_code = b.product_code

AND (   a.product_name <> b.product_name

     OR a.product_category <> b.product_category)

AND expiry_date = '9999–12–31'

;

 

/* then, add a new row for the product                                */

 

INSERT INTO product_dim

SELECT

  NULL

, b.product_code

, b.product_name

, b.product_category

, CURRENT_DATE

,'9999–12–31'

FROM

  product_dim a

, product_stg b

WHERE

    a.product_code = b.product_code

AND (   a.product_name <> b.product_name

     OR a.product_category <> b.product_category)

AND EXISTS(

SELECT *

FROM product_dim x

WHERE     b.product_code = x.product_code

      AND a.expiry_date = SUBDATE (CURRENT_DATE, 1))

AND NOT EXISTS (

SELECT *

FROM product_dim y

WHERE     b.product_code = y.product_code

      AND y.expiry_date = '9999–12–31')

;

 

/* add new product                                                    */

 

INSERT INTO product_dim

SELECT

  NULL

, product_code

, product_name

, product_category

, CURRENT_DATE

, '9999–12–31'

FROM product_stg

WHERE product_code NOT IN(

SELECT y.product_code

FROM product_dim x, product_stg y

WHERE x.product_code = y.product_code)

;

 

/* end of product_dim loading                                         */

 

INSERT INTO order_dim (

  order_sk

, order_number

, effective_date

, expiry_date)

SELECT

  NULL

, order_number

, order_date

, '9999–12–31'

FROM source.sales_order

WHERE entry_date = CURRENT_DATE

;

INSERT INTO sales_order_fact

SELECT

  order_sk

, customer_sk

, product_sk

, date_sk

, order_amount

FROM

  source.sales_order a

, order_dim b

, customer_dim c

, product_dim d

, date_dim e

WHERE

    a.order_number = b.order_number

AND a.customer_number = c.customer_number

AND a.order_date >= c.effective_date

AND a.order_date <= c.expiry_date

AND a.product_code = d.product_code

AND a.order_date >= d.effective_date

AND a.order_date <= d.expiry_date

AND a.order_date = e.date

AND a.entry_date = CURRENT_DATE

;

 

/* end of script                  */

阅读(1518) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~