Saturday, November 24, 2007
The Evolution of the Date warehouse
An Introduction to the Data Warehouse
--------------------------------------------------------------------------------
Introduction
Data warehousing has quickly evolved into the center of the architecture for information systems. Many companies have adopted data warehouse technology in response to an ever-shifting competitive landscape. There are numerous examples of successful data warehouses developed and deployed for businesses of various sizes and types. Hardware, software, and consulting vendors have quickly developed numerous new products and services that specifically target this rapidly expanding market segment. Also, trade magazines are filled with articles on various aspects of data warehousing.
What is a data warehouse then? W. H. Inmon, who is known as the "father of the data warehouse", defines the data warehouse as "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decisions."(http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/). This definition will be explained in detail below. But let us first briefly review the historical management of the analysis data and the impacts on the evolution of data warehouses.
Development of data warehousing
Historically system development had primarily emphasized on operational systems -- information systems that support the day-to-day mission-critical operations of the enterprise -- and the data they process. Another part of I/S strategy has always been archiving the data that the operational systems have processed, and strategic reporting of such warehoused information to facilitate planning, forecasting and managing the organization. Most of business data for large corporations had been stored on expensive mainframes -- the so-called legacy systems -- or archived to tapes to satisfy some of the need for strategic analysis. I/S provided access to that data through programs that generated reports and extracts, usually overnight or over weekends so that business analysis would not interfere with and degrade the performance of operational systems. However, the lead-time required to develop a strategic reporting application and enhance it as requirements changed frequently turned out to be longer than affordable by the end users. They needed to react quickly to changing business circumstances.
This situation has been changed during the past decade with the proliferation of personal computers on business desktops. Users became no longer limited to dumb terminals linked to untouchable mainframes. To circumvent the I/S bottleneck, they began to build their own specific applications with PC-based databases and spreadsheets. But this model for business analysis has the drawback of leaving the data fragmented and oriented towards very specific needs. Control, capacity, and integrity issues often got out of hand. The extracts were unable to address the needs of multiple users and uses. The numerous "data islands" exacerbated the data access crisis suffered by many enterprises.
Two other analysis systems -- decision support systems and executive information systems -- may be viewed as the closest precursors to data warehousing systems. Both have data in descriptive standard business terms and generally preprocessed with the application of standard business rules, and both provide consolidated views of the data. Decision support systems are targeted towards lower to mid-level managers, and tend to focus more on detail than executive information systems. However, their designs are also derived from specific requirements, rather than the overall business structure, and the cost and coordination required for their development adversely affected their popularity.
Many factors have influenced the fast evolution of data warehousing. On the one hand, enormous advances in the hardware and software technologies have enabled the retention and quick analysis of extensive information and business knowledge. From the hardware side, data warehouses are exploiting the incredibly powerful processors and higher capacity memory chips now available at very low prices, sophisticated yet cost-efficient processor hardware architecture, the high-speed Computer Bus and controller interfaces, and the vastly shrunk disk drive. The skyrocketing power of server and desktop software, along with the availability of affordable and easy-to-use reporting and analysis tools have also played important roles in the development of data warehouses. What is more, the explosion of Intranets and Web based applications with the open Internet standards has greatly impacted data warehousing as well. On the other hand, there are some business and organizational forces at work. "Business process reengineering" and "downsizing" following the economic downturn of the late eighties have forced enterprises to reevaluate their business practices, and to identify and focus on their core competency areas. These phenomena and the globalization of business all have increased the need for continuous analysis and centralized management of data. Moreover, the emergence of standard yet flexible business applications, the existence of more technology savvy end users, and the heightened awareness of trends in information technology among mid and upper management levels all have significant bearing on the development and deployment of data warehouses.
Though W. H. Inmon started writing about the concept of the data warehouse as early as in 1981, the first data warehouse model was the Information Warehouse framework announced by IBM in September 1991. By mid-1994, numerous data warehouse products came on the market. According to META Group of Westport, Connecticut, by early 1996, over 90% of large companies had implemented or were planning to develop data warehouses.
Return to top
Key data warehousing attributes
Having looked at the data warehouse from a historical perspective, now we will turn to discuss its defining features contrasted with the attributes of operational applications.
To begin with, the data warehouse is oriented around the major subjects of the enterprise, e.g., customer, vendor, product and productivity. Hence it focuses on data modeling and database design exclusively, and it excludes data that is not useful for DSS processing. In contrast, the operational applications are designed around processes and functions, e.g., loans, savings, bank card and trust for a financial institution. Consequently, they are concerned both with database design and process design, and they contain data that satisfies immediate functional/processing requirements. A further ramification of the difference in orientation between the data warehouse and operational applications is their difference in the relationships of data. While data warehouse data spans a spectrum of time, maintains many relationships, and represents many static business rules (and correspondingly, many data relationships) between two or more tables, operational data maintains an ongoing relationship between two or more tables based on a business rule that is in effect.
Moreover, data contained within the boundaries of the warehouse is integrated, that is, stored in a singular, globally acceptable fashion, although the underlying operational systems may store the data in various different ways. Data warehouse systems prove most successful when data can be combined from multiple source applications, when all sorts of data inconsistencies have to be effectively addressed. "Data scrubbing" or "data staging" enables the DSS analyst to focus on using the data that is in the warehouse, without having to wonder about its credibility or consistency. The integration of data is manifested in many ways -- in consistent naming conventions, in consistent measurement of variables, in consistent encoding structures, in consistent physical definition of attributes, and so on.
Further, all data in the data warehouse is "time variant", i.e., accurate as of some moment in time, whereas in the operational environment data is accurate as of the moment of access. Thus, the time horizon represented for the data warehouse is much longer (which can involve years) than that for the operational environment (which ranges from the current values of today to ninety days). Every key structure in the data warehouse contains an element of time either implicitly or explicitly.
Lastly, the data warehouse is nonvolatile. Data warehouse data is a long series of snapshots, and cannot be updated once correctly recorded, while record-to-record real-time updates -- inserts, deletes, and changes -- are done regularly to the operational environment. That is, once data is loaded into the warehouse from the application-oriented operational environment (and/or external sources), it does not change, but is merely accessed there. Therefore, there is no need to be cautious of the update anomaly, an important factor to consider in operational application systems; nor does data warehousing require the complex technologies supporting backup and recovery, transaction and data integrity, and the detection and remedy of deadlock. Data "updating" in the data warehousing environment consists of periodic mass loading of data from the operational environment. The simplicity of data management and the much less rigid response time requirements allow data warehouse designers to take liberties in optimizing the access of data. De-normalization of the physical data model is conducted to enhance performance and simplicity, which are more prominent for data warehouse operations because the amount of data involved is typically very large.
Return to top
An architectural perspective on data warehousing
Inmon identifies five types of data in a data warehouse: older detail data, current detail data, lightly summarized data, highly summarized data, and meat data. Both older detail data and current detail data are stored at the lowest level of granularity, but the former is more voluminous and less frequently accessed than the latter. Thus, current detail data is usually stored on disk storage, whereas older detail data on some form of mass storage, e.g., magnetic tape, photo optical storage, RAID, micro fiche. Derived from the current detail data, lightly summarized data is the summarization of some attributes over some units of time. Like current detail data, it is almost always stored on disk storage. The highly summarized data is distilled from the lightly summarized data, and is compact, widely accessible, and most frequently used. Meta data is the data about data within the enterprise. According to Inmon, it contains (at minimum) the following three components:
the structure of the data, which serves as a directory of the contents of the data warehouse
the algorithms used for summarization
the mapping of data from the original sources to the data warehouse
Inmon also specifies a predictable flow of data within the data warehouse. As data is loaded into the data warehouse from the operational environment or some external source, it is transformed and then goes into the current detail level, where it is used in the calculation of the lightly summarized data and the highly summarized data or in some ad hoc queries. Then it is either moved to the older detail level through the aging process or simply purged.
Ken Orr outlines a Data Warehouse Architecture (DWA), broadly defined as "a way of representing the overall structure of data, communication, processing and presentation that exists for end user computing within the enterprise"(http://www.kenorrinst.com/dwpaper.html). It is comprised of the following interconnected parts:
Operational Data Base/External Data Base Layer
Information Access Layer, including the tools for manipulating, analyzing and presenting data
Data Access Layer, which is "responsible for interfacing between Information Access tools and Opeartional Data Bases"
Data Directory (also called the "Metadata") Layer
Process Management Layer, "the scheduler or the high level job control for" "the various tasks that must be accomplished to build and maintain the data warehouse and data directory information"
Application Messaging (also known as the "Middleware") Layer, involving the enterprise-wide transportation of information
Data Warehouse (Physical) Layer, where copies of operational and/or external data are actually stored
Data Staging (also called "Copy Management" or "Replication Management") Layer, including "all of the processes necessary to select, edit, summarize, combine and load data warehouse and information access data from operational and/or external data bases"
Return to top
Data warehouse products and services
The following are the links to data warehouse frameworks from the leading vendors:
HP's OpenWarehouse
Sybase's Data Warehouse Products
IBM's Information Warehouse
Software AG's Data Warehouse Intelligon
Data Warehousing with Informix
NCR Corp.'s (the former AT&T Global Information Solutions') Scalable Data Warehousing
Prism's Data Warehousing and Marts
Red Brick's PaVER Gold Program
the SAS Warehouse Solution
Siemens Pyramid's Smart Warehouse
Oracle's Data Warehousing
Links to vendors of various tools that can be used to build data warehouses and sites of providers of various services can be found at Larry Greenfield's "Data Warehousing Information Center
--------------------------------------------------------------------------------
Introduction
Data warehousing has quickly evolved into the center of the architecture for information systems. Many companies have adopted data warehouse technology in response to an ever-shifting competitive landscape. There are numerous examples of successful data warehouses developed and deployed for businesses of various sizes and types. Hardware, software, and consulting vendors have quickly developed numerous new products and services that specifically target this rapidly expanding market segment. Also, trade magazines are filled with articles on various aspects of data warehousing.
What is a data warehouse then? W. H. Inmon, who is known as the "father of the data warehouse", defines the data warehouse as "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decisions."(http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/). This definition will be explained in detail below. But let us first briefly review the historical management of the analysis data and the impacts on the evolution of data warehouses.
Development of data warehousing
Historically system development had primarily emphasized on operational systems -- information systems that support the day-to-day mission-critical operations of the enterprise -- and the data they process. Another part of I/S strategy has always been archiving the data that the operational systems have processed, and strategic reporting of such warehoused information to facilitate planning, forecasting and managing the organization. Most of business data for large corporations had been stored on expensive mainframes -- the so-called legacy systems -- or archived to tapes to satisfy some of the need for strategic analysis. I/S provided access to that data through programs that generated reports and extracts, usually overnight or over weekends so that business analysis would not interfere with and degrade the performance of operational systems. However, the lead-time required to develop a strategic reporting application and enhance it as requirements changed frequently turned out to be longer than affordable by the end users. They needed to react quickly to changing business circumstances.
This situation has been changed during the past decade with the proliferation of personal computers on business desktops. Users became no longer limited to dumb terminals linked to untouchable mainframes. To circumvent the I/S bottleneck, they began to build their own specific applications with PC-based databases and spreadsheets. But this model for business analysis has the drawback of leaving the data fragmented and oriented towards very specific needs. Control, capacity, and integrity issues often got out of hand. The extracts were unable to address the needs of multiple users and uses. The numerous "data islands" exacerbated the data access crisis suffered by many enterprises.
Two other analysis systems -- decision support systems and executive information systems -- may be viewed as the closest precursors to data warehousing systems. Both have data in descriptive standard business terms and generally preprocessed with the application of standard business rules, and both provide consolidated views of the data. Decision support systems are targeted towards lower to mid-level managers, and tend to focus more on detail than executive information systems. However, their designs are also derived from specific requirements, rather than the overall business structure, and the cost and coordination required for their development adversely affected their popularity.
Many factors have influenced the fast evolution of data warehousing. On the one hand, enormous advances in the hardware and software technologies have enabled the retention and quick analysis of extensive information and business knowledge. From the hardware side, data warehouses are exploiting the incredibly powerful processors and higher capacity memory chips now available at very low prices, sophisticated yet cost-efficient processor hardware architecture, the high-speed Computer Bus and controller interfaces, and the vastly shrunk disk drive. The skyrocketing power of server and desktop software, along with the availability of affordable and easy-to-use reporting and analysis tools have also played important roles in the development of data warehouses. What is more, the explosion of Intranets and Web based applications with the open Internet standards has greatly impacted data warehousing as well. On the other hand, there are some business and organizational forces at work. "Business process reengineering" and "downsizing" following the economic downturn of the late eighties have forced enterprises to reevaluate their business practices, and to identify and focus on their core competency areas. These phenomena and the globalization of business all have increased the need for continuous analysis and centralized management of data. Moreover, the emergence of standard yet flexible business applications, the existence of more technology savvy end users, and the heightened awareness of trends in information technology among mid and upper management levels all have significant bearing on the development and deployment of data warehouses.
Though W. H. Inmon started writing about the concept of the data warehouse as early as in 1981, the first data warehouse model was the Information Warehouse framework announced by IBM in September 1991. By mid-1994, numerous data warehouse products came on the market. According to META Group of Westport, Connecticut, by early 1996, over 90% of large companies had implemented or were planning to develop data warehouses.
Return to top
Key data warehousing attributes
Having looked at the data warehouse from a historical perspective, now we will turn to discuss its defining features contrasted with the attributes of operational applications.
To begin with, the data warehouse is oriented around the major subjects of the enterprise, e.g., customer, vendor, product and productivity. Hence it focuses on data modeling and database design exclusively, and it excludes data that is not useful for DSS processing. In contrast, the operational applications are designed around processes and functions, e.g., loans, savings, bank card and trust for a financial institution. Consequently, they are concerned both with database design and process design, and they contain data that satisfies immediate functional/processing requirements. A further ramification of the difference in orientation between the data warehouse and operational applications is their difference in the relationships of data. While data warehouse data spans a spectrum of time, maintains many relationships, and represents many static business rules (and correspondingly, many data relationships) between two or more tables, operational data maintains an ongoing relationship between two or more tables based on a business rule that is in effect.
Moreover, data contained within the boundaries of the warehouse is integrated, that is, stored in a singular, globally acceptable fashion, although the underlying operational systems may store the data in various different ways. Data warehouse systems prove most successful when data can be combined from multiple source applications, when all sorts of data inconsistencies have to be effectively addressed. "Data scrubbing" or "data staging" enables the DSS analyst to focus on using the data that is in the warehouse, without having to wonder about its credibility or consistency. The integration of data is manifested in many ways -- in consistent naming conventions, in consistent measurement of variables, in consistent encoding structures, in consistent physical definition of attributes, and so on.
Further, all data in the data warehouse is "time variant", i.e., accurate as of some moment in time, whereas in the operational environment data is accurate as of the moment of access. Thus, the time horizon represented for the data warehouse is much longer (which can involve years) than that for the operational environment (which ranges from the current values of today to ninety days). Every key structure in the data warehouse contains an element of time either implicitly or explicitly.
Lastly, the data warehouse is nonvolatile. Data warehouse data is a long series of snapshots, and cannot be updated once correctly recorded, while record-to-record real-time updates -- inserts, deletes, and changes -- are done regularly to the operational environment. That is, once data is loaded into the warehouse from the application-oriented operational environment (and/or external sources), it does not change, but is merely accessed there. Therefore, there is no need to be cautious of the update anomaly, an important factor to consider in operational application systems; nor does data warehousing require the complex technologies supporting backup and recovery, transaction and data integrity, and the detection and remedy of deadlock. Data "updating" in the data warehousing environment consists of periodic mass loading of data from the operational environment. The simplicity of data management and the much less rigid response time requirements allow data warehouse designers to take liberties in optimizing the access of data. De-normalization of the physical data model is conducted to enhance performance and simplicity, which are more prominent for data warehouse operations because the amount of data involved is typically very large.
Return to top
An architectural perspective on data warehousing
Inmon identifies five types of data in a data warehouse: older detail data, current detail data, lightly summarized data, highly summarized data, and meat data. Both older detail data and current detail data are stored at the lowest level of granularity, but the former is more voluminous and less frequently accessed than the latter. Thus, current detail data is usually stored on disk storage, whereas older detail data on some form of mass storage, e.g., magnetic tape, photo optical storage, RAID, micro fiche. Derived from the current detail data, lightly summarized data is the summarization of some attributes over some units of time. Like current detail data, it is almost always stored on disk storage. The highly summarized data is distilled from the lightly summarized data, and is compact, widely accessible, and most frequently used. Meta data is the data about data within the enterprise. According to Inmon, it contains (at minimum) the following three components:
the structure of the data, which serves as a directory of the contents of the data warehouse
the algorithms used for summarization
the mapping of data from the original sources to the data warehouse
Inmon also specifies a predictable flow of data within the data warehouse. As data is loaded into the data warehouse from the operational environment or some external source, it is transformed and then goes into the current detail level, where it is used in the calculation of the lightly summarized data and the highly summarized data or in some ad hoc queries. Then it is either moved to the older detail level through the aging process or simply purged.
Ken Orr outlines a Data Warehouse Architecture (DWA), broadly defined as "a way of representing the overall structure of data, communication, processing and presentation that exists for end user computing within the enterprise"(http://www.kenorrinst.com/dwpaper.html). It is comprised of the following interconnected parts:
Operational Data Base/External Data Base Layer
Information Access Layer, including the tools for manipulating, analyzing and presenting data
Data Access Layer, which is "responsible for interfacing between Information Access tools and Opeartional Data Bases"
Data Directory (also called the "Metadata") Layer
Process Management Layer, "the scheduler or the high level job control for" "the various tasks that must be accomplished to build and maintain the data warehouse and data directory information"
Application Messaging (also known as the "Middleware") Layer, involving the enterprise-wide transportation of information
Data Warehouse (Physical) Layer, where copies of operational and/or external data are actually stored
Data Staging (also called "Copy Management" or "Replication Management") Layer, including "all of the processes necessary to select, edit, summarize, combine and load data warehouse and information access data from operational and/or external data bases"
Return to top
Data warehouse products and services
The following are the links to data warehouse frameworks from the leading vendors:
HP's OpenWarehouse
Sybase's Data Warehouse Products
IBM's Information Warehouse
Software AG's Data Warehouse Intelligon
Data Warehousing with Informix
NCR Corp.'s (the former AT&T Global Information Solutions') Scalable Data Warehousing
Prism's Data Warehousing and Marts
Red Brick's PaVER Gold Program
the SAS Warehouse Solution
Siemens Pyramid's Smart Warehouse
Oracle's Data Warehousing
Links to vendors of various tools that can be used to build data warehouses and sites of providers of various services can be found at Larry Greenfield's "Data Warehousing Information Center