Job ad: “Wanted: truck driver to drive a trailer load of tropical plants from Atlanta to St. Petersburg. Must know optimal planting conditions, desired soil characteristics, drought tolerance, and disease resistance of each of the 65 plant species on board.”
Why data architecture?
There were two motivations for this paper. First, I am often amazed and disturbed by some of the job listings I see posted on the internet job boards because they demonstrate a prevalent misunderstanding of what data architecture is, and what data architects do. Much like the fictitious listing for a truck driver, above, so many technical job listings do not correlate with the title of the job being advertised. The position of “data architect” (and by obvious extension the discipline of data architecture) has been particularly misunderstood and abused.
Like the truck driver who is expected to have a degree in horticulture in order to properly deliver a trailer-load of plants, many companies seem to expect a data architect to know everything from Java programming, WebSphere, HTML, and XML to computer hardware and networking protocols. Whenever I see such “requirements” for a position labeled “data architect” I know immediately that the company doesn’t understand what data architecture is, and what a data architect really does.
This paper was written with the belief that most companies, large or small, can benefit from (at the very least) an understanding of data architecture, if not a full-blown, robust, implementation of an architecture which has been designed and customized expressly for their business model and information needs.
By better understanding what is entailed in designing and delivering data architecture, companies can decide how much formalism they need, and choose, cafeteria-style, which aspects of data architecture they want to implement. They will also be better prepared to know what they really want and need in a data architect.
Jan Popkin, the CEO of Popkin Software, having been asked at a conference to define Enterprise Architecture, was afterwards inspired to contemplate the issue:
As I thought more about this, I realized that if I had asked nine other people attending this speech to define enterprise architecture, I would have gotten nine different answers. Object-oriented developers would talk about how enterprise architectures can horizontally link class libraries. Another programmer might define it as code construction. A software architect would describe blueprints for their existing IT data and systems. Enterprise [data] architects would delve into areas such as business goals, corporate metadata, relational databases or Web services.
These definitions illustrate that people define enterprise architecture through the narrow context of their job, not from the enterprise level. Thus, they lack a key element of enterprise architecture: how information technology supports the business goals and mission. 
The second motivation stems from many casual conversations with friends and acquaintances. I have often been asked what “data architecture” is, and I realized that
- It is not something that can be easily captured in a sound bite, and
- The ignorance of the discipline seems to be widespread, even (as I have observed, and Mr. Popkin alludes) among Information System professionals
These realizations sparked in me a curiosity as to why this should be. After all, every system that does something non-trivial must deal with the care and feeding of live databases. Everyone who has ever participated in serious IT projects has encountered data flow diagrams, and relational databases. All significant IT initiatives have struggled with issues of data integrity and data quality.
Any company that has tried to make a commercial ERP or CRM package work with their existing data has discovered the “square peg/round hole” phenomenon. And anyone who has ever built (or attempted to build) a data warehouse has experienced firsthand the difficulties accompanying the task of integrating data from several different sources. ETL often degenerates into a brute-force effort to load the data, after which everyone crosses their fingers and hopes that it all plays nicely together in the warehouse.
Why, then, is data architecture so under-utilized and under-evangelized? I theorize that it is because
- It is not exciting like some of the bleeding-edge technologies that IT professionals encounter
- It is highly abstract
- It has no sexy user interface
- It has not enjoyed much favor among technical publishers
- It hasn’t made anyone a millionaire like some of the flash-in-the-pan internet technologies did
Although I knew it wouldn’t make me a millionaire, I decided to take the plunge and rectify these inequities. I can’t make data architecture sexy, and I haven’t found a cool graphical interface, but perhaps I can make it exciting (or at the very least, interesting).
And perhaps I can make it less abstract by breaking it down by the deliverables and artifacts which emanate from the process of creating data architecture. If I discuss the benefits of data architecture in terms of the value it can provide for a company (real dollars!) it is sure to pique your interest, right? Maybe you will get excited after all!
A fundamental proposition of this paper is that data architecture is a good thing for all companies, and an absolute necessity for some. But I understand that a robust implementation is an expensive proposition so I don’t want to insist that it makes sense for all companies.
For many large companies it will bring a tremendous payback. The return on investment will be dramatically evident.
But for many medium or small-sized companies, it might make more sense to remediate their data environment, using some of the principles of stewardship and semantic clarity, rather than investing large sums of money on architects, software tools, and a massive effort to overhaul their basic infrastructure.
All companies, however, no matter what size or what industry, need to understand the advantages of a formally defined data architecture (and the perils of not having one) so that they can make an informed decision as to whether they need it or not, and how much makes sense. Briefly, some of the benefits are:
- Superior data quality
- Unquestionably reliable data and reporting
- An infrastructure which is tailor-made for data warehousing
- An information system which supports integrating data from disparate legacy systems, or from purchased software packages
- A clear understanding of how all critical business concepts are captured in the data throughout the enterprise
- An environment that fosters and promotes collaboration among business units and discrete information systems
Obviously the potential detriments of not having a formal data architecture would be the reverse of the list above: uncertain data quality, questionable data and reports, an environment that does not support data warehousing, integration, or collaboration, etc. The benefits translate directly into dollars which flow to the bottom line. Conversely, not having the benefits of a well-designed data architecture can adversely affect company profits, and in most cases the company won’t even be aware of these “hidden” costs.
A fundamental purpose and inherent result of data architecture is to foster and promote high data quality. Virtually all of the other benefits accrue from this one. Impeccable data quality can synergistically provide a multitude of benefits, and, surprisingly, as Larry English likes to say, “High data quality is free!”
Poor data quality can result in poor customer service, incorrect invoices, ineffective ad campaigns, bad decision-making, missed opportunities, blurred corporate vision, and even operating under the wrong business model! A company which doesn’t know what it doesn’t know is doomed, through its own inertia and ignorance, to continue down a sub-optimal path toward either becoming a poor performer or, worse, going out of business. Yes, poor data quality can have such dire consequences, and a well-considered data architecture will help you avoid them.
What is Data Architecture?
“Data architecture is where the rubber meets the sky.”
– Neil Snodgrass, Data Architecture Consultant, The Hackett Group
Even among IT practitioners, there is a general misunderstanding (or perhaps more accurately, a lack of understanding) of what Data Architecture is, and what it provides. In general, Data Architecture is a master plan of the enterprise data locations, data flows, and data availability. It is a conceptual infrastructure to support data quality, data stewardship, data integration, data migration, and system collaboration. This infrastructure embodies a set of guidelines and standards which ensure that the data assets are managed appropriately, and that they conform to sanctioned principles for stewardship and quality.
Data Architecture is the discipline of designing, creating, and maintaining this infrastructure. It must accommodate the data and information needs of the company and do so in a manner which promotes high reliability and easy data integration among applications and data repositories.
The most visible and tangible product of effective Data Architecture is a reporting environment that
- Provides a single version of the corporate “truth”
- Allows business analysts to discover new insights, and
- Allows business executives and corporate decision makers to derive corporate strategies and actionable tactics from their data. Such a reporting environment usually entails one or more data warehouses, and one or more departmental or “competency” data marts.
The architecture describes how data flows from corporate transactions, through the various layers of transformation and integration, through operational data stores, all the way to the decision-support applications that query the data warehouse or some other data structure optimized for reporting and analytics. It is an infrastructure that, when properly implemented (i.e. when it follows the architecture and conforms to the corporation’s suite of “best practices”), guarantees the three benefits of the reporting environment described above.
As the humorous quote at the beginning of this section indicates, Data Architecture often seems somewhat nebulous as there is no physical manifestation (like an executable program manifests programming code, or like a relational database manifests an entity relationship data model).
Data Architecture has no programmatic instantiation and exists only as standards, policies, and corporate “best practices.” It resides only in the artifacts (text documents and graphic diagrams) which describe it, and in the “tribal knowledge” of the enterprise. The artifacts which describe it are the blueprint of the architecture, and serve a similar function for building reliable systems as a building architect’s blueprint serves for building a house.
A corporation’s Data Architecture is a mirror of the data and information generated and captured by the enterprise in order to do its business.
- It describes the business rules and the concepts which are critical for the enterprise to operate efficiently.
- It offers a “seal of approval” on the reliability of the data, and guarantees that corporate decision makers can make well-informed, fact-based decisions on policies and strategies.
- It provides for a sanctioned plan for stewardship of the data assets of the corporation, and details how data gets created, how it moves through the enterprise, and how it gets consumed.
Indeed, Data Architecture influences everything in the enterprise which “touches” the data. It motivates data policies, influences corporate goals, enables strategies for achieving those goals, and validates the tactics which implement those strategies. It encompasses all systems and programs in which data originates, in which data is transformed and/or cleansed, and to which data is migrated, or with which data is integrated.
By standardizing data definitions, data formats, and the acceptable storage, integration, and usage of the data, the architecture prepares the environment for data management, and it is by invigorating these standards that the powerful benefits of the Data Architecture (high data quality and unquestionable data reliability) are enabled. Also, by dictating how data gets integrated, migrated, cleansed, and transformed, Data Architecture provides a plug-and-play framework for data warehousing.
What are the artifacts and deliverables of Data Architecture?
Since Data Architecture is a conceptual and abstract discipline, it has no simple representation that one can point to and say, “That’s Data Architecture.” Data architecture serves and encompasses everything a company captures and maintains, in the realm of data and information (see Figure 1).
Having such a broad scope and impact, and such a high level of abstraction, it requires some seasoned imagination to conceive and understand what it is all about.
The one artifact that comes closest to capturing the essence of Data Architecture is a high-level data-flow diagram (Figure 2). But data flow is only one aspect of a complete architecture. There must be rules about how data flows or migrates through the information systems, and there must be a crystal clear understanding throughout the IT realm of which subject areas and concepts are important to the company’s business model. In addition there must be an enterprise-wide agreement as to the semantics of those concepts in all possible contexts (within the business model).
Since a fundamental goal of the architecture is to have absolutely unquestionable data quality and reliability, semantic clarity is the first step; but disciplined stewardship of the data, the concepts, and the business rules is the only way to move forward, past that first step, to achieve a robust and effective architecture.
In order to complete the picture, and implement the type of data environment which an ideal Data Architecture provides, there must be:
- Inspired analysis and design of the overall architecture
- Corporate sanction of the architecture’s goals
- Enforced compliance with the architecture’s rules
Artifacts of Architecture
The following deliverables and artifacts of the Data Architecture are designed to ensure that these three principles are delivered to the information systems which are destined to utilize the architecture. This is not a mandatory or an all-inclusive list. It is simply a recommended methodology, and does not preempt a different approach utilizing other documents and principles to achieve the desired environment.
Business Concept Definitions
Having corporate sanctioned definitions for the concepts which animate a company’s business model is the single most important element of Data Architecture. None of the major benefits of the architecture will accrue without them. Yet business concept definitions are often overlooked (or worse, purposely ignored) because (to many IT practitioners) it seems painfully like “documentation for documentation’s sake”. Nothing within the realm of enterprise data could be further from the truth.
Semantic clarity is mandatory for getting the full utility and all of the collateral benefits of enterprise Data Architecture. Unless all systems and programs agree on a single definition for each and every critical business concept, there cannot be any reliable data migration, data integration, data cleansing, or data warehousing. Analysts and executives who query the data warehouse(s) would have little or no reason for confidence in the accuracy of the information which is presented to them.
Data Stewardship Agreements
Stewardship is a vital element of any Data Architecture. Data stewards ensure the quality, accessibility, and protection of the data, and define the data standards (data definitions, concept definitions, data formats, and data domains). They are the guardians and maintainers of the Data Architecture. They ensure that there is a single data store of record (DSOR) for the vertical stripe of data which they are stewarding, and they prohibit non-conforming data silos from participating in the architecture.
Stewardship agreements are corporate documents that grant stewardship responsibilities to a person, initiative, or department, and need the advice and consent of the CIO or a CIO designate. Stewards are typically positioned at a high or mid-level of corporate responsibility, e.g. Director or Manager.
Data Sharing Agreements
Data sharing agreements are corporate documents that describe the data, where it is located, who protects it, and who can access it. Most data should be freely available throughout the enterprise. But some sensitive data needs to be restricted. The data sharing agreement, signed by all interested parties, describes who can access the restricted data, when it is available, and how the access is accomplished.
Even data that is not sensitive needs to be certified as “sharable.” Entities within the enterprise that want access to the DSOR for a concept need to be certified as conforming to the standards maintained for that concept (see Data Standards, below).
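As an illustration only, the contents of a data sharing agreement can be sketched as a small record with an access check. The class name, field names, and example parties below are hypothetical, and the rule encoded here (every party must be certified, and restricted data additionally requires explicit authorization) is one reading of the two paragraphs above, not a prescribed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SharingAgreement:
    """Hypothetical record of one data sharing agreement."""
    concept: str          # the business concept being shared
    location: str         # where the DSOR for the concept lives
    steward: str          # who protects the data
    restricted: bool      # True for sensitive data
    availability: str = "business hours"       # when the data is available
    access_method: str = "read-only view"      # how access is accomplished
    authorized: set = field(default_factory=set)  # parties allowed restricted data
    certified: set = field(default_factory=set)   # parties conforming to the standards

    def may_access(self, party: str) -> bool:
        # Even non-sensitive data is shared only with certified parties.
        if party not in self.certified:
            return False
        # Sensitive data also needs explicit authorization in the agreement.
        if self.restricted:
            return party in self.authorized
        return True


salary = SharingAgreement(
    concept="Salary", location="HR database", steward="HR Director",
    restricted=True, authorized={"Payroll"}, certified={"Payroll", "Marketing"},
)
product = SharingAgreement(
    concept="Product", location="Catalog database", steward="Product Manager",
    restricted=False, certified={"Marketing"},
)
```

In this sketch Marketing is certified for both concepts but can read only Product, because the Salary agreement does not authorize it.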
Data Usage Models (Stewardship Matrix)
Anyone who has been in Information Systems very long has heard of, and probably used, a diagram known as a CRUD matrix. CRUD stands for (C)reate, (R)ead, (U)pdate, and (D)elete, and details the data usage for an application, a system, or an initiative. The Data Usage Model (sometimes called a Stewardship Matrix) extends the old-fashioned CRUD matrix so that one can, at a glance, not only see how each application interacts with a given concept, but which application data store is the data store of record (DSOR) for each concept. The system which has the DSOR for a concept inherits the stewardship responsibilities for that business concept, and is obliged to:
1. Get enterprise-wide agreement of a definition for that concept
2. Document all of the business rules that pertain to the concept
3. Determine who (which systems and employee types) can see and use that data (via Data Sharing agreements discussed above), and
4. Maintain the integrity of the concept (by setting enterprise-wide data definitions, data formats, and data domains for the concept).
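A Data Usage Model of this kind can be sketched as a nested mapping. This is a minimal illustration, not a standard notation: the application names, the concept names, and the convention of marking the DSOR with a trailing asterisk are all assumptions made for the example.

```python
# Each cell holds the CRUD letters an application uses for a concept.
# A trailing "*" marks the application whose data store is the DSOR.
# All names are hypothetical.
MATRIX = {
    "Customer": {"CRM": "CRUD*", "Billing": "R", "Warehouse": "R"},
    "Invoice": {"CRM": "R", "Billing": "CRUD*", "Warehouse": "R"},
    "Product": {"CRM": "R", "Billing": "R", "Warehouse": "R"},
}


def dsor_for(concept, matrix=MATRIX):
    """Return the single application holding the DSOR for a concept.

    Raises ValueError when the concept has no DSOR or more than one,
    since the architecture calls for exactly one per concept.
    """
    owners = [app for app, ops in matrix[concept].items() if ops.endswith("*")]
    if len(owners) != 1:
        raise ValueError(
            f"{concept}: expected exactly one DSOR, found {len(owners)}"
        )
    return owners[0]


def unstewarded(matrix=MATRIX):
    """List the concepts that do not have exactly one DSOR."""
    missing = []
    for concept in matrix:
        try:
            dsor_for(concept, matrix)
        except ValueError:
            missing.append(concept)
    return missing
```

Here `dsor_for("Customer")` identifies the CRM system as the steward of Customer, while `unstewarded()` reports Product, which every application reads but none owns.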
Data Standards (Definition, Format, and Domain)
Data definitions are often captured in modeling tools like Erwin, and then propagated to the physical database in the form of comments on tables, columns, and relationships. They quite frequently can come directly from the Business Concept Definition document (see above). The DSOR for a concept contains the sanctioned definitions which relate to the concept and its attributes. Similarly, the DSOR should be considered the sanctioned source for the format of a concept’s data attributes and for the valid domain values of that concept.
An important criterion in data sharing is that all parties which want to use the data define that data in exactly the same way: in entity and attribute definitions, in format, and in domain values. This is crucial to having certifiably correct reports, and a high level of certifiable data quality.
Where definitions, formats, or domains differ, it is hard to maintain that both sides of the data sharing are, indeed, talking about the same concept. Before a sharing agreement can be executed and sanctioned by the enterprise (with signatures of appropriate parties), one side must change to conform to the other, or both sides must change and adopt a negotiated settlement that remediates the differences.
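The comparison of definition, format, and domain described above can be sketched as a simple conformance check. The class, the attribute values, and the "customer status" example are hypothetical; a real check would draw the sanctioned values from the DSOR's own metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeStandard:
    """The three facets of a data standard for one attribute."""
    definition: str
    fmt: str            # physical format, e.g. "CHAR(1)"
    domain: frozenset   # the set of valid values


def differences(dsor, candidate):
    """Return the facets on which a candidate departs from the DSOR."""
    facets = ("definition", "fmt", "domain")
    return [f for f in facets if getattr(dsor, f) != getattr(candidate, f)]


# Hypothetical example: customer status as held by the DSOR and by a data mart.
DSOR_STATUS = AttributeStandard(
    definition="Whether the customer may place new orders",
    fmt="CHAR(1)",
    domain=frozenset({"A", "I"}),
)
MART_STATUS = AttributeStandard(
    definition="Whether the customer may place new orders",
    fmt="VARCHAR(8)",
    domain=frozenset({"ACTIVE", "INACTIVE"}),
)

mismatches = differences(DSOR_STATUS, MART_STATUS)
```

An empty result would mean the candidate conforms; here the data mart agrees on the definition but not on format or domain, so it would have to change before a sharing agreement could be signed.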
Data Flow Diagrams
Many in Information Systems think of data flow diagrams (DFDs) as being equivalent to Data Architecture, as being The Architecture. DFDs are a vital tool for conveying the scope and boundaries of the architecture, but (as we hope we have demonstrated in this white paper) they are only a tool, and only one of many.
DFDs describe how data flows throughout the enterprise — from creation of the data, through various layers of refinement, cleansing, and transformation, to the consumption of the data on reports, executive dashboards, or display screens. They are a key to documenting the overall architecture, and are a very useful starting place for the data mapping used by cleansing initiatives or for ETLs which load the data warehouse.
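To suggest how a high-level data flow diagram can serve as a starting place for data mapping, the sketch below records the flows as a directed graph and answers one question: which stores and reports sit downstream of a given source. The node names are invented for the example and do not come from any figure in this paper.

```python
# Each key is a data store or system; each value lists where its data flows next.
# All node names are hypothetical.
FLOWS = {
    "Order Entry": ["Operational Data Store"],
    "Billing": ["Operational Data Store"],
    "Operational Data Store": ["Data Warehouse"],
    "Data Warehouse": ["Finance Data Mart", "Executive Dashboard"],
    "Finance Data Mart": ["Monthly Report"],
}


def downstream(node, flows=FLOWS):
    """Return every node reachable from `node`, sorted by name."""
    seen = set()
    stack = [node]
    while stack:
        for nxt in flows.get(stack.pop(), []):
            if nxt not in seen:   # guard against revisiting shared paths
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)
```

Calling `downstream("Billing")` lists everything that a change in the billing system could affect, which is the kind of scoping question a cleansing initiative or an ETL design starts from.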
Conceptual models are diagrams that summarize all of the critical and interesting concepts which are inherent in the business, and the relationships among them. A very high-level conceptual model diagrammatically details only the subject areas (e.g. Finance, Human Resources, Products, etc.) of interest, and the relationships between subject areas and concepts. This type of model is called, naturally, a Subject Area Model.
The next lower level of detail is captured by a concept model (sometimes called a data planning model) which depicts each interesting concept and the relationships among the concepts. One method of portraying this model is with an un-attributed entity relationship (ER) model. Indeed, most (if not all) of these business concepts will end up being fully-attributed entities in one or more logical models which support one or more transactional systems. The relationships between concepts in this type of model conform naturally enough to the concept of relationships in ER modeling.
Another very effective technique for conceptual modeling is a formal modeling notation known as Object-Role Modeling (ORM). Object-Role modeling was designed for this purpose, and allows useful insights into the concepts and relationships which might be overlooked using the traditional ER modeling notation. ORM is sometimes eschewed as being too tedious, but this is due mostly to a lack of good graphical tools designed to support the technique.
If you have undertaken the discipline of creating conceptual models, you will find that the logical models evolve from the conceptual ones quite naturally. The major concepts become entities, and many of the minor ones become attributes for those entities.
Physical models are dependent on the choice of DBMS used, and are in the domain of the DBAs. Whereas the physical representation is definitely an artifact of the architecture, its main purpose is to document where the concepts reside (which DBMS and which database) and how the concepts and entities had to be modified (if at all) in order to become columns in tables. The physical residence of business concepts is an important piece of information for Data Sharing Agreements.
Data Warehouse Artifacts
Data warehouses have many artifacts and deliverables. All of the artifacts and deliverables mentioned here for Data Architecture will be utilized in building a data warehouse.
Metadata Standards and Maintenance
Metadata is the sum of all of the corporate knowledge about the corporation’s business processes and the data that qualifies and quantifies it. There are two types of metadata: technical and business.
Technical metadata is used by Information Technology practitioners to standardize, categorize, and define the data structures used to capture information in databases. Technical metadata describes the physical properties of the data, how it relates to other data, and mappings between sources and destinations of data that is moving through the system(s). It is invaluable for standardizing the data formats, definitions and domains across systems.
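The source-to-destination mappings mentioned above can be sketched as plain records, with a helper that walks them backwards to recover the lineage of a warehouse column. The column names and transformation rules are hypothetical, and the sketch assumes each target column has exactly one source, which real mappings often do not.

```python
# Hypothetical source-to-target mappings, one record per hop.
MAPPINGS = [
    {"source": "crm.customer.cust_nm",
     "target": "ods.customer.name",
     "rule": "trim whitespace"},
    {"source": "ods.customer.name",
     "target": "dw.dim_customer.customer_name",
     "rule": "copy"},
]


def lineage(target, mappings=MAPPINGS):
    """Trace a column back to its origin, returned in source-to-target order.

    Assumes a single source per target; raises ValueError on a cycle.
    """
    by_target = {m["target"]: m["source"] for m in mappings}
    path = [target]
    while path[-1] in by_target:
        source = by_target[path[-1]]
        if source in path:
            raise ValueError(f"cyclic mapping involving {source}")
        path.append(source)
    return list(reversed(path))
```

With these records, `lineage("dw.dim_customer.customer_name")` walks from the warehouse dimension back through the operational data store to the originating CRM column.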
Business metadata is used to guide the system users (data consumers) through the data and the problems they are trying to solve with it. It provides, on a fundamental level, basic description information for the data fields. At a more robust level, it provides the foundation for understanding the content and source of the information. The business metadata provides a conceptual context for the technical metadata, and is often undocumented, only to remain as “tribal knowledge.” Accurately capturing and standardizing business metadata is always an important challenge for Data Architecture.
What can we expect from implementing Data Architecture?
At the very least, Data Architecture provides a high-level map of the data topology for an enterprise. It describes how the data originates, where it resides, where it migrates, what transformations are applied to it to cleanse and standardize it, and what it means (the semantics). This information, alone, is “worth its weight in gold” by allowing management as well as technicians to understand the data karma of the enterprise.
At its best, it goes way beyond this simple documentation, and becomes an active principle that lives within the data, energizing and leveraging it in a multitude of ways. The data becomes an organic corporate asset that invigorates and motivates the enterprise, and provides a clear path to the realization of the corporate vision, goals, and strategies.
To someone who has never experienced a robust and inspired Data Architecture in action, this may sound a little like poetic license or hyperbole. But it truly is not. Metaphors aside, corporate personnel who discover the synergistic benefits of Data Architecture for the first time often wonder how they ever functioned without it.
With a well-considered data architecture, data that once was suspect or needed “tweaking” in order to balance the books, becomes as reliable as “Old Faithful.” Analysts who once complained that the reliability of the data made their analysis contrived and incomplete, become ardent converts — often clamoring for more bandwidth to allow their heuristics to discover all of the exciting possibilities that are contained in their newly conformed data warehouse.
Data warehouse developers who previously spent many hours of overtime trying to shoe-horn data from legacy systems into the warehouse, happily discover that ETLs and data maps become self-revealing, and the data warehouse is found to be the software equivalent of “plug-and-play.” Executives who had struggled to find meaning in their daily, weekly and monthly reports, now discover nuggets of information which inspire new visions, and blaze new trails to outsmart and outmaneuver the competition.
Because of guaranteed data reliability and the framework which enables death-defying data transformations, Data Architecture can have a positive impact on virtually every operational function, every department, and every profit center.
The artifacts describe how this should happen: Data Stewards enable semantic clarity and enforce the standards. Data analysts and planners set the policies and discover the vision. Program and project managers instantiate the ideals.
Data integrators become empowered to fold all data into a single vocabulary, whether they are dealing with existing disparate systems, new system development, or third-party packaged systems. And everyone throughout the enterprise finds a new appreciation and respect for the data that pulses through the architecture’s veins.
Optimal Data Architectures are flexible and can be implemented in stages. The key is to have a high-level plan which accounts for the goals and aspirations of the enterprise. Once that is in place, the benefits of Data Architecture can be prioritized and implemented in a seamless, phased-in approach that accommodates the specific needs of any organization.
 Quoted from an editorial comment by Jan Popkin in SDTimes magazine, May 15, 2002
 Larry P. English is a noted advocate and lecturer on data quality, and has written “The Bible” on the subject, Improving Data Warehouse and Business Information Quality, Wiley. In this book Mr. English states that “Quality is free. It’s not a gift, but it is free. What costs money are the unquality things — all the actions that involve not doing jobs right the first time.” And, “Every penny you don’t spend on doing things wrong, over, instead, becomes half a penny right on the bottom line. If you concentrate on making quality certain you can probably increase your profit by an amount equal to 5 to 10 percent of your sales.”
 A data warehouse can be built without Enterprise Data Architecture, but it is highly inadvisable. Likewise, a data architecture can exist for an enterprise that is not doing any data warehousing, but it provides the optimal benefit to the corporation when it establishes the blueprint for integrating disparate enterprise data into a data warehouse.
At the time this paper was written, Ralph C. “Rusty” Alderson was a Senior Consultant with Third Coast Software Foundry, Austin, Texas, specializing in Data Architecture and data-related issues. He is retired now.
©️ Rusty Alderson, 2019, All rights reserved.