
Information service patterns, Part 3: Data cleansing pattern


Level: Intermediate

Dr. Guenter Sauter (gsauter@us.ibm.com), Senior IT Architect and Manager, IBM 
Bill Mathews (bmathews@us.ibm.com), Senior IT Architect, IBM
Ernest Ostic (eostic@us.ibm.com), IT Specialist, IBM 

06 Apr 2007

The data cleansing pattern specifies rules for data standardization, matching, and survivorship that enforce consistency and quality on the data they are applied against. The pattern validates data, whether persistent or transient, against the cleansing rules and then applies changes to enforce those rules. In this article, you learn to apply the data cleansing pattern within a Service-Oriented Architecture (SOA) context. This pattern specification helps you, as a data or application architect, to make informed architectural decisions and to improve decision guidelines.

Introduction

Information is one of the most strategic assets of any organization. Data quality is a key prerequisite for using information to gain insightful advantages in the marketplace. Poor data turns potentially valuable information into worthless byte streams. For example, incorrect address information for a "party" (such as a client, patient, or customer) prevents accurate and relevant insight from being surfaced: whether two parties are the same, how many clients the organization actually has, or what a 360-degree view of a customer looks like. Sloppy data can also decrease customer satisfaction, complicate communication, and increase costs as people work around the problem.

Layers in a software architecture

A typical software architecture has four layers:

  • Database layer -- Located at the "bottom," it is responsible for persisting the data and providing create, read, update, and delete operations on the data.
  • Application layer -- Above the database layer, it provides the business logic.
  • Process layer -- It orchestrates the business logic through workflow management.
  • Presentation layer -- On the top, it delivers the visualization of all underlying layers to the end user.

Data quality issues are most severe when information is scattered across isolated and heterogeneous data stores. Such an environment typically brings with it differing data formats and inconsistent values. Even within a single database, the quality of persisted data is not necessarily better if the appropriate rules are not enforced. Whether information is at rest in a data store or manipulated on the fly by an application, data quality is often either not enforced at all or controlled by different components using inconsistent rules embedded in application code.

To turn information into insight and to leverage its significant value, data quality needs to be addressed by applying data cleansing consistently; that is, using consistent cleansing rules throughout the enterprise, not only in the database layer but also in the application and process layers.

After briefly describing the value of this approach, you'll learn the context in which the data cleansing pattern should be applied. Next, you learn about the problem and the solution approach for this pattern. Finally, you'll get a summary of the focus and risk areas, as well as the constraints of this pattern.





Value proposition of the data cleansing pattern

Three major values that this pattern can deliver are:

  • Consistency and quality advantage
  • Reduced development and maintenance costs
  • Reusability advantage

Let's take a closer look at each of these.

Consistency and quality advantage

The most significant benefit of applying the data cleansing pattern is that it improves the consistency and quality of information, whether that information is persisted in a database or processed by an application.

Applying the data cleansing pattern in an SOA context provides the business processes with the capability to manage and assure data quality "at the point of its initial collection." Applying data cleansing before the information is persisted allows for the incorporation of business-defined validation mechanisms at the point of entry, such as in data entry portals.

Reduced development and maintenance costs

This pattern provides recommended practices on how to specify cleansing rules and how to most effectively apply them to persistent and transient data. Many implementations of the data cleansing pattern provide sophisticated tools to develop, test, and deploy the cleansing rules. These tools can help reduce the overhead in many projects where cleansing rules are specified manually and have to be maintained in a laborious fashion.

Reusability advantage

An important aspect of the data cleansing pattern is its focus on reusability at an enterprise level. If each database and application implements its own cleansing process, the result is a patchwork of inconsistent cleansing rules: data quality may improve somewhat, but not efficiently, not consistently, and not to the level required. This pattern describes how the same cleansing rules can be applied consistently to a broad range of consumers.





Context

The traditional context of the data cleansing pattern is the database layer, which is where it is most often applied. Based on increasing interest in SOA, we see new opportunities to apply this pattern in an SOA context.

Traditional, non-SOA context

The data cleansing pattern has typically been applied to the cleansing of name and address information, but it can also be applied to any freeform text, such as descriptions of products within inventory systems. Freeform text most commonly refers to data that is manually entered rather than selected from a standard pick list, or data entered without any formatting, such as a complete address in a single field. The data cleansing pattern is defined as the standardization, clean-up, and ultimately, matching (or de-duplication) of records based on the content of freeform text fields.

Definition of Master Data Management
The set of core data critical to the business objectives of an organization is what we call master data. It includes products and parties (such as customers and suppliers). Because master data is so crucial for almost all business functions, it is often distributed across many applications. Master Data Management (MDM) addresses the efficient management, integration, and synchronization of master data across multiple applications.

In the traditional, non-SOA context, this process is usually a batch function, performed periodically to meet the information integrity requirements of Master Data Management and to reach a single, consistent view of important corporate entities. Data cleansing has also traditionally been applied in many data warehousing projects. To support analytics, reporting, or a 360-degree view of entities such as the customer, data is collected from multiple silos around the enterprise. Those islands of information can result from independent divisions, product groups, or subsidiaries, or from a one-time event such as a merger or acquisition. If in-house technology does not exist, or if additional enrichment of the data with demographic or marketing detail is desired, the collected data is often sent to outside firms for cleansing. Some time later, perhaps days or even longer, the data is returned, presumably de-duplicated and enhanced. If processed in house, the traditional data cleansing pattern is usually a "point-in-time" function, performed nightly or weekly for general data clean-up or for loading operational data stores, reporting systems, or data warehouses.

Example of use

A major retail jewelry outlet uses the data cleansing pattern on a weekly basis to reconcile customer names from an array of systems that handle new customer accounts, consumer loyalty, lead generation, and billing. Each data entry point is a possible source of duplication for customer names or for multiple customers who may live at the same location. The approach to de-duplicating and making sense of such information is one example of applying the data cleansing pattern. Seen initially as a way to save on such simple things as postage, forward-thinking companies now rely on the data cleansing pattern to gain further insight into consumer spending patterns, to better identify large volume buyers, and to consolidate sales inquiries, customer support, and billing into one location to enhance the customer experience.

Figure 1 illustrates the high-level architecture of applying the data cleansing pattern in the traditional context.


Figure 1. Traditional context of the data cleansing pattern

SOA context

The SOA context for the data cleansing pattern takes advantage of sophisticated standardization and matching techniques and extends them to the front line of near-real-time applications. Viewed in this context, the data cleansing pattern allows an enterprise to extend its capabilities for validation and matching to the point of creation. Furthermore, the same de-duplicating and matching logic used in batch operations can be integrated with sophisticated search methodologies or used to enhance the capability to locate customer information when information or identifiers are either unknown or incomplete.

The SOA context for data cleansing allows for standardization and matching of individual request strings. A single name or address is dynamically cleansed, returned in a standardized format, or in the case of discovery, returned along with a set of possible candidates that are identified in the matching process. In data entry solutions, this improves data representation (consistent abbreviations of street types and states, for example) and increases the odds of finding a duplicate before it is persisted. Avoiding the problems caused by duplicates in advance is far less costly than trying to correct them afterward or suffering the consequences of mishandled accounts because a single view of a customer was not possible.
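
As a concrete illustration of such a service-oriented interface, Listing 1 is a minimal sketch of a hypothetical cleansing service contract for single-record requests. The interface, class, and field names are assumptions made for illustration and do not represent any particular product's API.

Listing 1. A hypothetical single-record cleansing service contract (sketch)

// Hypothetical service contract for invoking data cleansing on a single
// record in an SOA context. Names and fields are illustrative only; a real
// deployment would expose an equivalent operation as a Web service.
public interface AddressCleansingService {

    /** Standardizes one freeform address and returns candidate matches. */
    CleansingResult cleanse(String freeformAddress);
}

/** Immutable result carrying the standardized form plus match candidates. */
class CleansingResult {
    final String standardizedAddress;          // for example, "1007 N MAIN ST"
    final java.util.List<String> candidates;   // possible existing records

    CleansingResult(String standardizedAddress, java.util.List<String> candidates) {
        this.standardizedAddress = standardizedAddress;
        this.candidates = candidates;
    }
}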

Example of use

A point-of-sale application, represented by the portal on the upper left in the example in Figure 2, might be used for a customer loyalty module managing profiles of common customers. In the case of the major jewelry chain mentioned earlier, this application subset is enhanced by the application of the data cleansing pattern in an SOA context. Imagine a visitor to the store who has forgotten or never knew their customer loyalty code. A casually entered (and potentially error-prone) textual customer name is dynamically matched against the central master data store in real time, and a list of candidate names is returned. From the candidates, the actual customer profile can be found or the visitor verified as a new customer. The store clerk can now provide services earned by a loyal customer, such as free appraisals, and can suggest new gift purchases based on the profile, such as upcoming birthdays or anniversaries. The customer gets extra attention and feels like she is being treated as someone special. The potential revenue benefits and improved customer experience are realized by deploying data cleansing as a service. Matching and standardization rules are reused, taking advantage of the same functions applied during batch runs.


Figure 2. SOA context of the data cleansing pattern




Problem statement

The data cleansing pattern addresses the challenge of improving data quality and ensuring data consistency at the metadata level as well as at the data level itself. Typical causes of poor data quality are:

  • Data entry (typo) errors
  • Metadata definitions (data models) that are too loosely or inconsistently specified
  • Integrity constraints that are not (appropriately) defined or enforced

For example, the definition or constraint that a postal code must be a valid number may be missing or not enforced consistently. Implementations may fail to check whether the number is valid or whether it is a number at all. Formats representing the same real-world entity might conflict (for example, a number versus a string representing a postal code). As just described, inconsistencies can surface at the metadata level as well as at the data level itself. Even if the data models are defined appropriately and consistently, a lack of appropriate integrity constraints on data values can lead to quality and consistency issues. The same real-world entity might be represented with different data values, such as different part numbers for a product or different weight measurements. Some of the most common problems include:

  • Lack of separation of values (for example, a complete address in a freeform field without any indication of where the street ends and the city begins)
  • Lack of standards for data formats and values, such as:
      • Data types (integer or varchar, for example)
      • Text formatting ("123-45-6789" or "123456789" or "123 45 6789")
      • Abbreviations ("IBM" or "I.B.M." or "Int. Bus. Machines" or "International Business Machines")
      • Level of abstraction and granularity ("Massachusetts" or "Suffolk County")
      • Required attributes (the title of a person) or parts of attributes (the type of an organization within its name, such as "IBM" or "IBM Corporation")
  • Lack of consistent values for identifiers
  • Incorrect placement of values into attributes (the value of a ZIP code appears in a phone number attribute)
  • Incorrect values due to incorrect data entry or outdated information ("Somers, CT 10589", where the postal code "10589" is not in Connecticut but in New York)
  • Duplicate records due to inconsistent values in one or more attributes
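
To make a few of these issues concrete, Listing 2 is a minimal sketch (in Java) that standardizes two of the problems above: inconsistent text formatting of an identifier and inconsistent abbreviations of an organization name. The class name, method names, and mappings are illustrative assumptions rather than a complete rule set.

Listing 2. Simple format and abbreviation standardization (sketch)

import java.util.Map;

public class SimpleStandardizer {

    // Known variants of an organization name mapped to a standard value
    // (illustrative entries only).
    private static final Map<String, String> ORG_SYNONYMS = Map.of(
            "I.B.M.", "IBM",
            "INT. BUS. MACHINES", "IBM",
            "INTERNATIONAL BUSINESS MACHINES", "IBM");

    /** Reduces "123-45-6789", "123 45 6789", and "123456789" to one format. */
    static String normalizeIdentifier(String raw) {
        return raw.replaceAll("[^0-9]", "");   // keep digits only
    }

    /** Maps known variants of an organization name to a standard value. */
    static String normalizeOrganization(String raw) {
        String key = raw.trim().toUpperCase();
        return ORG_SYNONYMS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        System.out.println(normalizeIdentifier("123-45-6789"));          // 123456789
        System.out.println(normalizeOrganization("Int. Bus. Machines")); // IBM
    }
}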




Solution description

The design time characteristics of the data cleansing pattern pivot around establishing standard rules for conversion and cleansing of the data sources, defining matching criteria to support de-duplication, and identifying how to determine the most current or correct data. As you might imagine, design is the most critical and complex phase in the data cleansing process. Once this task is complete, the cleansing, matching, and survivorship rules are applied in the run-time processes.

Design time characteristics

Designers who apply this pattern must specify cleansing rules, possibly supported by appropriate tools. This activity can be categorized into four major steps:

  1. Parsing of input data and association to standard and fine-grained elements
  2. Standardization of data
  3. Matching and de-duplication of data entries
  4. Survivorship of the correct information

As indicated in the problem statement, data values might be represented in free text or in aggregated fields (a street attribute may capture the house number and the direction of the street as well as the street name). Based on an understanding of what data is actually captured in a field, the first step is to determine the algorithm for separating data values and assigning them to the most appropriate elementary attributes. This requires domain-specific knowledge (for example, streets in the U.S. may have a direction indicator, such as "1007 North Main Street", which is uncommon in Germany).
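
As a rough illustration of this parsing step, Listing 3 separates a U.S.-style street line into elementary attributes using a regular expression. The pattern and the element names are assumptions for illustration; production parsers rely on much richer, country-specific grammars and dictionaries.

Listing 3. Parsing a freeform U.S. street line into elements (sketch)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreetLineParser {

    // House number, optional direction, street name, street type.
    private static final Pattern US_STREET = Pattern.compile(
            "^(\\d+)\\s+(North|South|East|West|N|S|E|W)?\\s*(.+?)\\s+" +
            "(Street|St\\.?|Avenue|Ave\\.?|Road|Rd\\.?)$",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = US_STREET.matcher("1007 North Main Street");
        if (m.matches()) {
            System.out.println("house number: " + m.group(1)); // 1007
            System.out.println("direction:    " + m.group(2)); // North
            System.out.println("street name:  " + m.group(3)); // Main
            System.out.println("street type:  " + m.group(4)); // Street
        }
    }
}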

After data values are correctly assigned to attributes, the designer needs to specify how to standardize the values. This means that the designer needs to find answers to questions such as:

  • Should the text be upper case or mixed case?
  • Should the numbers be converted to the appropriate data type (such as "nineteen" into "19")?
  • Should numbers in the postal code field represent a correct postal code?
  • Does the postal code match the state (and the city)?
  • Does the complete address (street number, street, city, state, postal code) exist?
  • What will the standard representation for a name (such as "Bob") be? (This step is to identify duplicates, not to propose a correct name, which may very well be "Bob" and not "Robert.")

Some of the standardization rules are straightforward and do not require significant effort, such as converting mixed-case data into upper case. Others are relatively advanced and require access to a database that stores correct values, such as the correct association between postal code, city, and state in the U.S. Standardization rules may also have to be context sensitive: a string such as "St. Virginia St." is determined to have a street name of "St. Virginia" and a street type of "street" (assuming that this is a U.S. address). Although the two "St." tokens are technically identical, they have different meanings when interpreted through the eyes of an intelligent rule set.
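
Listing 4 is a minimal sketch of such a context-sensitive rule, using the "St. Virginia St." example. The rule itself and the standardized street type value ("ST") are assumptions for illustration only.

Listing 4. A context-sensitive standardization rule (sketch)

public class ContextSensitiveStandardizer {

    /** Returns { street name, street type } for a U.S.-style street line. */
    static String[] standardize(String streetLine) {
        String name = streetLine.trim();
        String type = "";
        // Only a trailing "St."/"Street" is interpreted as the street type;
        // a "St." elsewhere in the line remains part of the street name.
        if (name.matches("(?i).*\\s(St\\.?|Street)$")) {
            type = "ST";
            name = name.replaceAll("(?i)\\s+(St\\.?|Street)$", "");
        }
        return new String[] { name, type };
    }

    public static void main(String[] args) {
        String[] parts = standardize("St. Virginia St.");
        System.out.println("street name: " + parts[0]); // St. Virginia
        System.out.println("street type: " + parts[1]); // ST
    }
}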

In many cases, the designer has to identify potential duplicate records. Unfortunately, even after the data values are standardized, the records are often not identical. In one record, a person's name might be "J. Smith" and in another record it might be "John Smith". One of the challenges in identifying a match is to determine how likely it is that "J. Smith" is the same person as "John Smith". Obviously, this depends on the other information contained in the records. If the address is exactly the same, a match is very likely. It also depends on how many people share the same first and last name: it is less likely that there are two "April Back-Cunninghams" in the same city than two "Robert Johnsons". The same pattern that applies to parsing and lexical analysis of freeform names and addresses can be applied to standardize and de-duplicate product listings or parts for enhanced inventory control. Note that although most of our examples use U.S. addresses, this approach applies to any country and to data beyond addresses.

There are two methods to address matching: deterministic matching and probabilistic matching. Deterministic matching is based on business rules and algorithms that define a match. The advantage of this method is that it provides a clear answer: two records either do or do not match. However, the rule set is typically limited to rules and algorithms of simple or medium complexity. The probabilistic method leverages statistical algorithms and fuzzy logic to indicate a match. This approach uses more powerful mechanisms to identify a match and provides the probability that records match, such as 93%. The degree of confidence in the match should be balanced against the value of the information being matched and the cost of determining the match.
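
Listing 5 illustrates the probabilistic flavor of matching with a simple weighted attribute score. The weights, the similarity measure, and the threshold are assumed values for illustration; real matching engines use statistically derived weights and far more sophisticated comparators.

Listing 5. A weighted match score in the probabilistic style (sketch)

public class MatchScorer {

    /** Very rough similarity: an exact match scores 1.0, a prefix such as "J" vs "John" scores 0.8. */
    static double fieldSimilarity(String a, String b) {
        a = a.trim().toUpperCase().replace(".", "");
        b = b.trim().toUpperCase().replace(".", "");
        if (a.equals(b)) return 1.0;
        if (!a.isEmpty() && !b.isEmpty()
                && (a.startsWith(b) || b.startsWith(a))) return 0.8;
        return 0.0;
    }

    /** Combines per-attribute similarities into one weighted score. */
    static double matchScore(String[] rec1, String[] rec2, double[] weights) {
        double score = 0.0;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * fieldSimilarity(rec1[i], rec2[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        String[] a = { "J.",   "Smith", "12 Oak Street, Boston" };
        String[] b = { "John", "Smith", "12 Oak Street, Boston" };
        double[] weights = { 0.2, 0.3, 0.5 };        // first name, last name, address
        double score = matchScore(a, b, weights);    // 0.2*0.8 + 0.3*1.0 + 0.5*1.0 = 0.96
        System.out.printf("score %.2f -> %s%n", score,
                score >= 0.9 ? "likely match" : "needs review");
    }
}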

Based on the matching rules, the designer specifies survivorship rules that determine which records, and which attributes within a record, reflect the correct information and are carried forward, and which need to be discarded.
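
As a simple illustration, one possible survivorship rule is that the most recently updated non-empty value of an attribute survives among the matched records. The record shape and the rule in Listing 6 are assumptions for illustration; actual survivorship rules are business-defined, often per attribute.

Listing 6. A "most recent non-empty value wins" survivorship rule (sketch)

import java.time.LocalDate;
import java.util.Comparator;
import java.util.List;

public class Survivorship {

    record CustomerRecord(String name, String phone, LocalDate lastUpdated) {}

    /** Picks the surviving phone number from a set of matched records. */
    static String survivingPhone(List<CustomerRecord> matches) {
        return matches.stream()
                .filter(r -> r.phone() != null && !r.phone().isBlank())
                .max(Comparator.comparing(CustomerRecord::lastUpdated))
                .map(CustomerRecord::phone)
                .orElse("");
    }

    public static void main(String[] args) {
        List<CustomerRecord> matches = List.of(
                new CustomerRecord("John Smith", "",             LocalDate.of(2007, 3, 1)),
                new CustomerRecord("J. Smith",   "914-555-0100", LocalDate.of(2006, 11, 5)));
        System.out.println("surviving phone: " + survivingPhone(matches)); // 914-555-0100
    }
}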

With the core task of specifying the cleansing rules complete, the final step is to specify how to gather the input data (using a database extract or query, or a service request) and how to provide the output of the cleansing process (applying it to a database or returning it as a service response).

Figure 3 shows an overview of this activity.


Figure 3. Design time aspects of the data cleansing pattern
Definition of data profiling

Data profiling is the analysis of databases to better understand their metadata. The goals are to:

  • Discover unspecified metadata including any relationships within and across the databases.
  • Check the accuracy of specified integrity rules.
  • Make recommendations for a more appropriate data model.

The latter is particularly important for the definition of an integrated data model over heterogeneous sources.

It is important to note that the data cleansing pattern is often applied together with other patterns; the green boxes in Figure 3 are such an example. For the developer or designer to specify cleansing rules, a sufficient understanding of the data sources against which the data cleansing pattern shall be applied is necessary. This includes identifying and understanding the information semantics, such as the meaning of the data model elements as well as structural information. Data profiling will help to derive this knowledge from the underlying data sources. In many cases, the data cleansing pattern is applied together with the data consolidation pattern. In such a scenario, the mappings between data elements from the sources into the target need to be specified (called integration modeling in Figure 3).

Run time

The data cleansing service receives data with an undetermined level of data quality as input. Typically the service is either invoked with this input as a parameter of the service request (by value), or the service gathers the data from one or more specified sources (by reference). The service then applies the cleansing rules against the source data. Depending on the complexity of the data cleansing rules, this process may require a lookup within a database or a dictionary to validate the correctness of information (such as a correct combination of postal code, city, and state). Upon successful completion of the processing, the cleansed information is returned as the service's reply. In a traditional context, the output is most often applied to a database or further processed in a data consolidation process. (See Resources for more articles about data consolidation.)
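
Listing 7 is a minimal sketch of this run-time flow for a single record invoked by value: the service applies a lookup-based rule (postal code to city and state) and returns the cleansed result. The lookup table and record layout are illustrative assumptions; a real implementation would consult a postal reference database or dictionary.

Listing 7. Run-time cleansing of a single record with a lookup (sketch)

import java.util.Map;

public class CleansingRuntime {

    // Stand-in for a database or dictionary lookup of official postal data.
    private static final Map<String, String[]> POSTAL_CODES = Map.of(
            "10589", new String[] { "Somers", "NY" });

    /** Returns { city, state, postalCode } after applying the lookup rule. */
    static String[] cleanse(String city, String state, String postalCode) {
        String[] official = POSTAL_CODES.get(postalCode);
        if (official != null) {
            // Corrects city/state inconsistencies such as "Somers, CT 10589".
            return new String[] { official[0], official[1], postalCode };
        }
        return new String[] { city, state, postalCode };   // unresolved; flag for review
    }

    public static void main(String[] args) {
        String[] out = cleanse("Somers", "CT", "10589");
        System.out.println(out[0] + ", " + out[1] + " " + out[2]);  // Somers, NY 10589
    }
}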

Data cleansing services frequently require a high degree of sophistication implemented by a data cleansing server that can provide high levels of performance and scalability through the exploitation of parallel processing. This server is then able to process and cleanse extremely large data volumes in batch mode as well as single records in a real-time invocation environment. The requirement to process large data volumes is illustrated by a scenario involving the consolidation and cleansing of the complete content of multiple legacy systems. At the other end of the spectrum, the data cleansing server may be invoked through a Web service by a portal application to check an address that was entered on the screen. In this situation, the server may need to respond in real time to a high number of concurrent requests.





Considerations

When applying the data cleansing pattern, it is important to understand how it impacts the following nonfunctional requirements.

Frequency of transaction execution

The ability of the cleansing service to execute data cleansing transactions at a high rate is determined by the rate at which the data cleansing server can access the input data and apply the cleansing rules. The more complex the cleansing rules and the more lookups required, the longer the cleansing activity will take.

A data cleansing server that can exploit parallel processing capabilities will have a significant advantage over other approaches; IBM® WebSphere® QualityStage (see Resources) is one example of an off-the-shelf data cleansing server.

Performance/transaction response time

The transaction response time (applying cleansing rules against input data and returning the output) is determined by the complexity of the cleansing rules and the ability of the cleansing server to efficiently process the data. Cleansing server implementations that can compile the rules and exploit parallel processing capabilities will perform better than alternative implementations.

Data volume per transaction

It is fairly common to apply the data cleansing pattern against sets of bulk data as well as individual records. Therefore, the data cleansing server needs to be able to scale in order to process large data volumes.

Transformation capabilities

The data cleansing activities (parsing or separation of values, standardization, matching and survivorship) are specified as cleansing rules. They are ultimately applied to the transformation of input data, which may be of low quality, into an output with a higher level of consistency and quality. Since these transformation rules may be numerous and complex, many implementations of the data cleansing pattern deploy the cleansing rules as transformation operations using a data cleansing server.

The transformation capabilities of the data cleansing pattern are specialized and focus on improving data quality and integrity by standardizing and matching data. More general data transformation approaches -- such as those described in Part 2: Data consolidation pattern -- focus on exchanging, reformatting, splitting, and merging data and do not have sophisticated support for data quality.

Transformation requirements are most frequently influenced by the variety of data sources and as such, the capability to define the characteristics of complex transformations is key. The more complex and varied the transformation requirements, the more sophisticated the run-time transformation or the data cleansing server must be.

Type of source model, interfaces, protocols

Product implementations of the data cleansing pattern vary in the range of formats they can support for input data. Applied fully and correctly, the data cleansing pattern removes the complexity of various source formats, interfaces, and protocols so that developers need only care about one model, interface, and protocol.

Solution delivery time

Product implementations of the data cleansing pattern frequently provide very sophisticated tooling support to specify the cleansing rules and to visualize the effectiveness of those rules on the input data. For example, which records are identified as a match based on the probabilistic matching rules? Many of these implementations have predefined (cleansing rule) operations that are provided out-of-the-box with the products. This feature allows the implementer to apply this approach very efficiently in a short period of time.

This pattern is often applied against data with a very low degree of quality and consistency. As a result, iterative refinements of the cleansing rules may be required. The effort involved is directly related to the characteristics of the problem at hand, not to the pattern approach.

Skill set and experience

There are two primary skill sets required for data cleansing. First, analytical skills are needed to understand and define the nuances associated with the data sources and targets with respect to format, values, ranges, and other characteristics. If a tooling approach is used, the analyst also requires product-specific knowledge in order to define the cleansing rules. Second, the developer needs to understand the SOA concepts, standards, and technologies necessary to implement the cleansing services. In addition, if a tooling approach is used for the run-time server, the developer and system architect require the knowledge and experience to ensure that the server can scale to meet the required service-level agreements.

Cost of development

The development costs depend largely on the complexity of the data cleansing task. The costs are very low if only basic standardization rules need to be applied. The more complex the cleansing rules become, the higher the implementation development costs will be. The higher costs are associated with the data analysis, along with the iterative development and testing cycles necessary to address the complexity.

Reusability

Reusability in the data cleansing pattern is realized through the definition of cleansing rules that can be applied at the record level through a service or to bulk data through a batch process. The second opportunity for reuse is the use of a common server process for the execution of the cleansing rules.





Conclusion

The data cleansing pattern specifies a recommended practice for how to improve the data quality of persistent data either at entry or later.

Focus areas to apply the data cleansing pattern

  • Improve data quality and consistency for critical information needs, such as customer name and address. As described earlier, the data cleansing pattern can turn inconsistent (and hence, useless) data into a valuable strategic asset.

Risk areas to apply the data cleansing pattern

  • The lack of a common understanding and definition of core business data.
  • Lack of stable business guidelines and governance for data quality. The specification of data cleansing rules needs to be based on guidelines that are approved by business experts and that are largely stable. If quality guidelines change too frequently or are not approved, the maintenance of cleansing rules and a constant deployment of new rules can lead to significant workload.
  • Lack of cross-departmental support, especially if the information is accessed and modified by applications that cross departmental boundaries. Although a single organization can improve the quality of the data that it owns, the data may be altered by applications outside of its sphere of control, which can result in quality degradation.




Product mapping

The following IBM products implement this pattern:

  • IBM Information Server is a high-volume data integration platform for data cleansing, transformation, and relocation.
  • WebSphere QualityStage is one of the core components of this portfolio that provides data cleansing technology. WebSphere QualityStage supports the standardization, enrichment, and matching of freeform text data. Critical for Master Data Management solutions, WebSphere QualityStage enables the linking and de-duplication of records across or within data sources by applying sophisticated parsing rules and statistical matching functions. The tool also enables the selection of a single "best" record for survival across sources by automatically cross-populating entity values that are blank, missing, or incomplete, for a single comprehensive view of data that spans multiple systems.
  • WebSphere Information Services Director (also a component of IBM Information Server) exposes information management capabilities as services. It packages information integration logic, cleansing rules, and information access as services. This capability insulates the developer from the underlying provider of this functionality. Most important is its capability to expose data cleansing through a service-oriented interface such as Enterprise JavaBeans (EJB components), Java™ Message Service (JMS), or Web services. This product provides the foundation infrastructure (including load balancing and fault tolerance) for information services. It realizes the Information Service Enablement component illustrated in Figure 2. The WebSphere Information Services Director is built upon the same powerful metadata infrastructure as WebSphere QualityStage.

Acknowledgements

The authors wish to thank Jonathan Adams and Lou Thomason for their support in writing this article and in developing this pattern.



Resources

Learn

Get products and technologies

Discuss


About the authors


Dr. Guenter Sauter, senior IT architect and manager, leads the architecture team in the Information Service Competency Center in IBM Software Group. This team provides patterns and guidance on the information perspective of a Service-Oriented Architecture based on client engagements and experiences.



Bill Mathews is a senior IT architect in the IBM Financial Services Sector for the Americas and is the architectural lead for Information Integration. He has over 25 years of experience in the IT industry, is an Open Group Master Certified IT Architect, and holds IBM IT Architect and Consultant certifications. His areas of expertise are information integration, enterprise application integration, and Web application development. Bill holds a Bachelor of Science degree in Computer Science from Hofstra University and a Master of Business Administration degree from Union College.



Ernest Ostic is a product specialist at IBM, focusing on solutions for real-time data integration. He has been with IBM, and formerly with Ascential Software, for over nine years in various roles in product management and sales. He is currently involved with strategies related to SOAs for the Information Server product line. He is a graduate of Boston College.