Level: Intermediate

Dr. Guenter Sauter (gsauter@us.ibm.com), Senior IT Architect and Manager, IBM

06 Apr 2007

The data cleansing pattern specifies rules for data standardization, matching, and survivorship that enforce consistency and quality on the data it is applied against. The pattern validates the data (whether persistent or transient) against the cleansing rules and then applies changes to enforce consistency and quality. In this article, you learn how to apply the data cleansing pattern within a Service-Oriented Architecture (SOA) context. This pattern specification helps you, as a data or application architect, make informed architectural decisions and improve decision guidelines.

Information is one of the most strategic assets of any organization, and data quality is a key prerequisite for using information to gain insight and advantage in the marketplace. Poor data turns potentially valuable information into worthless byte streams. For example, incorrect address information for a "party" (such as a client, patient, or customer) prevents accurate and relevant insight from being surfaced: whether two parties are in fact the same, how many clients there really are, or what a 360-degree view of a customer looks like. Sloppy data can decrease customer satisfaction, complicate communication, and increase the cost of working around the problem, among other challenges.
Data quality issues are most severe when information is scattered across isolated and heterogeneous data stores. The heterogeneous and isolated nature of such an environment often goes hand in hand with differing formats and inconsistent values. Even within a single database, the quality of persisted data is not necessarily better if the appropriate rules are not enforced. Whether information is at rest in a data store or manipulated on the fly by an application, data quality is often either not enforced at all or controlled by different components using inconsistent rules embedded in application code. To turn information into insight and to leverage its significant value, data quality needs to be addressed by applying data cleansing consistently; that is, by using consistent cleansing rules throughout the enterprise, not only in the database layer but also in the application and process layers.

After a brief look at the value of this approach, you'll learn about the context in which the data cleansing pattern should be applied. Next, you'll learn about the problem and the solution approach for this pattern. Finally, you'll get a summary of the focus and risk areas, as well as the constraints, of this pattern.
Value proposition of the data cleansing pattern

Three major values that this pattern can deliver are:
Let's take a closer look at each of these.

Consistency and quality advantage

The most significant benefit of applying the data cleansing pattern is that it improves the consistency and quality of information, whether it is persisted in a database or processed by an application. Applying the data cleansing pattern in an SOA context gives business processes the capability to manage and assure data quality "at the point of its initial collection." Applying data cleansing before the information is persisted allows business-defined validation mechanisms to be incorporated at the point of entry, such as in data entry portals.

Reduced development and maintenance costs

This pattern provides recommended practices on how to specify cleansing rules and how to apply them most effectively to persistent and transient data. Many implementations of the data cleansing pattern provide sophisticated tools to develop, test, and deploy the cleansing rules. These tools can reduce the overhead of projects where cleansing rules would otherwise be specified and maintained manually in a laborious fashion.

An important aspect of the data cleansing pattern is its focus on reusability at an enterprise level. If each database and application implements its own cleansing process, the result is inconsistent cleansing rules: data quality might improve, but not effectively, not consistently, and not to the level required. This pattern describes how the same cleansing rules can be applied consistently to a broad range of consumers.
The traditional context of the data cleansing pattern is the database layer, which is where it is most often applied. Based on the increasing interest in SOA, we see new opportunities to apply this pattern in an SOA context. The data cleansing pattern has typically been applied to the cleansing of name and address information, but it can also be applied to any freeform text, such as descriptions of products within inventory systems. Freeform text most commonly refers to data manually entered rather than selected from a standard pick list, or data entered without any formatting, such as a complete address in a single field. The data cleansing pattern is defined as the standardization, clean-up, and ultimately matching (or de-duplication) of records based on the content of freeform text fields.
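To make the freeform-text aspect concrete, the following minimal sketch (in Java) parses a single-field, comma-separated, U.S.-style address into structured components. The class name, regular expression, and field names are invented for this article and only illustrate the idea; product implementations rely on much richer lexical analysis and locale-aware dictionaries.

Listing 1. Parsing a freeform address field (illustrative sketch)

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FreeformAddressParser {

    // Rough pattern for a comma-separated, U.S.-style address:
    // "<house number> <street>, <city>, <state> <zip>"
    private static final Pattern US_ADDRESS = Pattern.compile(
            "^(\\d+)\\s+([^,]+),\\s*([^,]+),\\s*([A-Z]{2})\\s+(\\d{5})(?:-\\d{4})?$");

    public static Map<String, String> parse(String freeform) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = US_ADDRESS.matcher(freeform.trim());
        if (m.matches()) {
            fields.put("houseNumber", m.group(1));
            fields.put("street", m.group(2).trim());
            fields.put("city", m.group(3).trim());
            fields.put("state", m.group(4));
            fields.put("postalCode", m.group(5));
        } else {
            // Input that cannot be parsed is flagged for review, not silently dropped.
            fields.put("unparsed", freeform);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("125 N Main St, Springfield, IL 62701"));
        // {houseNumber=125, street=N Main St, city=Springfield, state=IL, postalCode=62701}
    }
}

Once the values are separated into attributes like this, the standardization and matching rules described later in this article can be applied to them.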
In the traditional non-SOA context, this process is usually a batch function, performed periodically to meet the information integrity requirements of Master Data Management and to reach a single, consistent view of important corporate entities. More traditionally, data cleansing has also been applied in many data warehousing projects. To support analytics, reporting, or a 360-degree view of entities such as the customer, data is collected from multiple silos around the enterprise. Those islands of information can result from independent divisions, product groups, or subsidiaries, or from a one-time event such as a merger or acquisition. If in-house technology does not exist, or additional enrichment of data with demographic or marketing detail is desired, the collected data is often sent to outside firms for cleansing. Some time later, perhaps days or even longer, the data is returned, presumably de-duplicated and enhanced. If processed in house, the traditional data cleansing pattern is usually a "point-in-time" function, performed nightly or weekly for general data clean-up or for loading to operational data stores, reporting systems, or data warehouses.

A major retail jewelry outlet uses the data cleansing pattern on a weekly basis to reconcile customer names from an array of systems that handle new customer accounts, consumer loyalty, lead generation, and billing. Each data entry point is a possible source of duplication for customer names or for multiple customers who may live at the same location. The approach to de-duplicating and making sense of such information is one example of applying the data cleansing pattern. Although initially adopted as a way to save on such simple things as postage, the data cleansing pattern is now relied on by forward-thinking companies to gain further insight into consumer spending patterns, to better identify large-volume buyers, and to consolidate sales inquiries, customer support, and billing into one location to enhance the customer experience. Figure 1 illustrates the high-level architecture of applying the data cleansing pattern in the traditional context.

Figure 1. Traditional context of the data cleansing pattern

The SOA context for the data cleansing pattern takes advantage of sophisticated standardization and matching techniques and extends them to the front line of near-real-time applications. Viewed in this context, the data cleansing pattern allows an enterprise to extend its capabilities for validation and matching to the point of creation. Furthermore, the same de-duplicating and matching logic used in batch operations can be integrated with sophisticated search methodologies or used to enhance the capability to locate customer information when information or identifiers are unknown or incomplete. The SOA context for data cleansing allows for standardization and matching of individual request strings. A single name or address is dynamically cleansed and returned in a standardized format or, in the case of discovery, returned along with a set of possible candidates identified in the matching process. In data entry solutions, this improves data representation (consistent abbreviations of street types and states, for example) and increases the odds of finding a duplicate before it is persisted. Avoiding the problems caused by duplicates in advance is far less costly than trying to correct them afterward or suffering the consequences of mishandled accounts because a single view of a customer was not possible.
A point-of-sale application, represented by the portal on the upper left in the example in Figure 2, might be used for a customer loyalty module managing profiles of common customers. In the case of the major jewelry chain mentioned earlier, this application subset is enhanced by applying the data cleansing pattern in an SOA context. Imagine a visitor to the store who has forgotten, or never knew, their customer loyalty code. A casually entered (and potentially error-prone) textual customer name is dynamically matched against the central master data store in real time, and a list of candidate names is returned. From the candidates, the actual customer profile can be found, or the visitor can be verified as a new customer. The store clerk can now provide services earned by a loyal customer, such as free appraisals, and can suggest new gift purchases based on the profile, such as for upcoming birthdays or anniversaries. The customer gets extra attention and feels like someone special. The potential revenue benefits and improved customer experience are realized by deploying data cleansing as a service. Matching and standardization rules are reused, taking advantage of the same functions applied during batch runs.

Figure 2. SOA context of the data cleansing pattern
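The following sketch outlines how such a loyalty lookup might invoke a cleansing and matching service. The interface and class names (CustomerMatchService, CandidateMatch, LoyaltyLookup) and the 0.85 threshold are hypothetical placeholders for illustration, not the API of any specific product.

Listing 2. Invoking a matching service from a point-of-sale lookup (illustrative sketch)

import java.util.List;

interface CustomerMatchService {
    // Standardizes the input name and returns candidates whose probabilistic
    // match score meets the threshold (0.0 - 1.0), best match first.
    List<CandidateMatch> findCandidates(String freeformName, double minScore);
}

class CandidateMatch {
    final String customerId;
    final String standardizedName;
    final double score;

    CandidateMatch(String customerId, String standardizedName, double score) {
        this.customerId = customerId;
        this.standardizedName = standardizedName;
        this.score = score;
    }
}

class LoyaltyLookup {
    private final CustomerMatchService matchService;

    LoyaltyLookup(CustomerMatchService matchService) {
        this.matchService = matchService;
    }

    // Candidates are shown to the store clerk to confirm with the customer;
    // an empty list suggests that a new customer profile should be created.
    List<CandidateMatch> lookup(String nameAsTyped) {
        return matchService.findCandidates(nameAsTyped, 0.85);
    }
}

The key design point is that the lookup reuses the same standardization and matching rules that the batch runs apply, rather than duplicating simplified logic in the portal.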
The data cleansing pattern addresses the challenge of improving data quality and ensuring data consistency at the metadata level as well as at the data level itself. Typical causes of poor data quality are:
For example, the definition or constraint that a postal code must be a valid number may be missing or not enforced consistently. Implementations may fail to check whether the number is valid, or whether it is a number at all. Formats used to represent the same real-world entity might conflict (for example, a number in one system versus a string in another to represent a postal code). As just described, inconsistencies can surface at the metadata level as well as at the data level itself. Even if the data models are defined appropriately and consistently, a lack of appropriate integrity constraints for data values can lead to quality and consistency issues. The same real-world entity might be represented with different data values, such as different part numbers for a product or different weight measurements. Some of the most common problems include:
The design-time characteristics of the data cleansing pattern pivot around establishing standard rules for conversion and cleansing of the data sources, defining matching criteria to support de-duplication, and identifying how to determine the most current or correct data. As you might imagine, design is the most critical and complex phase in the data cleansing process. Once this task is complete, the cleansing, matching, and survivorship rules are applied in the run-time processes. Designers who apply this pattern must specify cleansing rules, possibly supported by appropriate tools. This activity can be categorized into four major steps:
As indicated in the problem statement, data values might be represented in free text or in aggregated fields. After data values are correctly assigned to attributes, the designer needs to specify how to standardize the values. This means that the designer needs to find answers to questions such as:
Some of the standardization rules are straightforward and do not require significant effort, such as converting mixed-case data into upper case. Some are relatively advanced and require access to a database that stores correct values, such as the correct association between postal code, city, and state in the U.S. Standardization rules may also have to be context sensitive: a string such as "St. Virginia St." is determined to have a street name of "St. Virginia" and a street type of "street" (assuming that this is a U.S. address). The two occurrences of "St.", while technically identical, have different meanings when interpreted through the eyes of an intelligent rule set.

In many cases, the designer has to identify potential duplicate records. Unfortunately, even after standardizing the data values, the records are often not identical. In one record, a person's name might be "J. Smith" and in another record it might be "John Smith". One of the challenges when identifying a match is to determine how likely it is that "J. Smith" is the same as "John Smith". Obviously, this depends on the other information contained in the record. If the address is exactly the same, it might be very likely. It will also depend on how many people share the same first and last name: it is less likely that there are two "April Back-Cunninghams" in the same city than two "Robert Johnsons". The same pattern that applies to parsing and lexical analysis of freeform names and addresses can be applied to standardize and de-duplicate product listings or parts for enhanced inventory control. It should be noted that although we have used mostly U.S. addresses, this approach certainly applies to any country and not just to addresses.

There are two methods for matching: deterministic matching and probabilistic matching. Deterministic matching is based on business rules and algorithms that define a match. The advantage of this method is that it provides a clear answer: two records either do or do not match. However, the rule set is typically limited to rules and algorithms of simple or medium complexity. The probabilistic method leverages statistical algorithms and fuzzy logic to indicate a match. This approach uses more powerful mechanisms to identify a match and provides the probability that records match, such as 93%. The degree of confidence in the match should be balanced against the value of the information being addressed and the cost of determining the match.

Based on the matching rules, the designer specifies survivorship rules that determine which records, and which attributes of a record, reflect the correct information and are carried forward, and which need to be discarded. With the core part of specifying the cleansing rules complete, the final step is to specify how to gather the input data (using a database extract or query, or a service request) and how to provide the output of the cleansing process (applying it to a database or providing it as a service response). Figure 3 shows an overview of this activity.

Figure 3. Design time aspects of the data cleansing pattern
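As a concrete illustration of the design-time rules just discussed, the following minimal sketch shows a standardization rule that expands a trailing street-type abbreviation and a simple deterministic matching rule that treats "J. Smith" as a candidate match for "John Smith". The class and rule names are invented for this article; real cleansing servers use far richer, context-sensitive rule sets (for example, also recognizing a leading "St." as "Saint") and statistical models for probabilistic matching.

Listing 3. Simple standardization and deterministic matching rules (illustrative sketch)

import java.util.Map;

public class NameAndAddressRules {

    private static final Map<String, String> STREET_TYPES = Map.of(
            "ST", "STREET", "AVE", "AVENUE", "RD", "ROAD", "BLVD", "BOULEVARD");

    // Standardization: upper-case the value and expand a trailing street-type abbreviation.
    public static String standardizeStreet(String street) {
        String[] tokens = street.toUpperCase().replace(".", "").split("\\s+");
        String last = tokens[tokens.length - 1];
        tokens[tokens.length - 1] = STREET_TYPES.getOrDefault(last, last);
        return String.join(" ", tokens);
    }

    // Deterministic rule: same last name, plus either the same first name or a matching
    // initial, counts as a candidate match. This yields a yes/no answer; a probabilistic
    // matcher would instead return a likelihood such as 0.93.
    public static boolean namesMatch(String a, String b) {
        String[] left = a.toUpperCase().replace(".", "").split("\\s+");
        String[] right = b.toUpperCase().replace(".", "").split("\\s+");
        if (left.length < 2 || right.length < 2) return false;
        String lastA = left[left.length - 1], lastB = right[right.length - 1];
        if (!lastA.equals(lastB)) return false;
        String firstA = left[0], firstB = right[0];
        return firstA.equals(firstB)
                || (firstA.length() == 1 && firstB.startsWith(firstA))
                || (firstB.length() == 1 && firstA.startsWith(firstB));
    }

    public static void main(String[] args) {
        System.out.println(standardizeStreet("St. Virginia St."));   // ST VIRGINIA STREET
        System.out.println(namesMatch("J. Smith", "John Smith"));    // true
    }
}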
It is important to note that the data cleansing pattern is often applied together with other patterns; the green boxes in Figure 3 are such an example. For the developer or designer to specify cleansing rules, a sufficient understanding of the data sources against which the data cleansing pattern will be applied is necessary. This includes identifying and understanding the information semantics, such as the meaning of the data model elements, as well as structural information. Data profiling will help to derive this knowledge from the underlying data sources. In many cases, the data cleansing pattern is applied together with the data consolidation pattern. In such a scenario, the mappings between data elements from the sources into the target need to be specified (called integration modeling in Figure 3).

The data cleansing service receives data with an undetermined level of data quality as input. Typically, the service is either invoked with this input as a parameter of the service request (by value) or the service gathers the data from one or more specified sources (by reference). The service then applies the cleansing rules against the source data. Depending on the complexity of the data cleansing rules, this process may require a lookup within a database or a dictionary to validate the correctness of information (such as a correct combination of postal code, city, and state). Upon successful completion of the processing, the cleansed information is returned as the service's reply. In a traditional context, the output is most often applied to a database or further processed in a data consolidation process. (See Resources for more articles about data consolidation.)

Data cleansing services frequently require a high degree of sophistication implemented by a data cleansing server that can provide high levels of performance and scalability through the exploitation of parallel processing. This server is then able to process and cleanse extremely large data volumes in batch mode as well as single records in a real-time invocation environment. The requirement to process large data volumes is illustrated by a scenario involving the consolidation and cleansing of the complete content of multiple legacy systems. At the other end of the spectrum, the data cleansing server may be invoked through a Web service by a portal application to check an address that was entered on the screen. In this situation, the server may need to respond in real time to a high number of concurrent requests.
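A minimal sketch of the by-value invocation described above might look like the following, assuming invented type names (AddressRecord, PostalDirectory, AddressCleansingService): the service receives a record in the request, validates the postal code against a directory lookup, and returns the cleansed record in the reply. None of these types come from a specific product; they only illustrate the shape of the interaction.

Listing 4. A by-value cleansing service with a dictionary lookup (illustrative sketch)

import java.util.Optional;

class AddressRecord {
    String city;
    String state;
    String postalCode;
}

class CityState {
    final String city;
    final String state;
    CityState(String city, String state) { this.city = city; this.state = state; }
}

interface PostalDirectory {
    // Returns the canonical city/state pair for a postal code, if known.
    Optional<CityState> lookup(String postalCode);
}

class AddressCleansingService {
    private final PostalDirectory directory;

    AddressCleansingService(PostalDirectory directory) {
        this.directory = directory;
    }

    // "By value" invocation: the uncleansed record arrives in the request and the
    // cleansed record is returned in the reply. A "by reference" variant would instead
    // fetch the records from a named source before applying the same rules.
    AddressRecord cleanse(AddressRecord input) {
        directory.lookup(input.postalCode).ifPresent(canonical -> {
            // The directory is treated as authoritative for the city/state combination.
            input.city = canonical.city;
            input.state = canonical.state;
        });
        return input;
    }
}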
When applying the data cleansing pattern, it is important to understand how it impacts the following nonfunctional requirements.

Frequency of transaction execution

The ability of the cleansing service to execute data cleansing transactions at a high rate is determined by the rate at which the data cleansing server can access the input data and apply the cleansing rules. The more complex the cleansing rules are, and the more lookups are required, the longer the execution of the cleansing activity will take. A data cleansing server that can exploit parallel processing capabilities will have a significant advantage over other approaches; IBM® WebSphere® QualityStage (see Resources) is one example of an off-the-shelf data cleansing server.

Performance/transaction response time

The transaction response time (applying cleansing rules against input data and returning the output) is determined by the complexity of the cleansing rules and the ability of the cleansing server to efficiently process the data. Cleansing server implementations that can compile the rules and exploit parallel processing capabilities will perform better than alternative implementations. It is fairly common to apply the data cleansing pattern against sets of bulk data as well as individual records. Therefore, the data cleansing server needs to be able to scale in order to process large data volumes.

The data cleansing activities (parsing or separation of values, standardization, matching, and survivorship) are specified as cleansing rules. They are ultimately applied to transform input data, which may be of low quality, into output with a higher level of consistency and quality. Since these transformation rules may be numerous and complex, many implementations of the data cleansing pattern deploy the cleansing rules as transformation operations using a data cleansing server. The transformation capabilities of the data cleansing pattern are specialized and focus on improving data quality and integrity by standardizing and matching data. More general data transformation approaches -- such as those described in Part 2: Data consolidation pattern -- focus on the exchange, reformatting, splitting, and merging of data and do not have sophisticated support for data quality. Transformation requirements are most frequently influenced by the variety of data sources, and as such, the capability to define the characteristics of complex transformations is key. The more complex and varied the transformation requirements, the more sophisticated the run-time transformation or the data cleansing server must be.

Type of source model, interfaces, protocols

Product implementations of the data cleansing pattern vary in the range of formats they can support for input data. Applied fully and correctly, the data cleansing pattern removes the complexity of various source formats, interfaces, and protocols so that developers only need to care about one model, interface, and protocol. Product implementations of the data cleansing pattern frequently provide very sophisticated tooling support to specify the cleansing rules and to visualize the effectiveness of those rules on the input data -- for example, which records are identified as a match based on the probabilistic matching rules. Many of these implementations provide predefined cleansing rule operations out of the box. This feature allows the implementer to apply this approach very efficiently in a short period of time.
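To illustrate the bulk-processing requirement, the following sketch applies the same per-record cleansing logic in parallel across a batch of records using a thread pool. It is a simplified stand-in rather than a description of any particular product; cleanseRecord is a placeholder for the designer's actual rule set, and the pool size would be tuned to the server and its service-level agreements.

Listing 5. Parallel batch cleansing with a thread pool (illustrative sketch)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchCleanser {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Placeholder for the real standardization/matching rules.
    String cleanseRecord(String raw) {
        return raw.trim().toUpperCase();
    }

    // Submits every record to the pool and collects the cleansed results in input order.
    List<String> cleanseBatch(List<String> rawRecords)
            throws InterruptedException, ExecutionException {
        List<Future<String>> futures = new ArrayList<>(rawRecords.size());
        for (String rawRecord : rawRecords) {
            futures.add(pool.submit(() -> cleanseRecord(rawRecord)));
        }
        List<String> cleansed = new ArrayList<>(futures.size());
        for (Future<String> f : futures) {
            cleansed.add(f.get());
        }
        return cleansed;
    }

    void shutdown() {
        pool.shutdown();
    }
}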
This pattern is often applied against data with a very low degree of quality and consistency. As a result, iterative refinement of the cleansing rules may be required. The effort involved is directly related to the characteristics of the problem at hand, not to the pattern approach itself.

There are two primary skill sets required for data cleansing. First, analytical skills are necessary to understand and define the nuances associated with the data sources and targets with respect to format, values, ranges, and other characteristics. If a tooling approach is used, the analyst will also require product-specific knowledge in order to define the cleansing rules. Second, the developer will need to understand the SOA concepts, standards, and technologies necessary to implement the cleansing services. In addition, if a tooling approach is used for a run-time server, the developer and system architect will require the knowledge and experience to ensure that the server is scalable enough to meet the required service-level agreements.

The development costs depend largely on the complexity of the data cleansing task. The costs are very low if only basic standardization rules need to be applied. The more complex the cleansing rules become, the higher the development costs will be. The higher costs are associated with the data analysis, along with the iterative development and testing cycles necessary to address the complexity.

Reusability in the data cleansing pattern is realized through the definition of cleansing rules that can be applied at the record level through a service or through a batch process for bulk data. The second opportunity for reuse is through a common server process for the execution of the cleansing rules.
The data cleansing pattern specifies a recommended practice for how to improve the quality of persistent data, either at the point of entry or later.

Focus areas to apply the data cleansing pattern
Risk areas to apply the data cleansing pattern
The following IBM products implement this pattern:
The authors wish to thank Jonathan Adams and Lou Thomason for their support in writing this article and in developing this pattern.