Skip to content

good practice and innovation
about us infoKits Tools & Techniques Publications Events
You are here: Home » infoKits » System Implementation infoKit » Conducting a System Implementation » Data and Reporting » Data Cleansing

Data Cleansing

Ed Wrazen, Vice-President Marketing for Avellino, says 'A massive 88% of data integration and migration projects either exceed timescales or budgets or fail altogether. This high rate of failure applies to ERP projects and is predominantly because of the poor quality of data in legacy systems'.

We have all experienced the result of poor data management when we have received multiple copies of the same letter due to duplicate records, often with our names incorrectly spelt as well. This shows that, even when a system is fully operational, data errors can cost an organisation unknown amounts of money, processing resource and goodwill. Those costs are increased when replacement systems are implemented with error-strewn data. According to the Quality Assurance Institute in the USA, 'It can cost up to 100 times more to rectify data quality problems at the testing phase than at the design phase of a project' (2004).

Organisations tend to carelessly rely on the quality and consistency of the data in their existing systems and it is often only when a replacement system is tested after data is migrated from disparate sources that data errors and inconsistencies surface. Therefore, it is vitally important to ensure that existing data is correct before migrating it to your new system.

Some example of the types of data headaches that emerge when considering data migration are outlined below.

  • data incorrectly entered
  • data not entered at all
  • data entered more than once (i.e. duplicate records)
  • data partially entered and, therefore, incomplete
  • data correctly entered, but no longer used (i.e. redundant data)
  • data once correctly entered, but now superseded (i.e. old or out-of-date data)
  • data entered inconsistently (e.g. Smith WH; Smith W.H.; W.H. Smith)
  • data entered according to different codes (e.g. Male = 'M', 'Male' or '1')
  • data entered according to different standards (e.g. Dates shown as DD/MM/YY or YY/MM/DD)
  • data artificially contrived to mean something different (e.g. A date of birth entered to record that the person was over 65 years of age rather than the true date they were born) and
  • data correctly entered, but incorrectly processed by the system, and not recognised and, consequently, not amended or, worse still, exacerbated by cumulative errors

Data Cleansing is the name given to the process of detecting, removing and/or correcting errors in data held. The aim of data cleansing is not just to clean the data in a database, but also to bring consistency to different sets of data that have been merged from separate databases.

It is essential to initiate data cleansing procedures well in advance of the time that data is required to be transferred to the new system, and to assign responsibilities and put better procedures in place (e.g. data validation routines, business rules, user training, coding systems, agreed terminology, entry forms, instruction documents) to ensure that clean data is entered in the first place and remains clean.

Time and resources must be allocated for detecting data errors, writing routines to correct errors and/or entering correct data, and documenting the new data validation procedures when planning data cleansing activities. The new procedures may increase data entry costs, but will certainly save the far larger cost of correcting data later or not correcting it at all.

There are a number of tools on the market that can help clean data. They can be expensive, but some companies offer a bureau service to inspect and clean data on a regular basis. Simpler techniques, such as standardising names and addresses and cross-checking against an established database, are also available and can prove to be more effective and more cost-effective.


Bookmark and Share
If you can read this text, it means you are not experiencing the Plone design at its best. Plone makes heavy use of CSS, which means it is accessible to any internet browser, but the design needs a standards-compliant browser to look like we intended it. Just so you know ;)