Demystifying Duplicate Values, Their Causes, and Prevention
Discover the causes of duplicate values, their negative impact on data integrity, predictions, and analysis, and the methods to prevent them
Duplicate values are one of the major factors that undermine data integrity. A duplicate is simply a copy of a record already present in the dataset, appearing multiple times. Duplicates are a very common problem: they introduce bias into the dataset by inflating the occurrence of some records over others, which leads to inaccurate and inefficient predictions and analysis. It is therefore important to understand the impacts, causes, and prevention measures for duplicates in order to build a good prediction model.
Impact of Duplicate Values
Duplicate values have a lasting impact on the validity of the data as well as on any prediction model developed from it. Because they make the data dirty, analysis conducted on the dataset returns misguided insights and results in a poor machine learning model. The following are some impacts of duplicate values on our analysis:
Inconsistency
Data consistency is essential for a good-quality dataset and refers to the property that data presents the same view across all locations where it is stored, i.e. every value of a record is kept in the same form everywhere. Duplicate values introduce inconsistency because there may be records that refer to the same instance but carry a different, partially updated set of fields, making it difficult to tell which one is the most up-to-date and accurate.
Example
In a FIFA database, we can have two records for the same football player, e.g. L. Messi and Lionel Messi, both of which represent the same player but are stored as two different records.
Increased Costs
A dataset is usually stored in some sort of combined data source from which the data is extracted for the machine learning process. With duplicate values, the size of the dataset grows because the same records are stored multiple times, which increases the cost of storage. The effect becomes more pronounced on large datasets: as the number of records and duplicates grows, so does the storage cost.
Example
Assume that a dataset takes 20 MB of storage. If each record has at least one duplicate, the total storage for the dataset grows to at least 40 MB.
Repetitive Preprocessing
In the preprocessing step, each record in the dataset passes through the steps defined in the preprocessing pipeline, each of which takes some time. When duplicates are present, the same record is processed more than once, which adds work to the preprocessing stage and consumes more time than is actually needed. Duplicate values therefore delay the machine learning pipeline by wasting time on the repeated processing of redundant records.
Example
Assume that we are generating a bio-data document from the information fields stored about a person in a dataset. With duplicate values present, we would generate the same document multiple times for the same person.
Incorrect Insights
Another impact of duplicate values is that they provide wrong insights about the data. By repeating values in a dataset, duplicates inflate the occurrence of certain records that should be unique, creating a bias in the dataset towards those records or the fields they represent. This misleads the analysis of the data via data mining and visualization techniques and thus gives us wrong insights on the matter.
Example
Assume that there are 4 records for the sales of a particular product, each sale worth 5000 PKR. If 2 of these sales are duplicated, we would have 6 records instead of 4, and the total sales would come out as 30000 PKR instead of 20000 PKR.
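To make this concrete, here is a minimal sketch in Python using pandas (assumed available); the sale IDs and amounts below are invented for the illustration:

```python
import pandas as pd

# Four genuine sales of 5000 PKR each, with sales 2 and 3 accidentally duplicated
sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4, 2, 3],
    "amount_pkr": [5000, 5000, 5000, 5000, 5000, 5000],
})

print(sales["amount_pkr"].sum())  # 30000 -- inflated by the duplicates

deduped = sales.drop_duplicates(subset="sale_id")
print(deduped["amount_pkr"].sum())  # 20000 -- the correct total
```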
Causes for Data Duplication
Data duplication doesn't occur on its own; it has a set of underlying causes. These causes are as follows:
Human Generated Error
As the saying goes, "To err is human": a person performing manual data entry into the system that acts as a data source can easily generate errors.
Example
Consider a product sales dataset for a superstore, one of whose sources is a mini-branch where data entry is managed manually. In such a case, the person responsible for data entry may enter the record for the sale of a certain product twice, or slightly tweak the value of one of its fields, e.g. write the price as 90.4 instead of 9.04.
System Generated Error
Systems are not immune to errors either and, just like humans, may end up duplicating a particular set of records. This usually happens when there is no data governance and no data consistency check in place for such cases.
Example
Consider a product sales dataset for a superstore, one of whose sources is a mini-branch that uses an automated system for data entry. In such a case, the system may not be set up to validate whether a slight variant of a particular entry has already been made, e.g. it may write two separate records for Coke and Coca-Cola, although both represent the same product.
Migration Error
Duplication during data migration often occurs when there is no validation set up for the migration from one data source to another. This can happen when the migration system restarts the migration process after a failure and no measures are taken to roll back the changes already made to the dataset.
Example
Consider a product sales dataset for a superstore, one of whose sources is a mini-branch that uses an automated migration system to store records in the dataset. In such a case, the migration system may fail partway through and restart from the first record, while no measures have been taken to roll the data back to its previous state, leaving the already-migrated records duplicated.
Merge Error
Data Merging is the process of combining two or more datasets into a single dataset and is used when we have to collect data from multiple sources into a single source. If not properly handled, data merging can lead to duplication of records. This happens in cases where there is no check on whether a particular record is present in two or more sources.
Example
Consider a product sales dataset for a superstore, generated from the data of two of its branches. Assume that some products in the two branches represent the same product at the same price but are named differently in their records, e.g. one branch has stored it as "Coke" and the other as "Coca Cola". In the absence of a validator, two different records would end up in the combined dataset.
Avoiding Data Duplication
There are many measures that can be taken to avoid data duplication in the first place. These measures are as follows:
Validation Checks
Validation checks are necessary to avoid data duplication problems. If a system has a well-defined validation step for duplicate or closely resembling records, it will never insert the same record into the dataset twice. This keeps the system safe from erroneous records and keeps the data clean.
Example
Consider a product sales dataset for a superstore, where the data entry operator has to enter the names of products along with their number of sales for their respective branch. It might happen that the operator writes down the same product name or a slight variant of it again, but if the system already has some validation setup, it will never let the user re-enter a product that has already been fed to the dataset.
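As a rough sketch of such a check (the product names and the normalization rule here are hypothetical), the system can refuse an insert when a normalized form of the name is already stored:

```python
# Hypothetical in-memory store of normalized product names
existing_products = {"coca cola", "pepsi"}

def insert_product(name: str, store: set) -> bool:
    """Insert a product only if a normalized form of it is not already stored."""
    key = name.strip().lower()
    if key in store:
        print(f"Rejected duplicate entry: {name!r}")
        return False
    store.add(key)
    return True

insert_product("Coca Cola", existing_products)  # rejected: already present
insert_product("Fanta", existing_products)      # accepted: new product
```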
Data Governance
Data Governance is necessary in order to avoid duplication problems. A system or application with a proper governance system in place would never let a person enter a value that is already present in the dataset.
Example
Consider a product sales dataset for a superstore, where the data entry operator has to enter the names of products along with their number of sales for their respective branch. If a proper governance process is established, the system would never let the operator enter the same product name again; instead, it would ask whether the entry refers to a product that resembles one already stored in the dataset and, if so, update its value.
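One way such a rule could look in code is an "update instead of duplicate" step; this is only a toy sketch, and the product names and sales counts are made up:

```python
# Hypothetical running totals of units sold, keyed by normalized product name
sales_by_product = {"coca cola": 120}

def record_sales(name: str, units: int, table: dict) -> None:
    """Update the existing record if the product is known; otherwise create it."""
    key = name.strip().lower()
    if key in table:
        table[key] += units   # governance rule: update, don't duplicate
    else:
        table[key] = units    # genuinely new product

record_sales("Coca Cola", 30, sales_by_product)
print(sales_by_product)   # {'coca cola': 150}
```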
Merge Validation
Merge validation is the process of verifying that a merge operation from multiple sources completed without generating a set of duplicate records.
Example
Consider a product sales dataset for a superstore, which is generated by merging two different datasets from different sources. If the merge constraints are set up properly, then even with the occurrence of the same record in both datasets, the record will be inserted in the combined dataset only once.
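A minimal pandas sketch of this idea (the branch data and prices are invented) merges the sources and then keeps only one copy of any record that appears in both:

```python
import pandas as pd

# Hypothetical branch extracts that both contain the same Coke record
branch_a = pd.DataFrame({"product": ["Coke", "Sprite"], "price": [120, 110]})
branch_b = pd.DataFrame({"product": ["Coke", "Fanta"], "price": [120, 100]})

# Merge first, then validate that each (product, price) pair appears only once
combined = pd.concat([branch_a, branch_b], ignore_index=True)
combined = combined.drop_duplicates(subset=["product", "price"], keep="first")
print(combined)   # Coke appears once in the combined dataset
```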
Data Archiving
Archiving is the process of storing the previously inserted data as a backup in case erroneous data is later inserted into the dataset. It allows the system to roll the dataset back to its previous state after a failure and hence avoids duplicates when data migration is taking place.
Example
Consider a product sales dataset for a superstore, maintained by an automated system for a mini-branch that migrates data from one source to another. If the system has a proper recovery setup, it would simply roll back to the previous version of the dataset whenever the migration fails, and the data would remain consistent.
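A bare-bones sketch of the archive-then-migrate idea might look like the following; the file names are placeholders, and migrate stands in for whatever the real migration routine does:

```python
import shutil
from pathlib import Path

DATASET = Path("sales.csv")          # placeholder path to the live dataset
BACKUP = Path("sales_backup.csv")    # placeholder path to the archived copy

def migrate_with_rollback(migrate) -> None:
    """Archive the current dataset, run the migration, and roll back on failure."""
    shutil.copy(DATASET, BACKUP)      # archive the current state first
    try:
        migrate(DATASET)              # hypothetical migration step
    except Exception:
        shutil.copy(BACKUP, DATASET)  # restore the archived copy
        raise
```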
Measures to Eliminate Data Duplication
Assuming that no measures were taken to avoid data duplication, we can still take care of duplicated data using certain techniques. These measures are as follows:
Duplicate Elimination
One way to deal with duplicate data is to remove it entirely. Duplicates are dropped according to some defined rule, e.g. keeping the latest value, the first value, or some other value chosen by the rule, and dropping the rest.
Example
Assuming that the price of a product is added on a regular basis and only the updated values are required, we can perform duplicate elimination on the product names by keeping only the most recent value and dropping the rest.
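In pandas, this pattern is typically a sort followed by drop_duplicates; the products, prices, and dates below are invented for the sketch:

```python
import pandas as pd

# Hypothetical price history in which every product appears several times
prices = pd.DataFrame({
    "product": ["Coke", "Coke", "Fanta", "Fanta", "Fanta"],
    "price": [100, 120, 90, 95, 100],
    "updated": pd.to_datetime(["2024-01-01", "2024-03-01",
                               "2024-01-01", "2024-02-01", "2024-03-01"]),
})

# Keep only the most recent price per product and drop the older duplicates
latest = prices.sort_values("updated").drop_duplicates(subset="product", keep="last")
print(latest)
```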
Linkage Techniques
Another way to eliminate duplicate data is through linkage, i.e. using algorithms to merge duplicate records into one. Common algorithms for this process include edit distance, record linkage, fuzzy matching, etc. These algorithms take the different record values and generate matched pairs of values that resemble each other. The matched entries are then either merged or removed to eliminate the duplicates.
Example
Assuming that the price of a product is added along with the product name and someone by chance enters "Coca-Cola" and "Coka Cola", then using linkage algorithms we can match these values and store them as a single record instead of two separate ones.
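A rough fuzzy-matching sketch can be built with the standard library's difflib; the 0.75 similarity threshold and the names below are arbitrary choices for illustration:

```python
from difflib import SequenceMatcher

names = ["Coca-Cola", "Coka Cola", "Fanta"]

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Generate candidate match pairs whose similarity crosses the threshold
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if similarity(a, b) >= 0.75]
print(pairs)   # [('Coca-Cola', 'Coka Cola')] -- candidates to merge into one record
```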
Data Standardization
Standardization is a technique that keeps the data consistent by storing it in a standard format. It takes the different units of measurement or formats in which a value may be expressed and converts them into a single accepted format for entry.
Example
Assume that the information for the date of the product sale is stored in different formats for different products. We can use data standardization to convert all these dates into a single standard format instead.
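A small sketch using pandas (the date strings are made up) parses each variant and re-emits everything in one ISO format:

```python
import pandas as pd

# Hypothetical sale dates captured in three different formats
raw_dates = ["2024-03-01", "01/03/2024", "March 1, 2024"]

# Parse each value (treating the slash format as day-first) and
# rewrite everything in a single standard format
standardized = [pd.to_datetime(d, dayfirst=True).strftime("%Y-%m-%d") for d in raw_dates]
print(standardized)   # ['2024-03-01', '2024-03-01', '2024-03-01']
```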
Data Mapping
Sometimes there are many common ways to express the same thing. In such cases, we use a set of mappings: if the item being entered into the dataset matches one of the values in a mapping, it is mapped to a single defined value, and that value is stored in the dataset instead of the entered one.
Example
Assume that product sales are kept in a dataset with the product name and its price. For the product "Coca Cola", we know it is also commonly called "Coke", so we can set up a mapping such that whenever someone inserts "Coca Cola" or "Coke", it is always mapped to "Coca Cola". This surfaces all the duplicates present in the dataset, which can then be merged or removed.
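In pandas this can be a simple alias table applied with replace; the alias dictionary below is hypothetical:

```python
import pandas as pd

# Hypothetical alias table: every known spelling maps to one canonical name
aliases = {"Coke": "Coca Cola", "Coca-Cola": "Coca Cola"}

sales = pd.DataFrame({"product": ["Coke", "Coca Cola", "Fanta"],
                      "price": [120, 120, 100]})

# Map aliases to the canonical value; unmapped names are left untouched
sales["product"] = sales["product"].replace(aliases)
print(sales["product"].tolist())   # ['Coca Cola', 'Coca Cola', 'Fanta']
```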
Conclusion
In this article, we went over the impacts, causes, and prevention measures of data duplication in datasets. Next time, I will discuss the techniques I have mentioned in more depth, one by one, with the help of a real dataset. So stay tuned! ✍️✍️
That's it for today! Hope you enjoyed the article and got to learn something from it. Don't forget to comment with your feedback on the approach. Did I miss any of the causes, impacts, or measures? If yes, do share them in the comments.
Thanks for reading! Hope you have a great day! 😄😄