Data Quantity Requirement for ML

A dive into measures for finding data quantity required for machine-learning models

In my previous blog, I gave a brief overview of how the EDA process can be divided into different steps. Today, I will shed light on how much data is required to develop a successful machine-learning model, a question that belongs to the Structural Investigation phase of an EDA. Before we move towards the model development phase, we have to ensure that the given data satisfies our needs, i.e. that it is sufficient in both quantity and quality for the use case at hand. For now, we will focus only on the quantity factor.

Factors Affecting Data Quantity

Data is essential for training and developing a machine-learning model. The amount of data directly influences a model's performance, but how much data a given problem needs depends on several factors. Some of these factors are listed below:

  • The complexity of the problem

  • The complexity of the learning approach

  • The success threshold criteria

The Complexity of the Problem

One factor that influences the need for data quantity is the complexity of the problem at hand. If the problem is relatively simple in nature, it may not need as much data as you would think. By simple, I mean that the number of records and features the model needs to take into account to find a successful pattern is low. A very popular example of such a case is the well-known Iris dataset. The dataset has only 150 records and 5 variables, of which 4 are features and 1 is the target variable. Those 4 features are diverse enough in quality to be sufficient for training a good classification model. If the problem were more complex in nature, this wouldn't have been the case. In a complex problem, the model needs to take many parameters into account for its prediction and thus needs more data to give a good result. An example of this can be found here. The dataset was provided in a community competition to detect spammers on Fiver based on a set of 53 provided features. You can guess that, given the variety of variables gathered, we would need a high number of records to arrive at a model that can find patterns from the provided features. This is visible from the fact that the training set alone contains 458,798 records.
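To make the Iris example concrete, here is a minimal sketch using scikit-learn, which ships the dataset. It shows that a simple, low-complexity problem can be learned well from just 150 records; the exact score will vary with the random split.

```python
# A minimal sketch: training a classifier on the Iris dataset
# (150 records, 4 features + 1 target). Scores vary with the split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)  # trained on roughly 105 records
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```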

The Complexity of the Learning Approach

Another factor that influences the need for data quantity is the complexity of the learning approach being used. For any given problem, we can usually apply a variety of learning approaches to get a good result. The complexity of a learning approach depends on how it deals with the data and on the computation it uses to arrive at its results. Simpler machine-learning algorithms like linear regression, Naive Bayes, etc. can work with relatively small datasets, whereas approaches like GANs, BERT, etc. cannot. This is because deep learning models exhibit more complex behavior than classical machine-learning models due to their high number of parameters.
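As a rough illustration of why more complex learners demand more data, the sketch below counts learnable parameters for a linear model versus a small fully connected network on a 53-feature problem. The layer sizes are assumptions chosen only to make the comparison visible, not taken from any particular solution.

```python
# Rough comparison of parameter counts on a 53-feature problem.
# The hidden-layer sizes below are illustrative assumptions.
n_features = 53

# Linear/logistic regression: one weight per feature plus a bias term.
linear_params = n_features + 1

# A small fully connected network: 53 -> 128 -> 64 -> 1.
hidden1, hidden2 = 128, 64
nn_params = (
    (n_features * hidden1 + hidden1)
    + (hidden1 * hidden2 + hidden2)
    + (hidden2 * 1 + 1)
)

print(f"Linear model parameters: {linear_params}")   # 54
print(f"Small neural net parameters: {nn_params}")   # 15233
```

Every one of those parameters has to be estimated from data, which is one intuition for why deeper models generally need many more records.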

The Success Threshold Criteria

The success criteria are also an important factor in determining the quantity of data needed to train the model. This factor depends on the type of problem being handled. Let's say we are dealing with a machine-learning classification problem of identifying cars on a track. Although the model trained for this case should perform well, we can put it into practice even if it achieves an accuracy of roughly 80-90%. However, if we take the same kind of classification problem and use it to diagnose a life-threatening disease in a patient, 80-90% would backfire, and we would need to get as close to perfect as we possibly can before putting the model into practice. Since the performance a model can reach corresponds directly with the amount of data available, we can say that the success threshold directly affects the quantity of data needed.
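One practical way to connect a success threshold to a data quantity is a learning curve: train on increasing amounts of data and check where validation performance crosses the bar you care about. The sketch below uses a synthetic dataset and a 90% threshold purely as an illustration, so the shape of the curve is not meaningful for any real problem.

```python
# A hedged sketch: use a learning curve to see how much training data is
# needed before validation accuracy crosses a chosen success threshold.
# The dataset is synthetic, so the numbers are only illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

train_sizes, _, val_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

threshold = 0.90  # our hypothetical success criterion
for size, scores in zip(train_sizes, val_scores):
    mean_score = scores.mean()
    flag = "meets threshold" if mean_score >= threshold else "below threshold"
    print(f"{size:>5} training records -> {mean_score:.3f} ({flag})")
```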

Measuring Data Quantity

A thing to keep in mind is that the number of records handed to us in the EDA phase does not represent the data quantity available for the problem under consideration; rather, the data that remains after the preprocessing step gives us insight into its true quantity. Collected data sent to data scientists for EDA and preprocessing can often be looked at from different perspectives depending on the type of problem being dealt with.

Let's say we get a collected dataset for a job survey with the features [id, empname, age, gender, compname, compaddr, jobtitle, hiredate, yearsofexp, salary]. One way of framing a problem with this dataset is "Given these features, determine the salary of this person"; another could be "Given these features, determine which gender is more likely to have them", and so on. Depending on the problem, the data would be cleaned and the resulting dataset used for machine-learning development.

Let's say the dataset had 1,500 records when loaded from the data source and only 600 records are left after preprocessing. For the salary regression problem, one can say this data may not be sufficient to get a good result (assuming each feature covers a diverse set of values), since only 40% of the original data is usable for the model development process. With less diverse features, the same 600 records might well be enough. So, in the end, it depends on the type of problem and the nature of the data being used.
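Here is a small sketch of that check: the usable quantity is whatever survives preprocessing, not what was originally collected. The file name and the specific cleaning steps are hypothetical, following the job-survey columns described above.

```python
# Sketch: measure how much of the collected data survives preprocessing.
# "job_survey.csv" and the cleaning steps are illustrative assumptions.
import pandas as pd

df = pd.read_csv("job_survey.csv")  # assumed raw file with 1,500 rows
raw_count = len(df)

# Example cleaning steps for the salary-regression framing.
clean = (
    df.dropna(subset=["salary", "yearsofexp", "jobtitle"])  # drop incomplete rows
      .drop_duplicates(subset=["id"])                       # remove duplicate respondents
      .query("salary > 0")                                  # discard invalid targets
)

usable_count = len(clean)
print(f"{usable_count} of {raw_count} records usable "
      f"({usable_count / raw_count:.0%} of the collected data)")
```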

General Rule of Thumb

The general rule of thumb for data quantity is that the number of records in the dataset should be at least 10x the number of features in a machine-learning task, but as mentioned previously, this ultimately depends on the different factors discussed today.
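The rule is easy to encode as a quick sanity check; here is a minimal sketch, with the example numbers taken from the datasets discussed above.

```python
# A tiny helper encoding the 10x rule of thumb; treat the result as a
# first sanity check, not a guarantee, for the reasons discussed above.
def meets_ten_x_rule(n_records: int, n_features: int) -> bool:
    """Return True if the dataset has at least 10 records per feature."""
    return n_records >= 10 * n_features

print(meets_ten_x_rule(600, 9))   # cleaned survey example: True (600 >= 90)
print(meets_ten_x_rule(150, 4))   # Iris: True (150 >= 40)
```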

Conclusion

So today, we looked at measures for assessing the data quantity required to train a machine-learning model. We saw that the need for data depends on both quantity and quality; after all, a model trained on many poor-quality records can still perform worse than one trained on a few high-quality records.


That's it for today! Hope you enjoyed the article and got to learn something from it. Don't forget to comment with your feedback on the approach. How do you identify if the data quantity in your datasets is enough for a successful model? Do share your thoughts in the comments.

Thanks for reading! Hope you have a great day! 😄😄
