A few days ago I saw a tweet referring to this question on Reddit: “what’s part of the real job that’s not part of the Kaggle workflow?”.

There are many answers to this question, but one that I’ve had in mind for a long while is this: putting together a dataset. The following tweet echoes the same sentiment:

One of the biggest failures I see in junior ML/CV engineers is a complete lack of interest in building data sets. While it is boring grunt work I think there is so much to be learned in putting together a dataset. It is like half the problem.— Katherine Scott (@kscottz) February 1, 2019

Now let’s go back to the Reddit post.

The reality is, in real-world situations, Step (1) is rarely just that. In fact, I can’t recall a time someone handed me a .csv dataset saying, “you know, you just need to load this & get started.” Nope. Never.

Step (6)… yeah, sure, that’s usually part of the workflow. But models are not the only things you can iterate on. You can also iterate on your data, which means sometimes you have to go back to Step (1) again. Speaking from personal experience, some of the most impactful performance gains come from iterating on the data.

These two things bring me to my point: we need to talk more about putting together a dataset, for two reasons: 1) outside of Kaggle, you often have to build your own dataset first; 2) sometimes you don’t do it just once for the same project, but maybe twice, or thrice, at various points of the project.

Data and where to find them

Here’s the cold hard truth: most of the time, the dataset of your dreams—that mythical, ready-to-use .csv file—might not exist.

You will often face a situation where you have a very limited dataset or, worse, your dataset does not exist yet. Below are some common challenges I can think of, though keep in mind that a) encountering more than one of these challenges in the same project is very possible, and b) whether you’ll come across them or not may depend on your company’s data maturity.

Your company doesn’t know they’d need the data in the future, so they don’t collect it. This is probably less likely in large companies with the resources to have a data lake, but there’s still a possibility that this happens. There’s probably not much you can do other than making the case that you need to collect this data & justifying the budget & time needed to do it. At this stage your persuasion skillz are probably more important than your SQL skillz.

The data that you need do exist in a data lake, but they need to be transformed before you can start using them. Transforming them might land in someone’s backlog & might not get picked up until the next few sprints. Either way, you won’t have the dataset that you want immediately.

You have your data, but they are not labeled. This might be a problem if you are trying to do a supervised learning task. A quick solution might be to hire annotators from Mechanical Turk, but this might not be possible if you have domain-specific data or sensitive data where masking makes the annotation task impossible. I’ve also seen companies list “Data Labeler” as one of their job openings, but you might have to think about whether it makes sense (or is even possible) for your company to hire someone part- or full-time to label your data.

Once you have annotators, you might also want to strategize the kinds of labels that you need so that they stay useful for future cases, saving the cost & time of labeling the same data twice. For example, if you need to label a large set of tweets as “Positive” vs “Negative”, you probably want to anticipate future needs by collecting more granular labels instead (e.g. “Happy”, “Sad”, etc.).
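As a quick (hypothetical) sketch of why granularity pays off: granular labels can always be collapsed into the coarser ones today’s model needs, but not the other way around. The label names below are made up for illustration.

```python
# Hypothetical mapping: fine-grained emotion labels collapse into coarse sentiment labels.
GRANULAR_TO_COARSE = {
    "Happy": "Positive",
    "Excited": "Positive",
    "Sad": "Negative",
    "Angry": "Negative",
}

def to_coarse(granular_label: str) -> str:
    """Collapse a granular annotation into the coarse label today's model needs."""
    return GRANULAR_TO_COARSE[granular_label]

# Today's sentiment model only needs "Positive"/"Negative"...
print(to_coarse("Angry"))  # -> "Negative"
# ...but a future emotion-detection project can reuse the same annotations as-is.
```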

You can always try other approaches, e.g. semi-supervised learning or unsupervised learning. But that’s not always possible, & considering various constraints, sometimes you really need to weigh which one is more worth it, e.g. pursuing a semi-supervised learning approach that you still need to explore vs a supervised learning approach you know better, with the help of annotators. This may depend on various factors: time, budget, etc.

You have your data, but they are weakly labeled. These labels do not necessarily correspond to the target of your model, but you can use them as some sort of proxy for it. For example, you may not have data on whether a user likes an item or not, but perhaps you can infer that information from the number of times the user views the item. Of course, you need to think about whether it makes sense in your case to use this information as a proxy.
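A minimal pandas sketch of the view-count example above; the column names and the threshold are assumptions, and whether view counts are a sensible proxy for “likes” is still a judgement call you have to make for your own data.

```python
import pandas as pd

# Toy interaction log; column names & threshold are made up for illustration.
views = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": ["a", "b", "a", "c"],
    "view_count": [7, 1, 0, 3],
})

VIEW_THRESHOLD = 3  # assumed cut-off; worth validating against any ground truth you have

# Weak label: treat "viewed at least VIEW_THRESHOLD times" as a proxy for "likes the item".
views["weak_label_likes"] = (views["view_count"] >= VIEW_THRESHOLD).astype(int)
print(views)
```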

You have your data, but the target labels are still fuzzy. Some problems are not as straightforward as “this image contains Cheetos” & “this image does not contain Cheetos”. Sometimes stakeholders come up to you and say that they want more granular predictions that they can tweak later on. At this point you may need to work very closely with your business stakeholders to figure out the target labels, & how you can make the data that you have work with such requests.

You think you have your data, but you don’t know where they are. Say you work in a bank. You know you must have transaction data, & there is no way you don’t have user data. However, you may not know where they are, what they look like, the filters you need to apply, or the keys you can use to join the two tables together (hint: it’s not always that simple). Documentation may exist, but the details could still be fuzzy to you. You need to ask someone. But who? You need to find out. You ask them questions. They may or may not respond to your queries quickly because they also have jobs to do. The data might or might not contain the fields that you expect, & it may turn out that getting the dataset you want takes more than a simple join between two tables.

Don’t trust your data right away

Okay, great, you have your data. Can we load them for training now? Not so fast. You may need to spend some time making sure that your dataset is reliable—that it actually contains the things that you expect it to contain. This is a tricky one, I’d say, because the definition of “reliable” differs from case to case, so you really have to define it yourself. Much of this comes down to how well you understand your problem, how well you understand your data, & how careful you are.

Hold up, this is part of the data cleaning (Step 2), isn’t it? We can just drop missing data etc. etc., no? If you refer to most data cleaning tutorials, they make it seem straightforward: you just drop the rows with missing values or impute them with the mean of the column or something, and then you can go to Step 3. But in reality: a) you often get some very funky cases that these tutorials may not cover, b) these funky cases may be symptoms of a larger issue in the engineering pipeline, to the point that you start questioning the reliability of your entire dataset, not just that one particular column. Oh, take me back to the start, sings Coldplay, & off to the start (aka Step 1) you go.
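For reference, the “tutorial” version of Step 2 really is just a couple of lines of pandas (toy data below); the hard part is everything it doesn’t tell you about why those values are missing in the first place.

```python
import pandas as pd

# Toy DataFrame with missing values; column names are made up for illustration.
df = pd.DataFrame({"age": [23, None, 31], "city": ["Bandung", "Jakarta", None]})

dropped = df.dropna()                                         # option 1: drop rows with missing values
imputed = df.assign(age=df["age"].fillna(df["age"].mean()))   # option 2: impute age with the column mean
# Neither line tells you *why* the values are missing, which is often the real question.
```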

Some common pitfalls off the top of my head:

  • Erroneous labels. This is especially common when the labels are human-annotated. Going through the data by hand can help you get a sense of this—even if you don’t manage to go through all of it (which is understandable—scrolling through data for 5 hours might not be the wisest use of your time anyway), you can at least get a sense of the human-level error rate (if you don’t have one yet). Knowing the human-level error can help you calculate what Andrew Ng refers to as the avoidable bias, & knowing the avoidable bias can help you determine your next step (the sketch after this list shows the arithmetic).

  • Missing data. Missing data does not only mean fields with empty values that you can find with df[df["col_1"].isnull()]. Say that you have a column called “Province”. Out of all your data, there is no value that says “West Java”. Does that make sense? Can you trust your data? Sometimes it’s not only about what’s there, but also about what’s not there.

  • Duplicated data. Sounds trivial—we can just do df.drop_duplicates(), no? It depends. If you have images, for example, you might also want to think about what kind of duplication you want to remove (do you only want to remove exact duplicates, or near-duplicates too?). Fun fact: 3.3% and 10% of the images from the test sets of the oft-used CIFAR-10 and CIFAR-100 have duplicates in the training set [1], & this was only discovered fairly recently, in 2019.

  • Values that just don’t make sense. Again, this really depends on the context of your data. Example: you have rows where the registration_date is somehow more recent than the last_transaction_date. Does it make sense? Can we detect these strange values using outlier detection? you might ask. Well, what if most of your values are strange values? You never really know until you take a look at the data yourself (a short sketch of a few of these sanity checks follows after this list).

  • Assumptions. It’s easy (& dangerous!) to make assumptions about data: sometimes we just assume that a field is generated in a certain way, we just assume that these values mean what we think they mean, & the list goes on. Maybe we have dealt with similar tables before, or we have just dealt with similarly named columns in a different table, so we carry these assumptions over to our next data or project. It’s worth sparing some extra time to check the documentation or ask people who may know better before you go too far.
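To make a few of the checks above concrete, here is a minimal pandas sketch on a toy table; the column names, the expected province list, & the error numbers are all made up for illustration.

```python
import pandas as pd

# Toy table loosely based on the examples above.
df = pd.DataFrame({
    "province": ["Jakarta", "Bali", "Jakarta"],
    "registration_date": pd.to_datetime(["2018-01-05", "2018-03-01", "2019-06-30"]),
    "last_transaction_date": pd.to_datetime(["2018-02-10", "2018-02-20", "2019-07-15"]),
})

# Missing data is also about what's *not* there: which expected categories never appear?
expected_provinces = {"Jakarta", "West Java", "Bali"}
print("Provinces never observed:", expected_provinces - set(df["province"]))

# Exact duplicates are the easy part; near-duplicates (e.g. the same image re-encoded)
# need a separate strategy, such as comparing perceptual hashes.
df = df.drop_duplicates()

# Values that just don't make sense: registered *after* the last transaction.
impossible = df[df["registration_date"] > df["last_transaction_date"]]
print(f"{len(impossible)} rows where registration_date is after last_transaction_date")

# Avoidable bias, following Andrew Ng: training error minus (estimated) human-level error.
human_level_error = 0.02  # e.g. estimated from manually reviewing a sample of labels
training_error = 0.08
avoidable_bias = training_error - human_level_error
print(f"Avoidable bias: {avoidable_bias:.2f}")  # a large gap suggests fitting the training set better first
```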

There is no recipe here except to really get to know your data & be critical of it, which means you probably need to spend some time looking at it & slicing & dicing it in many different ways.

I usually set aside some time to manually scan through my data just to get a sense of what’s going on. From such a simple exercise I can learn, for example, that humans typically misclassify certain classes, so the labels related to these classes are probably not reliable, & it’s probably understandable if my model makes a few mistakes on these classes. Andrej Karpathy also did this exercise on the ImageNet data & wrote about what he learned on his blog.

You may have to revisit your dataset multiple times

When it turns out your models do not perform well, there are a few things you can do. It’s important to remember that your options are not limited to tuning your hyperparameters.

One of them is revisiting your dataset. When you do, you may decide to acquire a bigger dataset. You may try various data augmentation strategies, & when you do you still need to be critical of your data & the methods you apply (does it make sense to apply rotation to images of numbers?). You may decide that you need to add more (better) features, but they don’t exist in the tables yet, so you need to talk to your peers who handle this & see if you can get them in time. It’s Step (1) all over again, & that’s fine. It happens. But at least, knowing all this, you’re more prepared now.
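As one concrete illustration of being critical of augmentation choices, here is a small sketch assuming torchvision (my choice of library for the example, not something prescribed here): an augmentation that is harmless for natural photos can silently corrupt labels for digits, because a flipped or heavily rotated “6” starts to look like a “9”.

```python
from torchvision import transforms

# Reasonable for natural photos (e.g. cats vs. dogs): flips & small rotations
# don't change what the image is a picture of.
natural_image_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
])

# For digits, the same transforms can corrupt labels: a flipped "2" is garbage
# & a heavily rotated "6" looks like a "9", so keep the perturbations mild.
digit_augment = transforms.Compose([
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),
])
```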

How can I practice?

If you cannot learn all of this from Kaggle, then how can you learn it by yourself when you don’t have stakeholders & access to company data that comes in all shapes & sizes, with its own mishaps, to practice on?

I think building your own side projects outside of Kaggle problems can be a great way to familiarize yourself with these challenges. The most important thing is that you do not start with data you are given; instead, you start by defining your own problem statement & then search for datasets relevant to your problem. If the perfect dataset for your problem doesn’t exist (most likely it doesn’t), then it’s a good time to practice: fetch the data yourself (for example, using the Twitter API), join it with other data sources that you can find in, say, Google Dataset Search or Kaggle Datasets, find ways to use weakly labeled datasets, & get creative with the imperfect data that you have.

Further reading

Some work related to this topic that you might find interesting: