I remember the analytics and machine learning projects I did as an undergraduate. In the final few weeks of the semester, professors would expect us to produce presentations and scientific reports applying the statistical and machine learning concepts taught in class. These projects had one thing in common: we were to choose readily available data sets, usually from Kaggle. Now, don't get me wrong. Kaggle is a fantastic source of public data and a great site for competitions to flex your data science skills. But these data sets are prepared for you; they require minimal pre-processing, and the work leans towards refining your algorithm to achieve that top-N score on the leaderboard. Fresh graduates looking for a job in the data science/analytics field are most likely equipped with solid programming skills, perhaps an internship, and multiple in-school projects. They put together a short and sweet resume, update their LinkedIn and probably their Github repositories, send them out as if they were distributing flyers, attend a couple of interviews, only to receive multiple rejections.
Being rejected is perfectly normal for anyone actively looking for a job. But they need to reflect on those rejections and grow from them. To that end, I present to you the Dunning-Kruger Effect.
The Dunning-Kruger Effect is self-explanatory, but allow me to put it into the context of data science and analytics to help those seeking jobs in this field have the edge over others.
All data scientists and analysts are employed to provide statistical insights and business intelligence to their respective companies. Some companies have a swarm of Excel spreadsheets with thousands or millions of observations, ready for you to provide any sort of visualization or predictive analytics. Some have specific requests about what they want, but some don't. And here is the problem:
Many schools do not train us to formulate data-related business questions and provide solutions to existing products or services. How do we propose data collection methods that solve a business problem from scratch?
Imagine you are told that your customers have been unhappy about your company's services for quite a while. How are you going to solve the problem? What data is required to solve it? What are you going to tell your boss? This is the issue with fresh graduates. We are so used to being spoon-fed readily available data that we tend to overlook one of the most important parts of being a data analyst/scientist: being able to formulate variables from scratch and propose collection methods that help with your post-processing work. There are many angles to this example: from something as simple as collecting customer feedback as free text for natural language processing (NLP), to multiple quantitative and qualitative variables such as ratings, time of complaint, age of customer, etc.
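As a sketch, the variables for this scenario could be drafted as a small table before any collection infrastructure exists. The column names and values below are purely hypothetical, just to show the mix of quantitative, qualitative, and free-text variables you might propose:

```python
import pandas as pd

# Hypothetical schema for customer-feedback data, drafted before collection.
# Each column corresponds to one of the angles mentioned above.
feedback = pd.DataFrame({
    "customer_age": [34, 51, 27],                      # quantitative
    "rating": [2, 1, 3],                               # ordinal, on a 1-5 scale
    "complaint_time": pd.to_datetime(
        ["2021-03-01 09:15", "2021-03-01 18:40", "2021-03-02 11:05"]
    ),
    "free_text": [                                     # raw text for NLP later
        "Waited 40 minutes for support",
        "Billing portal keeps crashing",
        "Staff were friendly but slow",
    ],
})
print(feedback.dtypes)
```

Drafting the schema first forces you to justify each variable against the business question before a single row is collected.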
Math majors place emphasis on understanding and interpreting your statistical tests or model fits. Thus it makes sense to provide data that is designed to have some highly correlated variables for easy inference.
Take the ever-popular Boston Housing Price dataset, for example. It gives you the satisfaction that your models work, with high predictive capability.
My suggestion: Start a small personal project of your own that involves collecting data yourself. Maybe you are interested in predicting the price of plane tickets by extracting prices from multiple airlines. Maybe you are interested in classifying your own e-mail inbox into multiple categories using NLP. There are endless possibilities for applying data science to your daily life. Do not be restricted to the materials your school provides. Real-world problems require real-world data. Last but not least, share your project with the public! You never know if a potential employer might poach you for your work!
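The e-mail classification idea, for instance, can start as small as a handful of labelled messages and a bag-of-words model. A minimal scikit-learn sketch, where the labels and sample messages are entirely made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny hand-labelled sample standing in for a real exported inbox.
emails = [
    "quarterly report deadline moved to friday",
    "team meeting rescheduled, see updated agenda",
    "huge sale this weekend, 50% discount on all items",
    "limited offer: buy one get one free today only",
]
labels = ["work", "work", "promotions", "promotions"]

# TF-IDF features feeding a naive Bayes text classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["flash sale: extra discount on shoes"]))
```

With your own inbox export as training data, the same two-line pipeline becomes a genuinely personal project you can write up and share.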
Suppose you have access to your company's data marts and you import data into your R or Python scripts as a data frame. You have X variables, and it is up to you to perform whatever analysis or machine learning you want. In school, chances are the analysis requirements were laid out for you and the data did not need any pre-processing. Students are so used to being provided with learning materials that they may not realize how imperfect real-world data is. Errors occur in data collected by humans. Some missing entries are inevitable. Spelling errors are bound to appear in free text. Dates from different data sources may come in different formats. The list goes on!
Your data is only as good for analysis or machine learning as its quality allows. Remove those outliers, remove or replace missing data at your discretion, create new variables that you think might help with your project, standardize the dataset where required, and so on. Always assume that the data you are initially presented with requires cleaning!
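A minimal pandas sketch of those cleaning steps, on made-up data with one missing age, one outlier income, and inconsistently spelled city names:

```python
import pandas as pd

# Fabricated messy data for illustration.
df = pd.DataFrame({
    "age": [25, None, 41, 38],
    "income": [52_000, 48_000, 1_000_000, 61_000],
    "city": ["boston", "Boston ", "BOSTON", "boston"],
})

df["age"] = df["age"].fillna(df["age"].median())   # replace missing values
df["city"] = df["city"].str.strip().str.lower()    # standardize free text

# Drop income outliers using the interquartile-range rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df)
```

The exact rules (median imputation, the IQR cutoff) are judgment calls that depend on your business question; the point is that every one of them happens before any modelling.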
My suggestion: Certain public data sources on Kaggle provide datasets that are about as raw as it gets. Take the Boston Airbnb Data, for example: it comes with a CSV file that contains 95 variables! Some have missing data; some variables appear useless, depending on what sort of business question you are trying to answer. So take that dataset as an exercise in pre-processing, and hone your business sense at the same time!
If your code starts with:
# For Python
import pandas as pd
df = pd.read_csv("dataset.csv")

# For R
df <- read.csv("dataset.csv")
Then, just like me, you are manually pulling data from the data marts and doing all your data processing/transformation in your script. This is a basic requirement for data analysts. But manually pulling CSV files out of data marts becomes daunting the more sources you are required to retrieve data from.
Data scientists go beyond merely reading CSV files. They spend much of their pre-processing time not only in their scripts, but also with data engineers, establishing a solid ETL infrastructure for efficiently importing clean, standardized data from multiple sources.
A good ETL infrastructure matters so much that knowledge of database systems and proficiency in SQL querying are essential for data scientists, perhaps equally as important as their machine learning know-how.
Schools usually have independent courses on database systems that give students a great opportunity to hone their SQL skills and to design databases using basic relational concepts. But again, the design scenarios in assignments and projects come with clear-cut requirements of what the "client" wants. In the real world, formulating problems is sometimes harder than solving them. Your business or client requirements may be very vague, and it is up to your creativity to design a database that fits their needs. What variables are required for this table schema? What tables do I need to tackle the client's requirement? Which variables require cleaning and standardization? How can I integrate my scripts with the database to provide on-the-go analytics and machine learning solutions?
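As a toy illustration of those design questions, here is one possible two-table schema for the customer-feedback scenario from earlier, run through SQLite so it works without a database server. The table and column names are my own invention:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    -- One row per customer; attributes that rarely change live here.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        age         INTEGER,
        joined_on   TEXT
    );
    -- One row per piece of feedback, linked back to its customer.
    CREATE TABLE feedback (
        feedback_id  INTEGER PRIMARY KEY,
        customer_id  INTEGER REFERENCES customers(customer_id),
        rating       INTEGER CHECK (rating BETWEEN 1 AND 5),
        complaint_at TEXT,
        free_text    TEXT
    );
""")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Splitting customer attributes from feedback events is one answer to "what tables do I need?"; a vaguer requirement might push you towards a different split entirely.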
Retrieving data from your own database may not be the only importing you need to do. Learning how to retrieve data from APIs is just as essential, and many business problems require it.
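API work usually boils down to building a query URL and parsing a JSON response. The endpoint, parameters, and response shape below are invented purely for illustration, with a canned response standing in for the network call:

```python
import json
from urllib.parse import urlencode

# Hypothetical flight-price API; nothing here is a real service.
BASE_URL = "https://api.example.com/v1/flights"

def build_url(origin, dest, date):
    """Assemble the query string for one flight search."""
    return BASE_URL + "?" + urlencode({"from": origin, "to": dest, "date": date})

def cheapest_fare(response_text):
    """Parse the JSON body a real call would return and pick the lowest fare."""
    flights = json.loads(response_text)["flights"]
    return min(f["price"] for f in flights)

# Canned response standing in for fetching build_url(...) over the network.
sample = '{"flights": [{"price": 310.0}, {"price": 275.5}, {"price": 402.0}]}'
print(build_url("SIN", "BOS", "2021-06-01"))
print(cheapest_fare(sample))
```

Real APIs add authentication, pagination, and rate limits on top of this, but the build-request/parse-response loop stays the same.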
A short disclaimer: I am not an experienced data scientist, so I apologize for any inaccuracies about data pipelines in this part.
My suggestion: Even if you are starting off as a data analyst, pick up some relational database and SQL skills so that you can skip manually opening CSV files for your projects. I have formal education on this topic, but I suggest anyone new to SQL who wishes to pick up this skill head over to Coursera or Udemy. I return to The Complete SQL Bootcamp course on Udemy whenever I wish to refresh my basic SQL for upcoming projects. Most importantly, ask yourself how you are going to integrate your scripts with your SQL database for seamless data importing.
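To make that last question concrete, here is a minimal sketch of querying a SQL database straight into a data frame, using an in-memory SQLite database (with a fabricated table) so it runs anywhere:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stands in for a real data mart
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 120.0), ('north', 80.0), ('south', 200.0);
""")

# No CSV round-trip: the query result lands directly in a DataFrame.
df = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region",
    conn,
)
print(df)
```

Swapping the SQLite connection for a production database connection (e.g. via SQLAlchemy) leaves the `read_sql_query` call essentially unchanged.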
It doesn't matter whether you are a data analyst or a data scientist: your work is close to meaningless if others do not see its value. In school, we are taught to interpret the significance of hypothesis tests through their p-values. Whether you are doing chi-squared tests, Kruskal-Wallis tests, Wilcoxon signed-rank tests, or any other hypothesis test, if your first instinct after conducting it is "Since the p-value is less than 0.05, we reject the null hypothesis and conclude that…", then you are in for bad news. The truth is, if your immediate superior is not technically inclined, he or she doesn't care about p-values or hypothesis testing. He or she doesn't care what statistical test or machine learning algorithm you used, as long as you can answer the business question with the right inference.
Yes, they might appreciate the effort you put into making use of your statistics and machine learning toolbox, but if they do not understand your explanation or see any value in your findings, they might not make full use of your work to improve their existing products or services.
If you can demystify your work such that a teenage kid can understand its significance, chances are your immediate superior will too.
Let's go over a simple example. Suppose you are tasked with comparing customer traffic between Shop A and Shop B in Mall X, both specializing in sports apparel. Naturally, you will want to determine the distribution of the traffic and check for normality, followed by the proper statistical test for your hypothesis. Then you report your findings to your immediate superior.
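That workflow, sketched with SciPy on simulated daily traffic counts (the numbers are fabricated, with Shop A deliberately generated as the busier shop):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated customer counts for 30 days at each shop.
shop_a = rng.normal(loc=120, scale=15, size=30)
shop_b = rng.normal(loc=100, scale=15, size=30)

# Check normality first: a Shapiro-Wilk p-value above 0.05 gives no
# evidence against normality, so a t-test is reasonable.
print(stats.shapiro(shop_a).pvalue, stats.shapiro(shop_b).pvalue)

# Two-sample t-test at the 5% significance level.
t_stat, p_value = stats.ttest_ind(shop_a, shop_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The two paragraphs that follow are about what you say once this script has run, not about the script itself.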
A fresh graduate in statistics might say: "I found that the traffic is normally distributed for both Shop A and Shop B, and after conducting a two-sample t-test at 5% significance, I conclude that the two shops have different average traffic. Looking at the bar chart, Shop A has higher traffic than Shop B."
An experienced analyst, on the other hand, will simplify the findings and add further value: "Statistically speaking, Shop A has higher traffic than Shop B most of the time. But we need to understand why Shop A attracts more customers. Is it the design of the shoes? Is it the interior design of the shop, or is it offering more attractive promotions than Shop B? Let's discuss this with the marketing department and gather their thoughts. Perhaps they can shed some light on how we can take some of Shop A's successful aspects and incorporate them into our product."
See the difference? An experienced analyst knows how to speak in line with what product managers want to hear, without throwing in technical terms to confuse them. Many product managers want short but valuable insights from your findings. Only explain your work process if asked.
My suggestion: Pick up communication soft skills whenever you can outside of work, in the form of courses. If you are still an undergraduate, take up a sales job or maybe some teaching. These are jobs that require you to break information down into bite-sized pieces so that your customer or student understands your explanation.
The Dunning-Kruger Effect is a real phenomenon that commonly occurs among fresh graduates regardless of their discipline. Your school provides adequate resources to prepare you for the workforce, but your learning does not stop in the final days of your undergraduate life. Attend external conferences, enroll in online courses, network with like-minded people. You will be surprised how much you don't know!
This is probably why most data scientist job postings require at least a Master's degree and often a minimum of five years of experience.
Stay hungry for knowledge!