Top 20 questions asked during an interview for a data analyst position

So you’ve been looking to make a career move to data analytics. You’ve probably taken a course and done your homework. You’ve applied to firms for a data analyst position. It took a while, but a firm has responded and shown interest. They have scheduled an interview with you. You begin googling to anticipate how the interview will go. And now, you’re on this page.

Congratulations on getting yourself the interview!

Without further ado, let’s jump right into the questions, shall we?

Q1. What is the difference between statistics and machine learning?

You know there’s a difference. It’s almost obvious. You immediately begin to explain to the interviewer what you know, and you end up saying both too much and too little. You realize the interviewer wants a crisp, quick answer, and your fumbling doesn’t help.

For your answer, we take a look at the work of Bzdok et al. in their Nature Methods paper.

According to them, the difference is that “statistics draws population inference from a sample, machine learning finds generalizable predictive patterns.”

Inference is the process of testing a hypothesis, while prediction is the process of forecasting future behavior. Statistics focuses on possible relationships between variables, expressed as mathematical equations. Machine learning, however, consists of algorithms that can learn from data without relying on rule-based programming.

That said, statistical knowledge is an essential prerequisite for machine learning: it provides the foundation and the key steps needed to perform machine learning effectively.

Q2. What is the difference between analysis and analytics?

It is common to see these words used interchangeably. There is, however, a world of difference. Analysis answers key questions about past events, while analytics forecasts the future based on analyzed data. We use analysis to examine past events, for example to identify whether sales increased or decreased last summer or in any other specific month. Analytics, instead of explaining past events, is performed to predict future events by making educated guesses and extending trends. Analytics can be seen as the application of computational and deductive techniques. For example, using past sales data, we can discern customers’ behavior and come up with strategies to increase sales in the future.

Q3. What was your role as a data analyst in your previous position?

This is the most frequently asked question during an interview. If you have prior experience, cite accomplishments and skills along the following lines:

  • Designing and maintaining databases

  • Collecting data and resolving issues of redundant, incorrect, or missing data

  • Data mining and data wrangling to put the data into a more easily readable format

  • Performing statistical analysis for data interpretation, specifically exploring possible trends or patterns using data visualization methods

  • Preparing reports for colleagues and presenting them during meetings

  • Explaining the significance of findings and their effect on the relevant field

  • Collaborating with computer programmers, engineers, researchers, and other organizational leaders to address business needs based on statistical inferences and forecasting

  • Implementing new data analytics methods

These are the most common roles of a data analyst in any field, so prepare beforehand and show examples where you've performed the above functions. Your answers may vary slightly based on your previous experience as a data analyst. For example, in a research field, data analysts may also have to show contributions to research papers and presentations of relevant findings at seminars. For those without prior experience, citing project work that covers some of the above functions will make a very good impression on the interviewer.

Q4. What is the difference between supervised and unsupervised learning?

Supervised and unsupervised machine learning are two key concepts of data science. It is imperative that you understand these terms and can explain them clearly to the interviewer when asked. The majority of tools used by data scientists to predict future outcomes rely on supervised machine learning techniques.

Methods used in supervised learning require prior supervision to train the model. Model training is done using labeled data with a known outcome (known values of the dependent variable). The trained model is later used to predict the dependent value for new data, given new values of the independent variables. The two major methods used in supervised learning are classification and regression.

Classification refers to assigning data to categories that each have their own specific traits, while regression involves predicting a continuous numeric value from the input variables.

Categories of Supervised Learning

 

On the other hand, unsupervised machine learning does not require labeled data to train the model. There is no prior knowledge of the output values of our data set. The main goal of unsupervised learning is to find the hidden structure in the data. Algorithms used in unsupervised machine learning draw inferences from data without labeled responses. Two major examples of unsupervised machine learning are clustering and association. As the names suggest, clustering algorithms aim to unearth groups within the data set, and association aims to discover relationships among the data points.

Categories Of Unsupervised Learning

Q5. How do you deal with missing data?

Missing data is ubiquitous in virtually any raw data set and is a peril that plagues every data scientist. Missing data can easily skew the conclusions of any analysis performed. That said, certain data holes are better than others. Data missing completely at random (MCAR) and data missing at random (MAR) do not add much bias to the study. However, data missing not at random (MNAR) is a major issue.

A few techniques have been suggested to handle missing data, whether random or non-random:

  • Plan your study and data collection carefully

  • Prepare a manual of operations at the beginning of the study

  • Provide training to all personnel who might be associated with the data

  • Document the missing data, especially when eliminating it

  • Use data analysis methods to handle missing data

  • Adopt common deletion approaches such as listwise (case-wise) or pairwise deletion

  • Mean substitution

  • Imputation

A detailed explanation of these methodologies can be found here. Note that certain methods mentioned above may depend on the field of study or the case, so answer accordingly if an example is given. 
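Two of the approaches listed above, listwise deletion and mean substitution, can be sketched in a few lines of plain Python. This is only a minimal illustration: the `records` data and the helper names `listwise_delete` and `mean_substitute` are hypothetical, and a real project would more likely use a data analysis library.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def listwise_delete(rows):
    """Drop every row that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def mean_substitute(rows, column):
    """Replace missing values in `column` with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

complete_cases = listwise_delete(records)
filled = mean_substitute(mean_substitute(records, "age"), "income")

print(len(complete_cases), "complete rows remain after listwise deletion")
print(filled)
```

Listwise deletion shrinks the data set, while mean substitution keeps every row but reduces the variance of the filled column, so the choice depends on how much data is missing and why.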

Q6. What tools have you used for the purpose of data analysis?

The tools a data analyst needs to be familiar with can be very domain- and project-specific, but here are a handful of general tools data analysts use:

  • A programming language or analytics package, such as Python, R, SAS, Stata, Apache Spark, Tableau Public, etc.

  • Relational databases, such as MySQL, SQL Server, etc.

  • Reverse ETL tools to send data to business applications

  • Tools for extracting, transforming, and loading data, such as CloverETL, Talend, RapidMiner, and Beautiful Soup

  • Hadoop

  • Tools related to data visualization, such as matplotlib and seaborn in Python, and ggplot2 in R

  • Libraries associated with statistical analysis and machine learning, such as scikit-learn (sklearn) in Python

Q7. What is the difference between a sample and the population?

It is important to know whether the data we are dealing with is a population or a sample. A population is the collection of all items of interest in our study, and its size is denoted N. A sample is a subset of the population, and its size is denoted n. We draw conclusions about the population from the sample using inferential statistical methods.
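To make the distinction concrete, here is a minimal sketch that draws a simple random sample from a made-up population and compares the two means. The population values, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random

rng = random.Random(42)                # fixed seed so the sketch is repeatable

population = list(range(1, 1001))      # hypothetical population, N = 1000
sample = rng.sample(population, 100)   # simple random sample, n = 100

N, n = len(population), len(sample)
population_mean = sum(population) / N  # parameter (usually unknown in practice)
sample_mean = sum(sample) / n          # statistic used to estimate the parameter

print(f"N = {N}, n = {n}")
print(f"population mean = {population_mean}, sample mean = {sample_mean}")
```

The sample mean will usually be close to the population mean but not equal to it, and that gap is what inferential statistics quantifies.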

Samples and populations

 

Q8. What is the difference between classification and regression?

Classification and regression are both types of supervised machine learning. Both use a training data set to build a model that predicts the outcome for test or new data sets. The goal of classification is to predict the category of a new observation. In regression, however, we aim to estimate or predict a numeric response or quantity.
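The difference in output type can be shown with a small sketch. It assumes made-up training data (hours studied, exam score, pass/fail) and uses two deliberately simple models: a least-squares line for regression and a one-nearest-neighbor rule for classification. The function names are hypothetical.

```python
# Hypothetical training data: hours studied, exam score, and pass/fail label.
hours = [1, 2, 3, 4, 5, 6]
scores = [35, 45, 50, 62, 70, 80]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

def nearest_neighbor_label(x, y, new_x):
    """Classify new_x with the label of the closest training point (1-NN)."""
    closest = min(range(len(x)), key=lambda i: abs(x[i] - new_x))
    return y[closest]

new_hours = 4.5
slope, intercept = fit_line(hours, scores)
predicted_score = slope * new_hours + intercept                       # regression: a quantity
predicted_label = nearest_neighbor_label(hours, labels, new_hours)    # classification: a category

print(predicted_score, predicted_label)
```

Both models learn from the same labeled training data, but the regression model returns a number and the classifier returns one of the known categories.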

Q9. What is imputation and what are the different types of imputation?

Imputation is a statistical process that replaces missing data with substituted values. With imputation, we do not have to remove variables or observations that contain missing values.

There are many imputation methods utilized in data analysis. Some examples are mentioned below:  

  • Mean imputation

  • Substitution

  • Cold deck imputation

  • Hot deck imputation

  • Nearest neighbor

  • Regression imputation

  • Multiple imputation

  • Single imputation methods like last observation carried forward (LOCF) and baseline observation carried forward

  • Maximum likelihood such as expectation-maximization

Imputation

Q10. What is the KNN imputation method?

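K-nearest neighbor (KNN) is best known as a supervised machine learning approach for classification, and the same idea can be used for imputation. Nearest neighbor (NN) approaches are donor-based methods: the missing value is filled in from the most similar complete observations. KNN imputation replaces a missing value with the average of the measured values of the K nearest neighbors, or with a weighted mean in which the distances to the neighbors determine the weights. Concepts of NN approaches are elucidated here.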
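The idea can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not a reference one: the `rows` data and the `knn_impute` helper are hypothetical, distance is plain Euclidean distance on unscaled features, and the weighted variant uses inverse-distance weights so that closer neighbors count more.

```python
import math

# Hypothetical rows of (height_cm, weight_kg, age); None marks the missing value.
rows = [
    (170.0, 65.0, 30.0),
    (172.0, 68.0, 32.0),
    (168.0, 63.0, 29.0),
    (190.0, 95.0, 50.0),
    (171.0, 66.0, None),   # age is missing for this row
]

def knn_impute(data, target_row, target_col, k=3, weighted=False):
    """Estimate data[target_row][target_col] from the k nearest complete rows."""
    target = data[target_row]
    feature_cols = [c for c in range(len(target))
                    if c != target_col and target[c] is not None]
    donors = []
    for i, row in enumerate(data):
        if i == target_row or row[target_col] is None:
            continue                      # a donor must have the value we need
        if any(row[c] is None for c in feature_cols):
            continue                      # and all features used for the distance
        dist = math.sqrt(sum((row[c] - target[c]) ** 2 for c in feature_cols))
        donors.append((dist, row[target_col]))
    nearest = sorted(donors)[:k]
    if not nearest:
        raise ValueError("no complete donor rows available")
    if not weighted:
        return sum(value for _, value in nearest) / len(nearest)
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

imputed_age = knn_impute(rows, target_row=4, target_col=2, k=3)
weighted_age = knn_impute(rows, target_row=4, target_col=2, k=3, weighted=True)
print(imputed_age, weighted_age)
```

In practice, features are usually scaled before computing distances so that a variable with large units does not dominate the choice of neighbors.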

Q11. What do you mean by normal distribution?

The normal distribution is the most common continuous probability distribution. It is also known as the bell-shaped curve or the Gaussian distribution. In a normal distribution, most of the data occur near the mean, and the frequency dwindles the farther you move away from it. For a normal distribution, we expect the following:

  • For the standard normal distribution, the mean is 0 and the standard deviation is 1

  • Skewness is 0

  • Kurtosis is 3

Normal Distribution
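These properties can be checked numerically. The sketch below assumes a normal distribution with an arbitrary mean of 50 and standard deviation of 10, draws random values from it, and computes the skewness and kurtosis as standardized moments; the helper name `standardized_moment` is hypothetical.

```python
import random

rng = random.Random(0)                       # fixed seed for repeatability
# Draw from a normal distribution with hypothetical mean 50 and SD 10.
samples = [rng.gauss(50, 10) for _ in range(20000)]

n = len(samples)
mu = sum(samples) / n
sd = (sum((x - mu) ** 2 for x in samples) / n) ** 0.5   # population SD

def standardized_moment(values, center, scale, order):
    """Average of ((x - center) / scale) ** order."""
    return sum(((x - center) / scale) ** order for x in values) / len(values)

skewness = standardized_moment(samples, mu, sd, 3)   # close to 0 for normal data
kurtosis = standardized_moment(samples, mu, sd, 4)   # close to 3 for normal data

# Standardizing (z-scores) rescales the data to mean 0 and SD 1.
z_scores = [(x - mu) / sd for x in samples]
z_mean = sum(z_scores) / n

print(round(skewness, 3), round(kurtosis, 3), round(z_mean, 6))
```

Because the values are random, the skewness and kurtosis only approximate 0 and 3; the approximation tightens as the sample grows.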

Q12. Why do we use non-parametric methods? What are some examples?

Nonparametric approaches do not assume that the data follow a particular parametric family of probability distributions. They make fewer assumptions about the nature of the underlying distribution, which is why they are also known as distribution-free methods. They are most commonly used when the population distribution is unknown or when the sample size is small (n < 30). Common nonparametric techniques are listed below:

  • Anderson-Darling test

  • Kaplan-Meier estimator

  • Kendall’s tau

  • Kolmogorov-Smirnov test

  • Kruskal-Wallis test

  • Log-rank test

  • Mann-Whitney U or Wilcoxon rank-sum test

  • McNemar’s test

  • Sign test

  • Spearman’s rank correlation test

  • Wilcoxon signed-rank test
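As one example from the list above, Spearman’s rank correlation can be sketched as the Pearson correlation computed on ranks rather than on raw values, which is why it does not depend on the shape of the underlying distribution. The helper names and the toy data below are assumptions for illustration.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]                     # monotonic but not linear in x

print(pearson(x, y), spearman(x, y))
```

For this monotonic but nonlinear relationship, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.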

Q13. What is the difference between data profiling and data mining?

Data mining is the collective use of quantitative methods (clustering, classification, neural networks, etc.) to extract knowledge in the form of patterns, correlations, or anomalies from large amounts of data. The results obtained are used to predict outcomes.

David Loshin explained data profiling as a process of analyzing raw data for the purpose of characterizing the information embedded within a dataset. The various statistical and analytical algorithms used in data profiling help to gain insight into the content of the dataset and the qualitative characteristics of its values.

Q14. What is the Kmeans algorithm?

Kmeans is an unsupervised clustering approach. Like other clustering approaches, Kmeans is used in exploratory data analysis to uncover any hidden structure in the data. It uses a predefined number (K) of non-overlapping clusters. The algorithm tries to make the data points within each cluster as homogeneous as possible while keeping the clusters as far apart as possible. The steps for Kmeans are mentioned below:

  • Choose the number K of clusters

  • Select K points at random as the initial centroids

  • Assign each data point to the closest centroid

  • Compute and place the new centroid of each cluster

  • Reassign each data point to the new closest centroid, and repeat until no reassignment takes place.
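The steps above can be sketched in plain Python for two-dimensional points. This is a minimal illustration under assumptions: the `points` data, the `kmeans` function name, the `seed`, and the `max_iter` cap are all hypothetical, and the distance is squared Euclidean distance.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick K random points as centroids
    labels = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:               # no reassignment: converged
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters, which is why K-means is often run several times and the best result kept.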

Q15. What is logistic regression?

In many regression problems, the response of interest (Y), or outcome variable, is dichotomous/binary rather than continuous. A model that uses a binary response variable as the outcome is known as the logit model or logistic regression.

Logistic Regression Formula

This logistic function, rather than following a straight line, follows a sigmoid curve. 

Logistic Regression

In statistics, logistic regression is used as a regression approach for inference. In machine learning, however, it is treated as a classification approach.
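The sigmoid curve and its use for classification can be sketched as follows. The coefficients below are hypothetical values chosen only for illustration, not the result of fitting any data, and the 0.5 threshold is a common default rather than a requirement.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """P(Y = 1 | x) for a one-predictor logistic regression model."""
    return sigmoid(intercept + slope * x)

# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -4.0, 1.0

for x in (2, 4, 6):
    p = predict_probability(x, intercept, slope)
    label = 1 if p >= 0.5 else 0       # classification by thresholding at 0.5
    print(x, round(p, 3), label)
```

The model itself outputs a probability, which is the regression view; applying a threshold to that probability gives the class label, which is the classification view.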

Q16. What does data cleansing mean?

Data cleansing, or data cleaning, is an exhausting and yet very necessary step in data analysis. The initial raw data is far from ideal in most cases. It may contain incorrect, incomplete, or duplicated data. Cleaning may involve correcting spelling errors, reconciling changes in the units of the data, and creating new columns based on existing columns.

Data cleaning does not imply eliminating data. Rather, it is a process to enhance the quality of the data. If you feed garbage into a model, your output or final results will also be garbage. Hence, data cleansing is a pivotal step in data science.

Q17. What is data validation and what methods are used in data analysis for performing data validation?

The examination of accuracy and quality of data prior to analysis is called data validation. In other words, we can say it is a form of data cleansing. Here is a list of methods used for data validation:

  • Source system loopback verification

  • Ongoing source-to-source verification

  • Data-issue tracking

  • Data certification

  • Statistics collection

  • Workflow management

Q18. What are some common problems you have faced as a data analyst?

There are a few problems a data analyst may have to face. Most of them are associated with data cleansing. Some examples are mentioned below.

  • Missing values

  • Wrong values

  • Duplicate data

  • Spelling errors

  • No proper information or record of variables

Q19. What is an outlier?

In statistics, outliers are data points that differ significantly from the majority of the other data points. Including outliers can distort the real associations and lead to false interpretations or predictions, which makes them a well-known potential threat. They occur either by chance or through measurement error. Outliers are most commonly detected with the box plot method or the standard deviation method (± 3 SD). Here is an illustration.

Outlier
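Both detection methods can be sketched with Python's standard library. The `data` values and the function names are hypothetical; the box plot rule is assumed to use the usual 1.5 × IQR fences, and the quartiles come from `statistics.quantiles` with its default method, so other tools may compute slightly different cut-offs.

```python
from statistics import mean, quantiles, stdev

# Hypothetical measurements with one suspicious value (95).
data = [10, 11, 12, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

def iqr_outliers(values, k=1.5):
    """Box plot rule: flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def sd_outliers(values, threshold=3.0):
    """Standard deviation rule: flag values more than threshold SDs from the mean."""
    mu, sd = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sd]

print(iqr_outliers(data))
print(sd_outliers(data))
```

The standard deviation rule is less reliable on small samples, because a single extreme value inflates both the mean and the standard deviation it is measured against.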

Q20. Which statistical/ML methods have you worked with most commonly?

The answer to this particular question depends on your work in previous positions. Your answer may prompt follow-up questions about the statistical methods you mention. The most common statistical methods used by a data analyst are:

  • T-test

  • ANOVA

  • Correlation

  • Chi-squared

  • Regression analysis 

    • Simple linear regression

    • Multiple linear regression

    • Polynomial regression

    • Support vector regression

    • Decision tree regression

    • Random forest regression

    • Ridge regression

    • Lasso regression

  • Classification analysis

    • Logistic regression

    • K-nearest neighbors

    • Support vector machines (SVM)

    • Kernel SVM

    • Naïve Bayes

    • Decision Tree Classification

    • Random Forest Classification

  • Cluster analysis

    • K-Means clustering

    • Hierarchical clustering 

  • Re-sampling methods

  • Natural Language Processing

  • Deep learning

    • Artificial neural network

    • Convolutional neural network

  • Reinforcement learning

    • Upper confidence bound

    • Thompson sampling

  • Dimensionality reduction

    • Principal Component Analysis

    • Kernel PCA

    • Linear Discriminant Analysis

All of the above-mentioned techniques are essential for a data analyst to learn. There is a strong possibility that many questions will come from these statistical or machine learning methodologies.

Conclusion

These questions are not exhaustive. For example, I haven't included questions about specific tools. The goal was to address the domain-level questions that interviewers ask. Beyond these questions, here are a few key tips to help you:

  • If you don't understand the interviewer's question clearly, politely ask him or her to rephrase it.
  • Explain your thought process with which you arrived at the answer.
  • Do not lose composure if you don't know the answer to a question. Be honest and say that you do not know the answer.
  • Speak clearly, and at a moderate pace. If you speak too fast, you'll come off as anxious and the interviewer may not catch all the points you've made.

If you're having an online interview, then take a look at this article for interview tips over video conferencing.

About The Author

Dr. Sandeep Kumar Singh received his Ph.D. degree in Public Health from Florida International University, Miami, USA. He specialized in cancer genomics, specifically related to pediatric leukemia and the human leukocyte antigen region. As a postdoctoral researcher at Florida International University and The Ohio State University, he honed his skills in data science. Dr. Singh has more than 35 international presentations and research publications. Apart from data science, Dr. Singh is vastly experienced in immunogenetics. He also received his Master’s degree from Western Kentucky University, Bowling Green, USA. During his studies, he also worked as a volunteer teacher at the Juvenile Detention Centre in Warren County, Kentucky. Dr. Singh is a nature enthusiast who likes gardening and reading autobiographies of cricketers.

Sandeep Singh
