What is Data Science?
“Data science” is the analytical process of discovering insights and trends from data.
Does it seem too simple? Well, that's because it is! Many people believe that Data Science is hard, but that is not the case.
Quite often, the end goal is to inform decisions within organizations and societies. “Data” refers to information that we can capture, observe, and communicate. For example, a sales receipt, the signals sent by a sensor, or a product review are all data points.
Organizations analyze these data points to turn them into insights. As a business, we may want to know “Which store will need the most restocking next week?”. As a nuclear power plant, we may ask “Is the system operating within safe limits?”.
As a customer, I could wonder “Which is the best product according to the Internet?”. Using the methods of science, statistics, and probability, data scientists generate testable hypotheses. The resulting data then serves as evidence to support or refute those hypotheses. The skills and tools that enable large-scale data analysis are a core tenet of data science.
History of Data Science
Humans have systematically recorded and analyzed information since the dawn of history. For instance, the Ishango bone, discovered in 1960, is one of the earliest pieces of evidence of prehistoric data storage and analysis. Historians suggest that cave-dwelling humans marked notches into sticks or bones to keep track of trading activity, supplies, and even lunar calendar cycles.
Fast forward to the modern era. Can you think of a successful business or government that does not analyze data? Since the 1970s, scholars have worked on combining the science of data processing, statistics, and computing. The modern use of the term “data science” is often traced to Dr. William S. Cleveland's 2001 action plan, which motivated universities to design curricula for the field.
In the paper, he called for knowledge-sharing between computer scientists and statisticians. He argued that computer scientists could benefit from statistical knowledge of how to approach data analysis.
Likewise, statisticians might find knowledge about computing environments helpful. Meanwhile, during the "dot-com" bubble of 1998-2000, prices of hard drives and computers crashed. Corporations and governments accumulated more data and computing resources than ever before. Advances in semiconductor technology enabled manufacturers to meet the growing demand. This cycle created the era of “Big Data”, a term used to describe data sets too large and complex to analyze with regular database management tools.
The rise of the Internet during the late 90s and early 2000s is another contributing factor to the Big Data era. It allowed rich multimedia data to propagate. Companies at the time tackled the problem of how best to search across billions of available web pages.
They developed data processing techniques based on parallel computing to address it. These technologies serve as a foundation for the data engineering tools available today. Technologies such as MapReduce and Hadoop allow for efficient computations across Big Data. They also made the theories and tools of data science accessible to a larger community.
Data Science Lifecycle
The field of data science has evolved to become useful in almost every profession. Think of your favorite sports team, your most trusted bank, or your most frequent fast-food chain. They might use a process like the data science lifecycle at some point to create a strategy. Otherwise, they might be at a disadvantage to a capable rival who does! You may wonder “Exactly what IS the data science lifecycle?”.
It is a way of approaching data and extracting insights from it, which informs your decisions. Think of it as an iterative process with the following stages: Understand, Obtain, Clean, Explore, Model, and Interpret.
The Data Science Lifecycle
1. Understanding Objectives
Understand the business objectives by asking relevant questions. At this stage, we may define a tangible end-goal for our analysis. We may want to measure a performance metric or determine the root cause of an event. We would be prescribing actions to the stakeholders based on these measurable criteria.
2. Obtaining Data
Obtain the data required for analysis. The type of data, its source, and its availability may vary depending on the domain of application. The data may be available for download or derived from another data source, or collected from scratch.
We can categorize data as structured if it comes in the form of tables (rows and columns). Unstructured data, such as web pages, URLs, images, and audio, may need pre-processing to convert it into the required format.
3. Cleaning the Data
Clean the data into a format that machines can process. This step is crucial to ensure that the data is relevant and the results are valid. We fix inconsistencies and treat any missing values within the dataset.
We may use scripting languages like R and Python to clean and pre-process the data into a structured format. Tools such as OpenRefine or SAS Enterprise Miner provide implementations of common data cleaning methodologies.
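As a sketch of this step, the snippet below uses pandas (one of the Python tools mentioned above) on a small made-up sales table; the column names and values are purely illustrative.

```python
import pandas as pd

# Made-up raw sales records with typical inconsistencies:
# mixed-case labels, a missing store, a non-numeric price, a duplicate row.
raw = pd.DataFrame({
    "store": ["A", "a", "B", "B", None],
    "units_sold": [10.0, 10.0, None, 25.0, 7.0],
    "price": ["4.99", "4.99", "5.49", "5.49", "oops"],
})

clean = (
    raw.assign(store=raw["store"].str.upper())  # normalize inconsistent labels
       .dropna(subset=["store"])                # drop rows missing a key field
       .assign(price=lambda d: pd.to_numeric(d["price"], errors="coerce"))
       .drop_duplicates()                       # remove repeated records
)
# Treat remaining missing values by imputing the column median
clean["units_sold"] = clean["units_sold"].fillna(clean["units_sold"].median())
print(clean)
```

The same steps (normalizing labels, coercing types, deduplicating, imputing) generalize to most tabular cleaning tasks, whatever the actual columns are.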
4. Exploring the Data
Explore the data and generate hypotheses based on the visual analysis of the dataset. This involves obtaining summary statistics across different dimensions of data. Here, dimension refers to a column in a table.
If you are using Python, you may want to make use of packages such as NumPy, Matplotlib, pandas, or SciPy. For R, you may find ggplot2 or dplyr useful. In this step, data scientists generate testable hypotheses about the problem being solved, which guides the next step.
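As an illustration, here is what computing summary statistics across one dimension might look like with pandas; the dataset is invented for this sketch.

```python
import pandas as pd

# Invented daily sales figures, one row per store per day
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "units_sold": [12, 15, 30, 28, 35],
})

# Summary statistics for one dimension (column) of the data
print(sales["units_sold"].describe())

# Aggregating along another dimension suggests a testable hypothesis,
# e.g. "store B sells more units per day than store A"
per_store = sales.groupby("store")["units_sold"].mean()
print(per_store)
```

A plot of `per_store` (for example with Matplotlib) would make the same comparison visual, which is typically how such hypotheses are first spotted.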
5. Modeling the Data
Model the unknown target variable using the available data. A model takes data as input and outputs a prediction or estimate of the unknown variable. The individual columns or dimensions of the data are known as features (denoted as X). The unknown target variable (denoted as Y) is the quantity we estimate or the outcome we predict.
This could be the future price of a share, the demand for a product in a given week, or the decision to approve a loan. A data scientist also determines what effect each feature has on the output. Selecting and transforming these features is known as feature engineering.
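To make this concrete, the sketch below fits a simple model with scikit-learn on invented data: the features X are an applicant's age and income, and the target Y is a made-up loan decision; none of the numbers come from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features X: [age in years, annual income in $1000s] -- invented values
X = np.array([[25, 30], [40, 85], [35, 60], [23, 20], [50, 120], [30, 25]])
# Target Y: 1 = loan approved, 0 = rejected -- made-up labels
y = np.array([0, 1, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# The fitted model outputs a prediction for a new, unseen applicant
print(model.predict(np.array([[45, 90]])))
```

Any other estimator could stand in for logistic regression here; the point is the shape of the step: features in, prediction of the target out.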
6. Understanding Feature Engineering
An example gives more in-depth insight into feature engineering. For instance, how would you use information about a customer's age and income when approving or rejecting a loan application? Building such models and hand-crafting the features requires a lot of manual effort.
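As a sketch, one hand-crafted answer to that question might combine the raw attributes into new features; the loan amount, the age thresholds, and the feature names here are all hypothetical.

```python
def engineer_features(age, annual_income, loan_amount):
    """Derive features a hypothetical loan model might use
    instead of the raw age and income values."""
    # Ratio features often carry more signal than either raw value alone
    loan_to_income = loan_amount / annual_income
    # Domain-driven flag: is the applicant in their assumed peak earning years?
    peak_earning_years = 1 if 30 <= age <= 55 else 0
    return {
        "loan_to_income": loan_to_income,
        "peak_earning_years": peak_earning_years,
    }

print(engineer_features(age=42, annual_income=80_000, loan_amount=20_000))
```

Every threshold and ratio above encodes a human judgment about the domain, which is exactly the manual effort the text describes.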
Moreover, hand-crafted models often fail to capture trends across large datasets. Recently, research in the field of machine learning (ML) and artificial intelligence (AI) has addressed this problem.
These systems are capable of automatically finding important features in the given data. Analysts use ML/AI to "learn" important features and build accurate models of the target variable. ML/AI models sometimes outperform humans at tasks such as object, face, and voice recognition. An ML/AI model built for one task may also be useful for another related task.
The disadvantages of ML/AI systems are that they need large datasets, computing resources, and time to train. Moreover, the reason why they made a particular prediction or estimate may not be easy to determine (although that is changing). The difference between AI and data science lies in the fact that AI may be extremely effective at building models, but it cannot yet govern the business direction of the larger investigation or reason generally from diverse data.
7. Interpreting the Data
Interpret the results from the model for stakeholders through impactful storytelling. This step could be the most crucial aspect of the lifecycle. We may deliver months of work to stakeholders within the organization that sponsors the study.
This involves not only presenting the results but also prescribing a plan of action. We communicate the most important observations, findings, insights, and recommended actions. This needs to be in a simple and visually impactful format. This requires a combination of ideas from communication, psychology, statistics, and art.
8. Implementing the Data Science Lifecycle
The data science lifecycle sounds simple enough. As we might expect, though, implementing it for a specific problem in a domain may not be straightforward. It may need a fair bit of business understanding, iterative experimentation, clever intuition, and persistence. Competitive markets do not ignore the edge that data science brings to an organization.
The output of a data science model might be subtle - for example, it may answer the question “Where should we place product X on the shelf in a superstore?”. Yet its impact may run into hundreds of thousands of dollars. That explains why data scientists are among the highest-paid professionals! On the same note, nonprofits use data science for social good as well. Data science has informed marketing operations within NGOs.
They may develop personalized incentive models based on donor information. They may also track and streamline their activities using the data science lifecycle. Governments also better the lives of their citizens through data science systems. Predicting crosswalk locations to enhance road safety is one such example. There are a variety of application domains, a perceived shortage of data scientists, and hence large incentives for diving into this field.