How to Become a Data Scientist – Part 1
I cannot say that being a data scientist is the most glamorous job, but I definitely agree that it is one of the most challenging and rewarding ones.
While there is no standard definition of what a data scientist does, the role involves working with data to identify meaningful patterns and insights that would otherwise stay hidden, with the objective of helping businesses make data-driven decisions.
To do this, a data scientist needs a mix of skills and abilities: strong analytical skills with a solid aptitude for maths and statistics and, to apply statistical techniques to the data, programming skills and familiarity with tools and languages such as R and Python. Real-world data adds one more challenge. You will hardly ever get clean, organized data; in most cases it will have gaps and missing context. The data scientist’s job therefore also involves finding the right data, retrieving it from multiple systems, and then wrangling it to make it suitable for analysis.
Most of the time, a data scientist works on business problems, such as identifying customer buying behaviour to improve the shopping experience, or predicting certain behaviour in the future. In this context, it is essential that a data scientist adopts a problem-solving attitude and understands how the business runs in order to derive meaningful results.
Let’s look at some examples of the kinds of problems that data scientists work on.
- Predict Click-through rate: You work with a mobile advertising company, and you have been asked to predict whether a given mobile ad will be clicked. To do so you will work with historical ad data with various attributes such as ad formats, ad sizes, category of products being advertised, the mobile devices where ads are displayed and a ton of other factors. Using all this data you will build a model that predicts the chances of a particular mobile ad being clicked.
- Sentiment analysis: There are many use cases for sentiment analysis. For example, using Twitter data to find out what people in the US think about Indian food. The data scientist in this case will analyse tweets to find the answers. The insights from the tweets will help with market research if you are planning to open an Indian restaurant chain in the US. They could also tell you in which states there would be more demand for Indian food than in others, or what people in the US like or dislike about Indian food, for example, ‘Too spicy’.
- Predict Product Sales: Based on a data set of product features and historical sales of the product, predict the online sales of a consumer product.
- Patient admissions: Identify patients who will be admitted to a hospital within the next year using historical claims data.
- Repeat buyers: For a retail store, customers who return to make purchases after an initial offer or incentive are its most valuable customers. Given the purchase history of thousands of customers over a period of time, a data scientist can predict which customers will return to buy an item when presented with an offer.
Data science has applications in almost every field, be it retail, healthcare, financial services, or sports.
While working on a typical data science problem, a data scientist generally follows these steps:
- Identify the problem statement, for example, predict product sales.
- Gather the data required to perform the analysis/prediction. For example, monthly online sales of the product, advertising campaigns that ran to promote the product, product features, etc. The data may reside in multiple source systems, such as databases, flat files, and APIs, so the data scientist will have to retrieve it from each of them.
- Analyse the data for gaps and errors and prepare it for further analysis. For example, the data scientist may find that some important data points are missing, such as the time of sale for certain records. In such cases, the data scientist will have to make some assumptions about these missing points. Similarly, since the data comes from different systems, the data scientist will have to find the fields that link the different data sets, such as sales data and advertising data.
- Once the data is prepared, the data scientist will split it into a training set and a test set. For example, if we had sales data from 2010 to 2014, we may want to use the data from 2010 to 2013 for developing our prediction model (training set) and, once the model is ready, use it to predict sales for 2014 and compare the results with the actual 2014 data (test set). This way the data scientist can validate how well the model works on data it has not seen.
- Develop the model using the training set. This is the most crucial stage of the data science project, where the data scientist actually works on the data to identify patterns and develop the prediction model/algorithm.
- Test the model: Once the model is ready, the data scientist will make predictions using the model and compare the results with the test data and see how well the model fits the test data.
- Refine the model: Model building is an iterative process. Based on the results, the data scientist will go back to the model and try to make changes/adjustments to improve the results.
- Present the results to business users: Once the model is ready, the data scientist will present the results to business users through data visualization. This means creating interesting and meaningful visualizations that make sense to business people and help justify the findings in a clear and simple manner.
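To make the middle steps of this path more concrete, here is a minimal sketch in plain Python of preparing the data, splitting it by year, developing a model and testing it. Everything in it is an assumption made for illustration: the records are synthetic (not real sales figures), the field names `ad_spend` and `sales` and the helper functions are hypothetical, missing values are filled with the training mean, and the "model" is just a straight line fitted by least squares. A real project would likely use richer data and a more capable model.

```python
# Illustrative sketch only: all data is synthetic and all names are hypothetical.

def make_records():
    """Build made-up monthly records for 2010-2014 (not real sales data)."""
    records = []
    for year in range(2010, 2015):
        for month in range(1, 13):
            ad_spend = 10 + 2 * (year - 2010) + month % 4
            noise = month % 3 - 1  # small deterministic wobble: -1, 0 or 1
            records.append({"year": year, "month": month,
                            "ad_spend": ad_spend,
                            "sales": 50 + 3 * ad_spend + noise})
    # Simulate gaps in the data: two records lose their ad_spend value.
    records[5]["ad_spend"] = None
    records[20]["ad_spend"] = None
    return records


def fill_missing(rows, fallback):
    """Replace missing ad_spend values with a fallback (an explicit assumption)."""
    return [dict(r, ad_spend=fallback if r["ad_spend"] is None else r["ad_spend"])
            for r in rows]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


records = make_records()

# Split by time: 2010-2013 for training, 2014 held back for testing.
train = [r for r in records if r["year"] <= 2013]
test = [r for r in records if r["year"] == 2014]

# Prepare the data: fill gaps using the mean of the known training values.
known = [r["ad_spend"] for r in train if r["ad_spend"] is not None]
train_mean = sum(known) / len(known)
train = fill_missing(train, train_mean)
test = fill_missing(test, train_mean)

# Develop the model on the training set only.
intercept, slope = fit_line([r["ad_spend"] for r in train],
                            [r["sales"] for r in train])

# Test the model: predict 2014 and compare with the actual 2014 values.
errors = [abs(intercept + slope * r["ad_spend"] - r["sales"]) for r in test]
mae = sum(errors) / len(errors)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mae:.2f}")
```

The mean absolute error on the 2014 records is the number to watch during the refine step: if it is too high, the data scientist goes back, adjusts the model or the data preparation, and tests again.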
While performing all the above activities, the data scientist will make use of multiple tools and skills. In the next article, I will walk you through the essential toolkit that every data scientist should have.