I used the approach below for the data analysis in one of my projects, and I hope it helps you understand the subject. I'll soon post the project itself, with a detailed description of the technology used: Python (web scraping for data collection), Hadoop, Spark, and R.
Data analysis is a highly iterative, non-linear process. It is better described as a series of cycles in which information is learned at each step, and that information then informs whether (and how) to refine and redo the step just performed, or whether (and how) to proceed to the next step.
Setting the Scene
A data analysis is narrower than a study. A study includes developing and executing a plan for collecting data, whereas a data analysis presumes the data have already been collected. More specifically, a study includes the development of a hypothesis or question, the design of the data collection process (or study protocol), the collection of the data, and the analysis and interpretation of the data.
Activities of Data Analysis
There are 5 core activities of data analysis:
1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results
1. Stating and Refining the Question
Doing data analysis requires quite a bit of thinking and we believe that when you’ve completed a good data analysis, you’ve spent more time thinking than doing. The thinking begins before you even look at a dataset, and it’s well worth devoting careful thought to your question. This point cannot be over-emphasized as many of the “fatal” pitfalls of a data analysis can be avoided by expending the mental energy to get your question right.
Types of Questions
A descriptive question is one that seeks to summarize a characteristic of a set of data. Examples include determining the proportion of males, the mean number of servings of fresh fruits and vegetables per day, or the frequency of viral illnesses in a set of data collected from a group of individuals.
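A descriptive question usually reduces to a handful of summary computations. The sketch below is only an illustration: the records, the field names (`sex`, `servings_per_day`, `viral_illnesses`) and all values are made up, and it uses nothing beyond the Python standard library.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey records; every value here is made up for illustration.
records = [
    {"sex": "male", "servings_per_day": 3, "viral_illnesses": 2},
    {"sex": "female", "servings_per_day": 5, "viral_illnesses": 0},
    {"sex": "female", "servings_per_day": 2, "viral_illnesses": 1},
    {"sex": "male", "servings_per_day": 4, "viral_illnesses": 1},
    {"sex": "female", "servings_per_day": 6, "viral_illnesses": 0},
]

# Proportion of males in the dataset.
proportion_male = sum(r["sex"] == "male" for r in records) / len(records)

# Mean number of servings of fresh fruits and vegetables per day.
mean_servings = mean(r["servings_per_day"] for r in records)

# Frequency table: how many people reported 0, 1, 2, ... viral illnesses.
illness_counts = Counter(r["viral_illnesses"] for r in records)

print(f"Proportion male: {proportion_male:.2f}")
print(f"Mean servings per day: {mean_servings:.1f}")
print(f"Illness frequencies: {dict(sorted(illness_counts.items()))}")
```

Each printed number only summarizes the data at hand; none of them says anything about why the values are what they are.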
An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables. These types of analyses are also called “hypothesis-generating” analyses because rather than testing a hypothesis as would be done with an inferential, causal, or mechanistic question, you are looking for patterns that would support proposing a hypothesis.
An inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data.
A predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year. In this type of question you are less interested in what causes someone to eat a certain diet, just what predicts whether someone will eat this certain diet. For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables, but what is most important is that income is a factor that predicts this behavior.
A causal question asks whether changing one factor changes another, for example whether eating a diet high in fresh fruits and vegetables causes a reduction in the number of viral illnesses.
A mechanistic question asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses. None of the other question types will tell us, if the diet does indeed cause a reduction in the number of viral illnesses, how it does so.
2. Exploratory Data Analysis
Exploratory data analysis is the process of exploring your data, and it typically includes examining the structure and components of your dataset, the distributions of individual variables, and the relationships between two or more variables. The most heavily relied upon tool for exploratory data analysis is visualizing data using a graphical representation of the data.
There are several goals of exploratory data analysis, which are:
a. To determine if there are any problems with your dataset.
b. To determine whether the question you are asking can be answered by the data that you have.
c. To develop a sketch of the answer to your question.
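The three goals above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the dataset, the column names (`income`, `servings_per_day`), the use of `None` for missing values and the rule that negative servings are impossible are all invented for the example.

```python
from statistics import mean, median

# Hypothetical dataset; None marks a missing value. All values are made up.
rows = [
    {"income": 42000, "servings_per_day": 3},
    {"income": 58000, "servings_per_day": 5},
    {"income": None, "servings_per_day": 2},
    {"income": 75000, "servings_per_day": 6},
    {"income": 31000, "servings_per_day": -1},  # impossible value
    {"income": 66000, "servings_per_day": 4},
]

# Goal (a): look for problems such as missing or impossible values.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}
bad_servings = [r for r in rows
                if r["servings_per_day"] is not None
                and r["servings_per_day"] < 0]

# Goal (b): check that enough usable rows remain to address the question.
clean = [r for r in rows
         if r["income"] is not None
         and r["servings_per_day"] is not None
         and r["servings_per_day"] >= 0]

# Goal (c): sketch an answer by comparing mean servings in the
# lower-income and higher-income halves of the clean rows.
cutoff = median(r["income"] for r in clean)
low = [r["servings_per_day"] for r in clean if r["income"] <= cutoff]
high = [r["servings_per_day"] for r in clean if r["income"] > cutoff]

print(f"Missing values per column: {missing}")
print(f"Rows with impossible servings: {len(bad_servings)}")
print(f"Usable rows: {len(clean)} of {len(rows)}")
print(f"Mean servings, lower income: {mean(low):.1f}")
print(f"Mean servings, higher income: {mean(high):.1f}")
```

With real data the same checks would normally be followed by plots, since visualization is the main tool of exploratory analysis.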
3. Using Models to Explore Your Data
In a very general sense, a model is something we construct to help us understand the real world. But a simple summary statistic, such as the mean of a set of numbers, is not enough to formulate a model. A statistical model must also impose some structure on the data. At its core, a statistical model provides a description of how the world works and how the data were generated. The model is essentially an expectation of the relationships between various factors in the real world and in your dataset. What makes a model a statistical model is that it allows for some randomness in generating the data.
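To make the difference between a summary statistic and a statistical model concrete, here is a minimal sketch. It assumes, purely for illustration, that a Normal distribution is a reasonable model; the data values are made up, and `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist, mean

# Hypothetical observations (e.g. servings per day); values are made up.
data = [2.0, 3.5, 4.0, 4.5, 5.0, 3.0, 4.0, 6.0, 3.5, 4.5]

# A summary statistic alone: just one number, no structure.
sample_mean = mean(data)
print(f"Sample mean: {sample_mean:.2f}")

# A statistical model: assume the data were generated by a Normal
# distribution whose mean and standard deviation are estimated from the data.
model = NormalDist.from_samples(data)

# The model imposes structure, so it makes checkable claims, such as the
# share of values expected to fall within one standard deviation of the mean.
lo, hi = model.mean - model.stdev, model.mean + model.stdev
expected_share = model.cdf(hi) - model.cdf(lo)
observed_share = sum(lo <= x <= hi for x in data) / len(data)
print(f"Expected share within 1 sd: {expected_share:.2f}")
print(f"Observed share within 1 sd: {observed_share:.2f}")

# The model allows for randomness: it can generate new, different datasets.
simulated = model.samples(len(data), seed=1)
print(f"One simulated dataset: {[round(x, 1) for x in simulated]}")
```

The gap between the expected and observed shares is one simple way to see how well the model's expectation matches the data.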
4. Comparing Model Expectations to Reality
Inference is one of many possible goals of data analysis, so it is worth discussing what exactly the act of making an inference involves. It has two parts:
a. Describe the sampling process
b. Describe a model for the population (the data are a sample, that is, a subset, of the population)
Drawing a fake picture: To begin with, we can make some pictures, such as a histogram of the data, and compare them with what the model leads us to expect.
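The comparison between data and model can be sketched as a text histogram. The example below is an assumption-laden illustration: the data are made up and deliberately right-skewed, the Normal model is chosen only for simplicity, and the bin edges are picked by hand.

```python
from statistics import NormalDist

# Hypothetical, right-skewed observations; values are made up.
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14]

# Assumed model: a Normal distribution fitted to the data.
model = NormalDist.from_samples(data)

# Bin edges chosen by hand for this illustration.
edges = [0, 3, 6, 9, 12, 15]
bins = list(zip(edges, edges[1:]))

# Observed count in each bin [left, right).
observed_counts = [sum(left <= x < right for x in data)
                   for left, right in bins]

# Expected count in each bin if the Normal model were true.
expected_counts = [len(data) * (model.cdf(right) - model.cdf(left))
                   for left, right in bins]

for (left, right), obs, exp in zip(bins, observed_counts, expected_counts):
    bar = "#" * obs
    print(f"[{left:>2}, {right:>2})  {bar:<6} observed {obs}  expected {exp:.1f}")
```

Large differences between the observed and expected columns suggest that the model does not describe the data well.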
Reacting to Data: Refining Our Expectations
Okay, so the model and the data don't match very well, as the histogram indicated. So what do we do? Well, we can either
a. Get a different model
b. Get different data
5. Interpreting Your Results and Communication
Communication is fundamental to good data analysis. You gather data by communicating your results, and the responses you receive from your audience should inform the next steps in your data analysis. Those responses include not only answers to specific questions, but also the commentary and questions your audience raises in response to your report.