“The dirty secret of data science / data analysis right now is that what everyone talks about is machine learning, Kaggle competitions, “deep learning”, or other things. This is because it’s sexy and it sounds good. The truth is that 80% of what people actually do is data munging and data visualization.”
Most courses on data analysis try to teach a bit of everything, and end up resembling PhD-level programmes that would take 5-10 years to complete.
For this course, we will take a slightly different approach. Rather than study everything there is about data analysis, we will take a few practical examples and analyse them in Python, learning the mindset you need to work with large datasets.
Pre-requisites: Basic knowledge of Python.
Topics we will cover
1. Introduction to Pandas, SQL, and simple plotting with small datasets.
2. We have a large 4 GB CSV file that is too big to open in an ordinary desktop application. What do we do? We transfer it to a SQL database, of course. Once the data is in a database, it is easy to query and ask questions of.
3. The UK government releases an Excel file with all accidents in the UK over the last 30 years. Though it is only 700 MB, which isn’t very large at all, Excel still fails to open it.
The problem is, the file is too large to analyse by hand. What if we just want to look at accidents that occurred in London in July 2000 and save them to another Excel file? What if we want to know about snow-related accidents in London that killed at least 10 people? Pandas makes this sort of analysis quite easy, and that’s what we’ll look at.
4. Often, the data we want is spread across multiple files, and we have to munge it together to make any sense of it. The analytics for my website are spread over several CSV files. We will see how to combine them to get coherent results.
5. The Enron email dataset. At more than 2.5 GB, it contains thousands and thousands of emails sent by Enron employees before the company went bust. We will look at how to analyse a dataset where the data is spread over thousands of text files in hundreds of folders, and finding anything is like looking for a needle in a haystack.
Specifically, we will find out who sent and received the most emails at Enron, and which domains were most popular.
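To give a flavour of topic 2, here is a minimal sketch of streaming a CSV file into a SQLite database in batches, so the whole file never has to fit in memory. It uses only the Python standard library; the course itself may well use Pandas for this step. The function name `csv_to_sqlite`, the table name, and the batch size are all illustrative assumptions, and the sketch assumes every row has the same number of fields as the header.

```python
import csv
import sqlite3
from itertools import islice


def csv_to_sqlite(csv_path, db_path, table, batch_size=10_000):
    """Stream a CSV file into a SQLite table, batch_size rows at a time.

    Hypothetical helper for illustration. Returns the number of rows inserted.
    """
    conn = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)  # the first row holds the column names
            # Quote identifiers so column names containing spaces still work.
            cols = ", ".join('"{}"'.format(c.replace('"', '""')) for c in header)
            marks = ", ".join("?" for _ in header)
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            total = 0
            while True:
                # Pull at most batch_size rows from the file at once.
                batch = list(islice(reader, batch_size))
                if not batch:
                    break
                conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', batch)
                total += len(batch)
        conn.commit()
        return total
    finally:
        conn.close()
```

Once the rows are in the database, questions become ordinary SQL queries, for example `SELECT COUNT(*) FROM accidents WHERE city = 'London'` against a hypothetical `accidents` table.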
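The kind of filtering described in topic 3 might look roughly like this in Pandas. The tiny DataFrame below is a made-up stand-in for the real accidents file, and the column names (`date`, `city`, `weather`, `casualties`) and the threshold are assumptions chosen for illustration, not the actual layout of the government data.

```python
import pandas as pd

# Made-up sample rows; the real file has different columns and far more data.
accidents = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2000-07-04", "2000-07-19", "2000-08-01", "1999-01-12"]
        ),
        "city": ["London", "Leeds", "London", "London"],
        "weather": ["fine", "rain", "fine", "snow"],
        "casualties": [1, 2, 1, 3],
    }
)

# Boolean masks, one per condition, combined with & (element-wise "and").
in_london = accidents["city"] == "London"
in_july_2000 = (accidents["date"].dt.year == 2000) & (
    accidents["date"].dt.month == 7
)
london_july_2000 = accidents[in_london & in_july_2000]

# Snow-related London accidents at or above a casualty threshold.
min_casualties = 3  # illustrative threshold for the sample data
snow_london = accidents[
    in_london
    & (accidents["weather"] == "snow")
    & (accidents["casualties"] >= min_casualties)
]

# Saving a result back to Excel needs an Excel writer such as openpyxl:
# london_july_2000.to_excel("london_july_2000.xlsx", index=False)
```

Each condition is a separate boolean Series, so new questions can be asked by recombining masks rather than rewriting the whole filter.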
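For topic 5, a first attempt at counting senders, recipients, and domains could be sketched as below, using only the standard library. It assumes each file under the root folder holds one plain-text email with ordinary `From:` and `To:` headers. The function name `count_addresses` is hypothetical, and counting domains across both senders and recipients is one possible reading of "most popular".

```python
import os
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_addresses(root):
    """Walk every file under root and tally senders, recipients and domains.

    Hypothetical helper for illustration; returns three Counter objects.
    """
    senders, recipients, domains = Counter(), Counter(), Counter()
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            # latin-1 can decode any byte sequence, so odd characters in old
            # mail files do not raise a decoding error.
            with open(path, encoding="latin-1") as f:
                msg = message_from_string(f.read())
            pairs = [
                (senders, getaddresses(msg.get_all("From", []))),
                (recipients, getaddresses(msg.get_all("To", []))),
            ]
            for counter, parsed in pairs:
                for _display_name, addr in parsed:
                    if not addr:
                        continue  # skip headers that could not be parsed
                    addr = addr.lower()
                    counter[addr] += 1
                    # Everything after the last "@" is treated as the domain.
                    domains[addr.rsplit("@", 1)[-1]] += 1
    return senders, recipients, domains
```

Calling `senders.most_common(10)` on the result would then list the ten most frequent senders, and likewise for recipients and domains.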