Whether you are a beginner or a proficient data scientist and/or machine learning engineer, there is always a lot to learn from Kaggle. Kaggle is a competition platform that provides a variety of datasets. You can also read the very interesting kernels written by many competitors, which is really helpful for understanding the perspectives of different data scientists.
If you are a beginner, start with a very simple competition such as “the titanic dataset”. Although there is no single, rigid step-by-step recipe for solving a data science problem, it is always useful to know some of the initial steps you should take before tackling any data science task. Some of the steps that I find useful are:
1. Understand the problem : You must be clear on what you are doing, or what you are expected to solve, before attacking any problem. This gives you the big picture and sets some expectations about the results.
2. Collect the requirements : The most important requirement is the data. You need proper information about what data is needed and which sources can provide it. Data types, sizes and sources vary with the problem you are tackling, so you need a clear idea of how to handle this part. Data engineers can help you here; a short loading sketch follows after step 3 below.
3. Know the data architecture : Every organization has its own data architecture, which includes support for data storage (databases – SQL, NoSQL) or data in motion (real-time data access), data governance (data standards, rules and regulations), data stewardship (quality control), data flow, ETL and so on. This gives you a clear sense of how to handle the later data science steps such as data wrangling, feature engineering, hyperparameter tuning and applying the relevant machine learning or deep learning algorithms.
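Steps 2 and 3 are easier to see with a concrete example. Below is a minimal sketch, assuming you have downloaded train.csv and test.csv from the Titanic competition into a local data/ folder; the database connection string and the passengers table in the commented-out branch are purely hypothetical placeholders for an organizational data source, not part of the competition.

```python
import pandas as pd

# Step 2: the most common case on Kaggle -- data shipped as flat CSV files.
# The paths are assumptions; adjust them to wherever you saved the files.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape, test.shape)   # how much data do we have?
print(train.dtypes)              # what type is each column?
print(train.isnull().sum())      # which columns will need cleaning later?

# Step 3: in an organization the same data often sits in a database instead.
# Hypothetical illustration only -- replace with your actual connection details.
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:password@host:5432/analytics")
# train = pd.read_sql("SELECT * FROM passengers", engine)
```

Knowing up front whether the data arrives as flat files, a database table or a real-time stream tells you how much of the later wrangling you can plan for and automate.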
Once you know the problem to solve, have the data in hand and understand the data architecture, you can:
4. Prepare the data dictionary : This is the initial data analysis step, where you understand the parameters and features of your dataset. It requires some domain knowledge; deep expertise is not necessary, but you do need to understand what each term means and why it exists in the dataset. You should also get rid of the data your model does not need. For example, if you have downloaded “the titanic dataset”, the data dictionary looks like this (a small code sketch follows after the variable notes below):
1. Survived: Survival (0 = No, 1 = Yes)
2. Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
3. Sex: Male or Female
4. Age: Age in years
5. SibSp: # of siblings / spouses aboard the Titanic
6. Parch: # of parents / children aboard the Titanic
7. Ticket: Ticket number
8. Fare: Passenger fare
9. Cabin: Cabin number
10. Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Variable Notes
1. Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
2. Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
3. SibSp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
4. Parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
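As a rough sketch of how you might turn this dictionary into something you can inspect programmatically, assuming the train.csv loaded earlier: the choice to drop PassengerId, Ticket and Cabin at the end is only an example of removing data you do not expect the model to use, not a recommendation from the competition itself.

```python
import pandas as pd

train = pd.read_csv("data/train.csv")

# Build a simple data dictionary: type, missing count, cardinality and an
# example value for every column, so you can reason about each feature
# before doing any modelling.
data_dict = pd.DataFrame({
    "dtype": train.dtypes.astype(str),
    "missing": train.isnull().sum(),
    "unique_values": train.nunique(),
    "example": train.iloc[0],
})
print(data_dict)

# Example only: drop columns you decide the model should not use.
train = train.drop(columns=["PassengerId", "Ticket", "Cabin"])
```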
You can further follow “this kernel” that I have created on Kaggle to understand the detailed steps and analysis for the Titanic dataset.