Data Science EP 11 (Practical Exam)
Dataset Description using Orange tool
- Creating Your Workflow
This is your blank workflow in Orange. You are now ready to explore and solve a problem by dragging widgets from the widget menu onto the workflow.
2.1 Problem
The problem we’re looking to solve in this tutorial is the practice problem of Cancer Risk Prediction that can be accessed via this link on Datahack.
2.2 Importing the data files
We begin with the first, necessary step toward understanding our data and making predictions: importing the data.
Step 1: Click on the “Data” tab on the widget selector menu and drag the widget “File” to our blank workflow.
Step 2: Double-click the “File” widget and select the file you want to load into the workflow. Since this article works through the practice problem described above, I will import its training dataset.
Step 3: Once you can see the structure of your dataset using the widget, go back by closing this menu.
Step 4: Now that the raw .csv data is loaded, we need to convert it to a format we can use in our mining. Click on the dotted line encircling the “File” widget, drag, and then click anywhere in the blank space.
Step 5: As we need a data table to better visualize our findings, we click on the “Data Table” widget.
Step 6: Now double-click the widget to view your table.
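For readers who want to see what the “File” and “Data Table” steps amount to outside the GUI, here is a minimal sketch using only the Python standard library. It is not Orange's own API; the sample columns (age, smoker, risk) are hypothetical, since the real dataset's columns are not listed in this article.

```python
import csv
import io

# Hypothetical stand-in for the training file; column names are assumptions.
SAMPLE_CSV = """age,smoker,risk
45,yes,high
30,no,low
52,,high
"""

def load_table(text):
    """Parse CSV text into a header list and a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows

header, rows = load_table(SAMPLE_CSV)
print(header)
for row in rows:
    # An empty string marks a missing value, much like an empty cell in the table view.
    print(row)
```

With a real file you would pass the file's contents instead of SAMPLE_CSV; the empty "smoker" cell in the third row is the kind of missing value handled in the next section.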
Let’s now visualize some columns to find interesting patterns in our data.
3. Preprocess the data to overwrite missing values
In the image above you can see the preprocessed data, where all the missing values have been replaced by the column averages.
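To make the idea of replacing missing values with averages concrete, here is a small sketch in plain Python. It only illustrates the arithmetic for one numeric column; the function name and the sample values are made up for this example and are not taken from the dataset.

```python
def impute_with_mean(column):
    """Replace None entries with the mean of the values that are present."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]

# Hypothetical numeric column with one missing entry.
ages = [40.0, None, 50.0, 60.0]
filled = impute_with_mean(ages)
print(filled)  # [40.0, 50.0, 50.0, 60.0]
```

The mean is computed from the observed values only, so the missing cell does not drag the average down.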
4. Applying and evaluating preprocessed data to different models
Step-1: Select columns
We have now selected a column for further processing.
Step-2: Applying dataset to models
The preprocessed data has been fed to several models: Random Forest, Naive Bayes, Neural Network, and kNN.
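To give a feel for what one of these models does, here is a rough sketch of kNN (k-nearest neighbours) in plain Python. This is an illustration of the general technique, not how Orange implements its kNN widget, and the training points and labels are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data.
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_y = ["low", "low", "high", "high"]

print(knn_predict(train_X, train_y, (1.1, 0.9)))  # low
print(knn_predict(train_X, train_y, (5.1, 5.0)))  # high
```

Because kNN relies on distances, features on very different scales can dominate the vote, which is one reason normalization matters later in this article.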
Step-3: Testing dataset
Here, you can see the scores derived for the different models.
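If you are wondering how scores of this kind are calculated, the sketch below computes accuracy, precision, recall, and F1 for one class from a list of actual and predicted labels. The labels are invented for illustration, and Orange may average such metrics over classes or folds, so its numbers need not match a single-class calculation like this one.

```python
def classification_scores(actual, predicted, positive):
    """Compute accuracy, precision, recall, and F1, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for six test rows.
actual = ["high", "high", "low", "low", "high", "low"]
predicted = ["high", "low", "low", "low", "high", "high"]

result = classification_scores(actual, predicted, positive="high")
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```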
Step-4: Confusion Matrix
We have derived a confusion matrix for the given dataset. With it, we can also see which values a given model classified correctly and which it misclassified.
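A confusion matrix is simply a table of counts, and a few lines of Python are enough to build one. The sketch below uses made-up labels; rows are the actual classes and columns are the predicted classes, so the diagonal holds the correct predictions.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a matrix where entry [i][j] counts rows of actual class i predicted as class j."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical labels for six test rows.
actual = ["high", "high", "low", "low", "high", "low"]
predicted = ["high", "low", "low", "low", "high", "high"]
labels = ["high", "low"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the diagonal; everything else is a misclassification.
correct = sum(matrix[i][i] for i in range(len(labels)))
print("correct:", correct, "misclassified:", len(actual) - correct)
```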
5. Processing of given data and analyzing them
Step-1: Here we have a dataset that needs to be encoded and normalized, and whose missing values need to be handled properly.
As you can see, the data has been processed in the way the examiner wanted.
After all the processing, the dataset will look like the image below.
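Encoding and normalizing can be sketched in a few lines of plain Python. The example below assumes one-hot encoding for a categorical column and min-max scaling to the range 0 to 1 for a numeric column; the article does not say which encoding or normalization options were chosen in Orange, so treat both choices, and the sample values, as assumptions.

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, encoded

def min_max(values):
    """Rescale numeric values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]

# Hypothetical columns.
categories, encoded = one_hot(["no", "yes", "no"])
print(categories, encoded)  # ['no', 'yes'] [[1, 0], [0, 1], [1, 0]]

scaled = min_max([30, 45, 60])
print(scaled)  # [0.0, 0.5, 1.0]
```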
Step-2: Applying dataset to models
The preprocessed data has been fed to several models: Random Forest, Naive Bayes, Neural Network, and kNN.
Step-3: Testing dataset
Here, you can see the scores derived for the different models.
Step-4: Confusion Matrix
We have derived a confusion matrix for the processed dataset. With it, we can also see which values a given model classified correctly and which it misclassified.
6. Power BI
Now I will show you a graph view of our preprocessed dataset.
Conclusion
Orange is a platform that can be used for almost any kind of analysis but, most importantly, for beautiful and easy visuals. In this article, we explored how to visualize a dataset. We also undertook predictive modeling, using Random Forest, Naive Bayes, Neural Network, and kNN models to predict the target for each record.
Hope this tutorial has helped you figure out aspects of the problem that you might not have understood or missed out on before. It is very important to understand the data science pipeline and the steps we take to train a model, and this should surely help you build better predictive models soon!
Final Note:
Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.