Data Science Series EP 5

Sep 22, 2021

Data Preprocessing with Orange Tool

Welcome to the Data Science Blog Series. Do check out my previous blog from the data science blog series here.

Overview:

This blog is the third part of my series on the Orange tool. In it, I discuss how the Orange library in Python can be used to perform data preprocessing tasks such as discretization, continuization, randomization, and normalization with the help of various Orange functions.

In the Orange canvas, add the Python Script widget from the left panel and double-click it to open its editor.

All the scripts are available on the GitHub page.

Discretization:

Discretization is the process of transforming continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the variable, model, or function.
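To make the idea of bins concrete, here is a minimal plain-Python sketch of equal-width discretization. It is not Orange's implementation: the helper `equal_width_bins` and the sample values are made up for illustration. In Orange 3 the comparable preprocessor is, as far as I know, `Orange.preprocess.Discretize` with an `EqualWidth` or `EqualFreq` method, and its handling of bin boundaries may differ from this sketch.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins contiguous, equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # The maximum would otherwise land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous measurements, not taken from any real dataset.
measurements = [1.0, 2.0, 4.0, 5.5, 7.0, 3.2]
bins = equal_width_bins(measurements, 3)
print(bins)  # [0, 0, 1, 2, 2, 1]
```

With a range of 1.0 to 7.0 and three bins, each bin is 2.0 wide, so every value is replaced by the index of the interval it falls into.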

Continuization:

Continuization refers to the transformation of discrete (binary or multinomial) variables into continuous ones. The class described below operates on the entire domain; documentation on Orange.core.transformvalue.rst explains how to treat each variable separately.

The typical use of the class is as follows:

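import Orange
from Orange.preprocess import Continuize
from Orange.preprocess.continuize import DomainContinuizer

data = Orange.data.Table("titanic")  # or in_data inside the Python Script widget
continuizer = DomainContinuizer(multinomial_treatment=Continuize.FirstAsBase)
domain0 = continuizer(data)      # continuized domain
data0 = data.transform(domain0)  # table converted to that domain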

Continuize_Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and the others will be zero. This is the default behavior.

For example, as shown in the code snippet below, the “titanic” dataset has a feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”.
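The indicator treatment described above can be sketched in plain Python. This is only an illustration of the encoding, not Orange's code: the helper `indicator_columns` is a hypothetical name, and the value list is copied from the “status” example in the text.

```python
def indicator_columns(name, values, value):
    """Return one indicator per possible value; only the matching one is 1.0."""
    if value not in values:
        raise ValueError(f"{value!r} is not a value of {name!r}")
    return {f"{name}={v}": (1.0 if v == value else 0.0) for v in values}


# Values of the "status" feature, in the order given in the text.
status_values = ["crew", "first", "second", "third"]

# A row whose status is "first" gets a single 1.0 in the "status=first" column.
row = indicator_columns("status", status_values, "first")
print(row)
# {'status=crew': 0.0, 'status=first': 1.0, 'status=second': 0.0, 'status=third': 0.0}
```

One discrete column becomes four continuous columns, and exactly one of them is set to one for each row.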

Normalization:

Normalization scales the values of an attribute so that they fall in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are on different scales; otherwise, an equally important attribute on a smaller scale can lose its effect because other attributes have values on a larger scale. We use Normalize from the Orange library to perform normalization.
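As a rough illustration of what scaling to a fixed range means, here is a plain-Python sketch of normalization by span (min-max scaling). The helper `normalize_by_span`, its `zero_based` flag, and the sample numbers are assumptions made for this example; Orange's `Normalize` preprocessor also supports other schemes, such as normalizing by standard deviation, which this sketch does not cover.

```python
def normalize_by_span(values, zero_based=True):
    """Scale values linearly to 0.0..1.0 (zero_based) or to -1.0..1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    if zero_based:
        return [(v - lo) / span for v in values]
    return [2 * (v - lo) / span - 1 for v in values]


# Two hypothetical attributes on very different scales.
ages = [20, 30, 40, 60]
incomes = [20000, 30000, 40000, 60000]

print(normalize_by_span(ages))                    # [0.0, 0.25, 0.5, 1.0]
print(normalize_by_span(incomes))                 # [0.0, 0.25, 0.5, 1.0]
print(normalize_by_span(ages, zero_based=False))  # [-1.0, -0.5, 0.0, 1.0]
```

After scaling, both attributes cover the same range, so neither dominates simply because its raw numbers are larger.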

Randomization:

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. Randomize from the Orange library is used to perform randomization.
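The sketch below shows one way such shuffling can work: the values of a single column are permuted across rows while the other columns stay in place. It is a plain-Python illustration under my own assumptions, not Orange's implementation; `shuffle_column`, the `seed` argument, and the tiny table are hypothetical. As far as I know, Orange's `Randomize` shuffles the class values by default and can also be told to shuffle attributes or metas.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows in which the values of one column are permuted across rows."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Build new dicts so the original table is left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


# A tiny made-up table with one feature and one class column.
rows = [
    {"status": "first", "survived": "yes"},
    {"status": "crew", "survived": "no"},
    {"status": "third", "survived": "no"},
    {"status": "second", "survived": "yes"},
]

shuffled = shuffle_column(rows, "survived", seed=42)
for row in shuffled:
    print(row)
```

The shuffled table keeps the same set of class values, but they are no longer tied to their original rows, which is useful for checking whether a model learns anything beyond chance.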

Python Scripts Files:

Data-Science-Series/Practical 6 at main · Rushi-45/Data-Science-Series

github.com

Conclusion:

You should now be familiar with more features of the Orange tool related to data preprocessing.

Do check out more features of the Orange tool here.

Previous blogs about the Orange tool: Blog1 & Blog2.

LinkedIn:

Rushi Chudasama - Chandubhai S. Patel Institute of Technology - Bharuch, Gujarat, India | LinkedIn

I am a student pursuing my B.Tech 4th year in Information Technology at Charotar Institute of Science and Technology. I…

www.linkedin.com

More Projects and Blogs:

Rushi-45 - Overview

github.com

Blogs:

Rushi Chudasama - Medium

rushi-positive.medium.com

Final Note:

Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.