Data Science Series EP 5

Rushi Chudasama
4 min read · Sep 22, 2021

Data Preprocessing with Orange Tool

Welcome to the Data Science Blog Series. Do check out my previous blog from the data science blog series here.

Overview:

This blog is the third part on the Orange tool. In it, I discuss how the Orange library in Python can be used to perform data preprocessing tasks such as discretization, continuization, randomization, and normalization, with the help of various Orange functions.

In the Orange tool's canvas, drag the Python Script widget from the left panel onto the canvas and double-click it to open its editor.

4.1 Python Script Widget

All the scripts are available on the GitHub page.

Discretization:

Discretization is the process through which we can transform continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.

4.2 Discretization using Python Script
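To make the idea of contiguous bins concrete, here is a minimal plain-Python sketch of equal-width binning. It is not Orange's implementation; the function name `equal_width_bins` and the sample values are made up for illustration, and Orange may use a different binning method.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins contiguous, equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # Clamp the index so that the maximum lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up sample of a continuous feature (hypothetical values).
sample = [4.3, 5.0, 5.8, 6.4, 7.9]
bin_indices = equal_width_bins(sample, 3)
print(bin_indices)
```

Each value is replaced by the index of the interval it falls into, so the continuous feature becomes a discrete one with three possible values.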

Continuization:

Continuization refers to the transformation of discrete (binary or multinomial) variables into continuous ones. The class described below operates on the entire domain; documentation on Orange.core.transformvalue.rst explains how to treat each variable separately.

The typical use of the class is as follows:

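import Orange
from Orange.preprocess import Continuize

# Assumes Orange 3 naming; in Orange 2.x these were DomainContinuizer and LowestIsBase.
data = Orange.data.Table("titanic")
# Use the first value of each discrete variable as the base (no indicator for it).
continuizer = Continuize(multinomial_treatment=Continuize.FirstAsBase)
data0 = continuizer(data)  # new table with continuized features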
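What "lowest value is the base" means can be sketched without Orange: every value except the first one gets its own 0/1 column, and the first value is represented by all zeros. The helper `indicators_with_base` and the feature "size" are hypothetical, chosen only for illustration.

```python
def indicators_with_base(value, categories, name):
    """Indicator columns for every category except the first, which is the base."""
    return {f"{name}={c}": (1.0 if c == value else 0.0) for c in categories[1:]}


sizes = ["small", "medium", "large"]  # hypothetical discrete feature
base_row = indicators_with_base("small", sizes, "size")
other_row = indicators_with_base("large", sizes, "size")
print(base_row)
print(other_row)
```

A variable with three values therefore yields two continuous columns instead of three.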

Continuize.Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and the others will be zero. This is the default behavior.

For example, as shown in the code snippet below, the dataset “titanic” has the feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”.

4.3 Continuization using Python Script
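The indicator treatment described above can be sketched in a few lines of plain Python. This only illustrates the idea and is not how Orange does it internally; `to_indicators` is a hypothetical helper, while the feature name and value order are the ones from the titanic example.

```python
def to_indicators(value, categories, name):
    """Replace one discrete value with 0/1 indicator columns, one per category."""
    return {f"{name}={c}": (1.0 if c == value else 0.0) for c in categories}


status_values = ["crew", "first", "second", "third"]  # order as given in the text
row = to_indicators("first", status_values, "status")
print(row)
```

For a row whose status is “first”, only “status=first” is one and the other three columns are zero.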

Normalization:

Normalization scales the values of an attribute so that they fall in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are on different scales; otherwise, an equally important attribute on a smaller scale may have its effect diluted by attributes with values on a larger scale. We use the Normalize function to perform normalization.

4.4 Normalization using Python Script
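As a rough illustration of scaling to the two ranges mentioned above, here is a plain-Python min-max sketch. It is my own minimal version, not Orange's Normalize; the function name `normalize_by_span`, its `zero_based` flag, and the sample columns are assumptions made for this example.

```python
def normalize_by_span(values, zero_based=True):
    """Min-max scale values to [0, 1] (zero_based=True) or to [-1, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    scaled = [(v - lo) / span for v in values]
    if zero_based:
        return scaled
    return [2.0 * s - 1.0 for s in scaled]


# Two hypothetical attributes on very different scales.
ages = [20, 30, 40, 60]
incomes = [20000, 50000, 80000, 120000]

ages_01 = normalize_by_span(ages)
incomes_01 = normalize_by_span(incomes)
ages_pm1 = normalize_by_span(ages, zero_based=False)
print(ages_01, incomes_01, ages_pm1)
```

After scaling, both attributes span the same range, so neither dominates a distance computation just because of its units.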

Randomization:

With randomization, given a data table, the preprocessor returns a new table in which the data is shuffled. The Randomize function from the Orange library is used to perform randomization.

4.5 Randomization using Python Script
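The "new table with shuffled data" idea can be sketched in plain Python by shuffling one column across rows while leaving the rest untouched. This is only an illustration under my own assumptions: `shuffle_column`, the `seed` parameter, and the tiny table are hypothetical, and which parts of the table Orange's Randomize shuffles depends on how it is configured.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows in which the values of one column are shuffled across rows."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Build new dicts so the original table is left unchanged.
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


table = [
    {"age": 22, "survived": "yes"},
    {"age": 35, "survived": "no"},
    {"age": 41, "survived": "no"},
    {"age": 58, "survived": "yes"},
]
randomized = shuffle_column(table, "survived", seed=42)
print(randomized)
```

The shuffled column keeps the same set of values, but its link to the other columns is broken, which is useful for checking whether a model has learned anything beyond chance.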

Conclusion:

You should now be familiar with more of the Orange tool's features for data preprocessing.

Do check out more features of the Orange tool here.

Previous blogs about Orange tool Blog1 & Blog2.

Final Note:

Thanks for reading! If you enjoyed this article, please hit the clap 👏 button as many times as you can. It would mean a lot and encourage me to keep sharing my knowledge. If you like my content, follow me on Medium; I will try to post as many blogs as I can.
