One Hot Encoding – OneHotEncoder()

Categorical data consists of labels rather than numerical values. One hot encoding is a method of converting categorical data to numerical data: each label becomes a vector with a 1 in the position of its category and 0 everywhere else.

from numpy import array 
from numpy import argmax 
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder

myData = [ 'dog' , 'cat' , 'sheep' , 'lizard' , 'lizard' , 'cat' , 'lizard' , 'dog' , 'dog' , 'cow' ]

# convert to an array
myData = array(myData)

# encode as integers
myData_encoder = LabelEncoder()
myData_encoded =  myData_encoder.fit_transform(myData) 
print (myData_encoded)

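# binary encode
# return a dense array; on scikit-learn older than 1.2 this parameter is called sparse
onehot_encoder = OneHotEncoder(sparse_output=False)
# reshape the 1D array into a single column, since OneHotEncoder expects 2D input
myData_encoded = myData_encoded.reshape(len(myData_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(myData_encoded)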

print(onehot_encoded)
 
[2 0 4 3 3 0 3 2 2 1]

[[0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0.]]
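
To make it clearer what the two encoders are doing, the same output can be reproduced by hand with NumPy. This is only an illustrative sketch, not how scikit-learn implements it; the variable names (labels, classes, codes, onehot) are my own.

```python
import numpy as np

labels = np.array(['dog', 'cat', 'sheep', 'lizard', 'lizard',
                   'cat', 'lizard', 'dog', 'dog', 'cow'])

# np.unique sorts the distinct labels alphabetically and, with
# return_inverse=True, also returns the integer code of each sample
classes, codes = np.unique(labels, return_inverse=True)
print(classes)  # the sorted distinct labels
print(codes)    # one integer per sample

# Row i of the identity matrix is the one hot vector for code i,
# so indexing the identity matrix with the codes encodes every sample
onehot = np.eye(len(classes))[codes]
print(onehot)
```

The integer codes follow the alphabetical order of the labels (cat, cow, dog, lizard, sheep), which is why 'dog' is encoded as 2 in the output above.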

To retrieve the original label of a sample, for example each of the first three samples, we can use:

inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[0,:])]) 
print(inverted)
inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[1,:])]) 
print(inverted)
inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[2,:])]) 
print(inverted)
['dog'] 
['cat']
['sheep']

As can be seen above, with [0,:] for example, we are selecting the first row (the first sample) and asking for the index of the largest value in that row (by using argmax). In a one hot vector that is the position of the 1. In a neural network this is very useful because the output is typically a vector of probabilities, one per label, and argmax gives the label with the highest probability of being correct.
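
As a small sketch of the neural network case, assume a model has produced one probability per label for a single sample. The probability values below are made up for illustration, and the label order is assumed to be the alphabetical order used by the encoder.

```python
import numpy as np

# Labels in the order of the encoded columns (assumed alphabetical)
classes = np.array(['cat', 'cow', 'dog', 'lizard', 'sheep'])

# Hypothetical network output for one sample: one probability per label
probs = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

# argmax returns the index of the largest value, not the value itself
best = int(np.argmax(probs))
predicted = classes[best]

print(best)       # index of the most probable label
print(predicted)  # the label at that index
```

Here the largest probability is at index 2, so the predicted label is 'dog'.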

OneHotEncoder()

Some of the code above is deprecated and has been, or is being, replaced by using OneHotEncoder() on its own. OneHotEncoder() accepts string labels directly, so the LabelEncoder step is not needed. The following example uses it to produce the same results as above.

from numpy import array
from sklearn.preprocessing import OneHotEncoder

myData = [ 'dog' , 'cat' , 'sheep' , 'lizard' , 'lizard' , 'cat' , 'lizard' , 'dog' , 'dog' , 'cow' ]
myData = array(myData)

# make a 2D array
myData = myData.reshape(len(myData), 1) 

# set up the OneHotEncoder
# return a dense array; on scikit-learn older than 1.2 this parameter is called sparse
enc = OneHotEncoder(sparse_output=False)

# One hot Encode the data
myData = enc.fit_transform(myData)

# display() is provided by Jupyter/IPython; use print() in a plain script
display(myData.shape)
display(myData)

inverted = enc.inverse_transform(myData)

print (inverted)

print(inverted[0])
print(inverted[1])
print(inverted[2])
(10, 5)
array([[0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 1., 0., 0., 0.]])

[['dog']
['cat']
['sheep']
['lizard']
['lizard']
['cat']
['lizard']
['dog']
['dog']
['cow']]

['dog']
['cat']
['sheep']
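
To check which column of the encoded array belongs to which label, the fitted encoder exposes a categories_ attribute. The sketch below assumes scikit-learn is installed; it keeps the default sparse output and converts it with toarray(), so it does not depend on the name of the sparse parameter. The variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(['dog', 'cat', 'sheep', 'lizard', 'lizard',
                   'cat', 'lizard', 'dog', 'dog', 'cow']).reshape(-1, 1)

# Default settings return a sparse matrix; toarray() makes it dense
enc = OneHotEncoder()
onehot = enc.fit_transform(labels).toarray()

# categories_ holds one array per input column; there is only one column here
columns = enc.categories_[0]
print(columns)

# Once fitted, the encoder can encode new samples with transform()
sheep_row = enc.transform([['sheep']]).toarray()
print(sheep_row)
```

The position of each label in categories_[0] is the position of its 1 in the encoded rows.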
