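Categorical data consists of labels rather than numerical values. One hot encoding is a method to convert categorical data to numerical data: each label becomes a binary vector with a single 1 in the position assigned to that label and 0 everywhere else.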
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

myData = ['dog', 'cat', 'sheep', 'lizard', 'lizard', 'cat', 'lizard', 'dog', 'dog', 'cow']

# convert to an array
myData = array(myData)

# encode as integers (labels are numbered in alphabetical order)
myData_encoder = LabelEncoder()
myData_encoded = myData_encoder.fit_transform(myData)
print(myData_encoded)

# binary encode; return a dense array instead of a sparse matrix
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)

# reshape the 1D array into a single column, because OneHotEncoder expects 2D input
myData_encoded = myData_encoded.reshape(len(myData_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(myData_encoded)
print(onehot_encoded)
[2 0 4 3 3 0 3 2 2 1]
[[0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 1. 0.]
[1. 0. 0. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0.]]
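To make the two steps above concrete, here is a minimal sketch that reproduces the same encoding by hand with NumPy only. It is an illustration, not how scikit-learn is implemented; the names `classes`, `index_of`, `codes` and `onehot` are chosen for this example.

```python
import numpy as np

labels = ['dog', 'cat', 'sheep', 'lizard', 'lizard',
          'cat', 'lizard', 'dog', 'dog', 'cow']

# Sorted unique labels; LabelEncoder also orders its classes alphabetically
classes = sorted(set(labels))

# Map each label to its integer code
index_of = {label: i for i, label in enumerate(classes)}
codes = np.array([index_of[label] for label in labels])

# Row i of the identity matrix is the one hot vector for class i,
# so indexing with the codes picks one row per sample
onehot = np.eye(len(classes))[codes]

print(classes)
print(codes)
print(onehot[0])
```

The integer codes and the one hot rows match the scikit-learn output shown above, because both number the labels in alphabetical order.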
To retrieve the original label of a sample, for example each of the first three samples, we can use:
# recover the original labels of the first three samples
inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[0, :])])
print(inverted)

inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[1, :])])
print(inverted)

inverted = myData_encoder.inverse_transform([argmax(onehot_encoded[2, :])])
print(inverted)
['dog']
['cat']
['sheep']
As can be seen above, with [0, :] for example, we select the first row (the first sample) and use argmax to ask which position in that row holds the largest value. In a one hot vector that is the position of the 1, and inverse_transform maps that position back to the original label. In a neural network this is very useful: applied to an output vector of class probabilities, argmax indicates which label has the highest probability of being correct.
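As a small illustration of the neural network case, the sketch below applies argmax to a made-up probability vector. The `probabilities` values are hypothetical and do not come from a real model; the class order is the alphabetical order LabelEncoder produces for the data above.

```python
import numpy as np

# Class order as produced by LabelEncoder for the data above (alphabetical)
classes = np.array(['cat', 'cow', 'dog', 'lizard', 'sheep'])

# Hypothetical network output: one probability per class
probabilities = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

# argmax returns the index of the largest value, not the value itself
predicted_index = np.argmax(probabilities)
predicted_label = classes[predicted_index]

print(predicted_index, predicted_label)
```

Here the largest probability sits at index 2, so the predicted label is 'dog'.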
OneHotEncoder()
The LabelEncoder step above is no longer necessary: in current versions of scikit-learn, OneHotEncoder can encode string labels directly. The following is an example of using it to create the same results as above.
from numpy import array
from sklearn.preprocessing import OneHotEncoder
from IPython.display import display  # already available inside Jupyter notebooks

myData = ['dog', 'cat', 'sheep', 'lizard', 'lizard', 'cat', 'lizard', 'dog', 'dog', 'cow']
myData = array(myData)

# make a 2D array (one column, one row per sample)
myData = myData.reshape(len(myData), 1)

# set up the OneHotEncoder to return a dense array
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
enc = OneHotEncoder(sparse_output=False)

# one hot encode the data
myData = enc.fit_transform(myData)
display(myData.shape)
display(myData)

# map the encoded rows back to the original labels
inverted = enc.inverse_transform(myData)
print(inverted)
print(inverted[0])
print(inverted[1])
print(inverted[2])
(10, 5)
array([[0., 0., 1., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 1., 0., 0., 0.]])
[['dog']
['cat']
['sheep']
['lizard']
['lizard']
['cat']
['lizard']
['dog']
['dog']
['cow']]
['dog']
['cat']
['sheep']
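The output above does not show which column of the encoded array belongs to which label. A minimal sketch, assuming the fitted encoder's `categories_` attribute is used for the lookup, and calling `.toarray()` on the default sparse result so the snippet does not depend on the `sparse` / `sparse_output` parameter name:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

myData = np.array(['dog', 'cat', 'sheep', 'lizard', 'lizard',
                   'cat', 'lizard', 'dog', 'dog', 'cow']).reshape(-1, 1)

enc = OneHotEncoder()
# the default result is a sparse matrix; convert it to a dense array
encoded = enc.fit_transform(myData).toarray()

# categories_ holds one array of labels per input column, in column order
column_labels = enc.categories_[0]
print(column_labels)

# the position of the 1 in a row indexes directly into column_labels
first_label = column_labels[np.argmax(encoded[0])]
print(first_label)
```

This gives the same label as inverse_transform for the first sample, while making the column order explicit.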