I'm trying to create dummy variables for a variable that has text data in rows.
The data in the 1st row is:
{"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
and the data in the 2nd row is:
{TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Indoor fireplace","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit","Fire extinguisher",Essentials}
and there are many more rows.
What I want to do now is create dummy variables from that variable. For example, from the data above: one variable named Wireless Internet with 0 and 1 in its rows, another variable named Cable TV with 0 and 1 in its rows, another variable named Kitchen with 0 and 1 in its rows, and so on.
sklearn for Python has a OneHotEncoder class, but it treats the entire contents of a row as a single category and creates one dummy variable per unique row value across all rows. That is not what I want here. I first have to split the text in each row and then create dummy variables for the individual items. How do I do that?
The expected result is multiple columns, like:
Wireless Internet   Cable TV   Kitchen
1                   0          1
0                   1          1
1                   0          1
Link to the data (column named amenities): https://www.kaggle.com/stevezhenghp/airbnb-price-prediction
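
For reference, here is a minimal sketch of the split-then-encode step described above, using only the Python standard library. The helper names parse_amenities and make_dummies are hypothetical, and the sample cells are shortened versions of the two rows quoted above. It assumes each cell is a brace-wrapped, comma-separated list in which items containing spaces or slashes are double-quoted.

```python
import csv


def parse_amenities(cell):
    """Split one '{a,"b c",d}' cell into a list of item names (hypothetical helper)."""
    inner = cell.strip().strip("{}")
    if not inner:
        return []
    # csv.reader handles the double-quoted items correctly
    return next(csv.reader([inner]))


def make_dummies(cells):
    """Return (column names, rows of 0/1 flags), one flag per distinct item."""
    parsed = [set(parse_amenities(c)) for c in cells]
    # One column per distinct item seen in any row, in sorted order
    columns = sorted(set().union(*parsed))
    rows = [[1 if col in items else 0 for col in columns] for items in parsed]
    return columns, rows


# Shortened sample cells (assumption: same format as the amenities column)
cells = [
    '{"Wireless Internet","Air conditioning",Kitchen,Heating}',
    '{TV,"Cable TV",Internet,"Wireless Internet",Kitchen}',
]
columns, rows = make_dummies(cells)
for name, flags in zip(columns, zip(*rows)):
    print(name, list(flags))
```

Each entry in columns becomes one dummy variable, and each inner list in rows holds the 0/1 values for one original row. If the column is already loaded as a pandas Series, Series.str.get_dummies may be a shorter route once the braces and quotes are stripped, though that is not shown here.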