$\begingroup$

I am doing linear regression with multiple variables. My data has n = 143 features and m = 13000 training examples. Some of my features are numeric variables (area, year, number of rooms), but I also have categorical variables (district, color, type). So far I have visualized some of my features against the predicted price. For example, here is the plot of area against predicted price: [image: plot of area against predicted price]

Since area is a numeric variable, I had no trouble visualizing it. But now I want to visualize how the predicted price depends on my categorical variables (such as district). For the categorical variables I used one-hot (dummy) encoding.
For example, data like this:
[image: original data]

was turned into this format: [image: one-hot encoded data]

If I were using ordinal encoding for districts this way:

    DistrictA - 1
    DistrictB - 2
    DistrictC - 3
    DistrictD - 4
    DistrictE - 5

I could plot these values against predicted price quite easily by putting 1-5 on the X axis and price on the Y axis.

But I used dummy coding, and now I do not know how to visualize the dependency between price and the categorical variable 'District' when it is represented as a series of zeros and ones.

How can I make a plot showing a regression line of predicted price against district when using dummy coding?

$\endgroup$

1 Answer

$\begingroup$

One possible first step is to convert the data back to the original coding. In SQL this is called unpivot; in R, melt.

Here is an R example:

    > my.df <- read.table(
    +   text = "DistrictA DistrictB DistrictC DistrictD DistrictE Price
    + 1 0 0 0 0 10000
    + 0 1 0 0 0 20000
    + 0 0 1 0 0 30000
    + 0 0 0 1 0 40000
    + 0 0 0 0 1 50000"
    +   , header = TRUE)
    > my.df
      DistrictA DistrictB DistrictC DistrictD DistrictE Price
    1         1         0         0         0         0 10000
    2         0         1         0         0         0 20000
    3         0         0         1         0         0 30000
    4         0         0         0         1         0 40000
    5         0         0         0         0         1 50000
    > library(reshape)
    > subset(melt(my.df, id="Price", variable = "District"),value == 1)[,c(1,2)]
       Price  District
    1  10000 DistrictA
    7  20000 DistrictB
    13 30000 DistrictC
    19 40000 DistrictD
    25 50000 DistrictE
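If the reshape package is not at hand, the same conversion can probably be done in base R. The following is only a sketch: it assumes the dummy columns share the prefix "District" and that the coding is a full one-hot coding (exactly one 1 per row). The names `dummy.cols`, `dummies` and `long.df` are made up for this illustration.

```r
# Toy data copied from the example above
my.df <- read.table(text = "DistrictA DistrictB DistrictC DistrictD DistrictE Price
1 0 0 0 0 10000
0 1 0 0 0 20000
0 0 1 0 0 30000
0 0 0 1 0 40000
0 0 0 0 1 50000", header = TRUE)

# Names of the dummy columns (assumption: they all start with "District")
dummy.cols <- grep("^District", names(my.df), value = TRUE)
dummies <- as.matrix(my.df[dummy.cols])

# Guard: the trick below is only valid when every row has exactly one 1
stopifnot(all(rowSums(dummies) == 1))

# max.col() returns, per row, the index of the column holding the maximum,
# which is the position of the single 1 in a one-hot row
my.df$District <- factor(dummy.cols[max.col(dummies, ties.method = "first")],
                         levels = dummy.cols)

# Keep only the price and the recovered factor
long.df <- my.df[, c("Price", "District")]
print(long.df)
```

If the model was fitted with one level dropped as the reference (k - 1 dummies), the rows containing only zeros belong to that reference level and would need to be handled separately before using this approach.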

After that, you can plot Price against the factor variable. You may additionally consider ordering the factor levels by predicted price.
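As a rough sketch of that ordering step in base R: the data frame below is invented for illustration, and the `Predicted` column stands in for whatever your fitted model returns. `reorder()` sorts the factor levels by the mean prediction per district, so the plot runs from the cheapest to the most expensive district.

```r
# Made-up illustration data; Predicted is a hypothetical model output
long.df <- data.frame(
  District  = factor(c("DistrictA", "DistrictA", "DistrictB",
                       "DistrictB", "DistrictC", "DistrictC")),
  Price     = c(30000, 34000, 10000, 14000, 20000, 22000),
  Predicted = c(31000, 33000, 11000, 13000, 20500, 21500)
)

# Reorder the factor levels by the mean predicted price within each district
long.df$District <- reorder(long.df$District, long.df$Predicted, FUN = mean)
print(levels(long.df$District))

# Scatter-like plot of the observed prices, one column of points per district
stripchart(Price ~ District, data = long.df,
           vertical = TRUE, method = "jitter", pch = 19)
```

With more than a handful of districts, this ordering usually makes the plot much easier to read than the alphabetical default.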

I give no further details because you didn't tag your tool, but in addition to a scatter plot I would recommend considering a box plot and/or a density plot, always combined with the model's predicted value for each factor level.
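For what it is worth, here is a minimal base R sketch of a box plot with the model prediction overlaid. The prices are invented, and the model uses District as its only predictor, which is a simplification of your 143-feature model. In that simplified case the prediction for each level is just the mean price of that district; with more predictors you would have to decide which values of the other features to predict at.

```r
# Made-up illustration data: three districts with four prices each
long.df <- data.frame(
  District = factor(rep(c("DistrictA", "DistrictB", "DistrictC"), each = 4)),
  Price    = c(28000, 30000, 33000, 37000,
                9000, 11000, 12000, 16000,
               19000, 20000, 22000, 27000)
)

# lm() builds the dummy columns for a factor internally
fit <- lm(Price ~ District, data = long.df)

# One row per district level, then one prediction per level
levels.df <- data.frame(
  District = factor(levels(long.df$District), levels = levels(long.df$District))
)
levels.df$Predicted <- unname(predict(fit, newdata = levels.df))
print(levels.df)

# Box plot of observed prices; boxes sit at x = 1, 2, 3 in level order
boxplot(Price ~ District, data = long.df, ylab = "Price")

# Overlay the model prediction for each level as a red diamond
points(seq_len(nrow(levels.df)), levels.df$Predicted,
       pch = 18, col = "red", cex = 2)
```

There is no single regression line for an unordered factor; the set of per-level predictions plays that role in the plot.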

$\endgroup$
