2
$\begingroup$

I want to understand the steps required when handling a dataset that does not have a target variable. I can do machine learning on a labeled dataset that has a target variable, but I am not sure what the best way is to start with a dataset where there is no target variable.

I need a step-by-step guide to achieve an efficient clustering at the end.

Do I need to do the following to achieve that?

  1. Data Cleansing
  2. EDA
  3. Encoding and scaling
  4. Model build
  5. Validation

Or are there any more steps that I need to take care of when dealing with unlabeled data? I am doing this in Python.

$\endgroup$
3
  • $\begingroup$If you don't have a target variable and want to see how the data (all features) is distributed, you can use a clustering method such as k-means or hierarchical clustering.$\endgroup$Commented Jan 24, 2020 at 15:39
  • $\begingroup$That's correct. But what else comes before that? Do I need to extract the most important features before I start clustering? Do I need to encode all of them before this step? I am trying to understand the flow of steps here: what needs to be done first and what comes next?$\endgroup$Commented Jan 24, 2020 at 15:45
  • 1
    $\begingroup$If you don't have a target variable, it's a bit difficult to identify important features. What I recommend is to look for features with very low or zero variance and drop them. And yes, you have to convert your categorical features using label or dummy encoding.$\endgroup$Commented Jan 24, 2020 at 16:06
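
To make the two suggestions in the last comment concrete, here is a minimal standard-library sketch of dropping zero-variance columns and dummy-encoding a categorical one. The records, the column names (`age`, `country_code`, `plan`) and the helper names are all hypothetical, and the variance threshold of 0 is an assumption; in practice a library such as pandas or scikit-learn would likely be used instead.

```python
from statistics import pvariance

# Hypothetical toy records; the column names are made up for illustration.
rows = [
    {"age": 25, "country_code": 1, "plan": "basic"},
    {"age": 40, "country_code": 1, "plan": "premium"},
    {"age": 31, "country_code": 1, "plan": "basic"},
    {"age": 58, "country_code": 1, "plan": "free"},
]


def drop_low_variance(rows, threshold=0.0):
    """Remove numeric columns whose population variance is <= threshold."""
    numeric = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    low = {k for k in numeric if pvariance([r[k] for r in rows]) <= threshold}
    return [{k: v for k, v in r.items() if k not in low} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded


# country_code is constant, so it is dropped; plan becomes three indicators.
cleaned = one_hot(drop_low_variance(rows), "plan")
print(cleaned[0])
```

Dummy encoding is used here rather than label encoding because distance-based clustering would otherwise treat arbitrary integer labels as ordered quantities.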

2 Answers

4
$\begingroup$

There is no single answer to your question. There is a discrepancy in your question, though:

I am aware that this is a classification problem on which I am working on.

Could you please help me with the right step by step guide that I should follow in order to achieve an efficient clustering at the end?

However, I am assuming that you are trying to do clustering and that you want methods that give you mathematically better clusters.

Clustering is an unsupervised learning problem that does not require a target variable. The steps you mentioned are standard and sound, but there are other steps you should also take care of. Here are a few:

  1. Selection of input features - The input features that go into a clustering algorithm are of great importance. A variable that contains no relevant information (say, each person's telephone number) is worse than useless because it makes the clustering less apparent. In general, selecting “good” variables is a nontrivial task and may involve a fair amount of trial and error.
  2. Selection of clustering algorithm - Choosing a clustering algorithm that suits your data is an important step. For example, K-Means works better with numerical features, K-Modes with categorical features, and K-Prototypes with data that mixes numerical and categorical features.
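
The rule of thumb in point 2 can be sketched as a small helper. This is only an illustration: `suggest_algorithm` is a hypothetical function, and it assumes each record is a dict whose value types reflect the column types. It returns a name only; an actual implementation would come from a library (for example scikit-learn's `KMeans` for the numerical case).

```python
def suggest_algorithm(rows):
    """Suggest a clustering family from the value types in the first record."""
    sample = rows[0]
    # Treat ints and floats as numerical; bools are flags, not quantities.
    numeric = [
        k for k, v in sample.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    categorical = [k for k in sample if k not in numeric]
    if numeric and categorical:
        return "k-prototypes"  # mixed numerical and categorical features
    if categorical:
        return "k-modes"       # categorical features only
    return "k-means"           # numerical features only


print(suggest_algorithm([{"age": 30, "income": 52000.0}]))
print(suggest_algorithm([{"color": "red", "plan": "basic"}]))
print(suggest_algorithm([{"age": 30, "plan": "basic"}]))
```

Inspecting only the first record is a simplification; real data would need a check across all rows and a decision about missing values.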
$\endgroup$
2
  • $\begingroup$Thank you for your answer. Once I am done building the clusters, what is the next thing I need to focus on? Also, do I need to scale my data or encode it in numerical form before clustering, or should this be done after clustering?$\endgroup$Commented Jan 26, 2020 at 15:42
  • $\begingroup$@Django0602 Yes. Feature scaling is an important step before K-Means because K-Means uses a distance measure at its core. Once you have built the clusters, you can perform EDA on them and derive insights from the results.$\endgroup$Commented Jan 26, 2020 at 16:26
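
One simple form of EDA on the finished clusters is a per-cluster profile: the size of each cluster and the mean of each feature within it. The sketch below assumes numeric features and that `labels` came from an earlier clustering step; the data, the column names and `profile_clusters` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def profile_clusters(rows, labels):
    """Return {label: {"size": n, "mean": {feature: mean value}}}."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: {
            "size": len(members),
            "mean": {k: mean(m[k] for m in members) for k in members[0]},
        }
        for label, members in groups.items()
    }


# Hypothetical records and cluster assignments.
rows = [
    {"age": 22, "monthly_spend": 30},
    {"age": 24, "monthly_spend": 50},
    {"age": 61, "monthly_spend": 200},
    {"age": 59, "monthly_spend": 220},
]
labels = [0, 0, 1, 1]  # assumed output of a clustering algorithm

summary = profile_clusters(rows, labels)
for label, stats in summary.items():
    print(label, stats)
```

Comparing these profiles side by side is usually the quickest way to see what distinguishes one cluster from another.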
0
$\begingroup$

If you are using a neural network for classification, here are a few things you can do with the data even if you don't have labels for it. If the data points are real-valued vectors, you can normalize them using the featurewise mean and standard deviation. You can also train an autoencoder on this data (by reconstructing the original input), which will give you a better feature representation that can then be used with k-means clustering or other unsupervised methods.
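
The featurewise normalization step can be sketched with the standard library alone. The data points below are hypothetical (a year and a weight in kg), and `standardize` is an illustrative helper, not a library function; scikit-learn's `StandardScaler` does the same job in practice.

```python
from statistics import mean, pstdev


def standardize(points):
    """Z-score each feature: subtract its mean, divide by its std deviation."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    # A constant column has std 0; fall back to 1.0 to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [
        tuple((x - mu) / s for x, mu, s in zip(p, mus, sigmas))
        for p in points
    ]


# Hypothetical real-valued vectors with very different feature ranges.
points = [(1950, 60.0), (1990, 62.0), (1952, 95.0), (1992, 97.0)]
scaled = standardize(points)

for original, z in zip(points, scaled):
    print(original, "->", tuple(round(v, 3) for v in z))
```

After this step every feature has mean 0 and standard deviation 1, so no single feature dominates the distances computed by k-means.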

$\endgroup$
2
  • $\begingroup$I am not using a neural network. The dataset is simple, though it has a lot of categorical features, so a neural network might consume a lot of memory unnecessarily. Is there any other approach that I can follow?$\endgroup$Commented Jan 24, 2020 at 16:05
  • $\begingroup$You can still normalize it channelwise (per feature) if the features have different ranges. For example, year of birth would be in the range of the 1900s to 2000s, while body weight would be in a different range. You can use the normalized data for clustering, and it should give better results. You can also prune less probable data points from the dataset, if you are sure about them. This can also be used to tackle (possible) class imbalance. If you can describe the kind of classification that you expect to do, maybe we can discuss more.$\endgroup$Commented Jan 24, 2020 at 16:19
