
I am trying to use CountVectorizer to obtain a numerical word representation of my data, which is essentially a list of 160000 English sentences:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('data/train.csv')
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
X = vectorizer.fit_transform(list(df_train.text))

Then printing X:

>>> X
<160000x693699 sparse matrix of type '<class 'numpy.int64'>'
    with 3721191 stored elements in Compressed Sparse Row format>

But converting the whole matrix to a dense array, to get the numerical representation of all the data, gives:

>>> X.toarray()
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11636/854451212.py in <module>
----> 1 X.toarray()

c:\users\crrma\.virtualenvs\humor-detection-2-8vpiokuk\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
   1037         if out is None and order is None:
   1038             order = self._swap('cf')[0]
-> 1039         out = self._process_toarray_args(order, out)
   1040         if not (out.flags.c_contiguous or out.flags.f_contiguous):
   1041             raise ValueError('Output array must be C or F contiguous')

c:\users\crrma\.virtualenvs\humor-detection-2-8vpiokuk\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
   1200             return out
   1201         else:
-> 1202             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1203
   1204

MemoryError: Unable to allocate 827. GiB for an array with shape (160000, 693699) and data type int64

In the example on the linked scikit-learn documentation page, only five sentences are used, so for them X.toarray() returns the dense array of word counts without trouble. But since my dataset contains 160000 sentences, the vocabulary ends up with 693699 entries (unique unigrams and bigrams, due to the ngram_range parameter passed to CountVectorizer), and the conversion runs out of memory.
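For reference, the 827 GiB figure in the error message is just the requested dense shape multiplied by 8 bytes per int64 element:

print(160000 * 693699 * 8 / 2**30)  # ~827 GiB needed for the dense array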

Q1. How can I fix this? I am thinking of simply discarding X and transforming the data in mini-batches instead, as shown below. Is this correct?

>>> X_batch = list(df_train[:10].text)  # do this for 160000 / batch_size batches
>>> X_batch_encoding = vectorizer.transform(X_batch).toarray()
>>> X_batch_encoding
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
>>> X_batch_encoding[0].shape
(693699,)

Q2. I am planning to train a neural network and a decision tree on this encoding for humor detection. But I guess it won't be a great idea to represent a single sentence with a vector of length 693699, right? If so, what should I do instead? Should I use only unigrams when fitting CountVectorizer (even though, unlike bigrams, they will not capture even minimal word context)?

PS: I am creating a baseline for humor detection, and I am required to use CountVectorizer.


3 Answers


    In addition to @Erwan's recommendations, consider the following:

    1. Use a stopword list (e.g., stop_words='english'). This will reduce the number of features.

    2. CountVectorizer does not support batch learning. If you chunk your input and repeat the fit call, you reset the vocabulary each time; the calls are not additive.

    3. Finally, and probably most importantly: why do you want to convert your sparse matrix to dense at all? That is what the .toarray() call does. If you want to examine the features for a given sentence, call transform on that single sentence and convert only that result to an array.

    Most (I've not examined all) sklearn clustering and classification algorithms will take the sparse matrix directly.
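    As a minimal sketch along these lines (the 'label' column name is an assumption, the rest mirrors the question's setup): fit the vectorizer once on the whole corpus, pass the sparse matrix to the classifier as-is, and only densify a single row when you want to inspect it.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    df_train = pd.read_csv('data/train.csv')

    # Fit once on the full corpus; an English stopword list trims the feature count.
    vectorizer = CountVectorizer(stop_words='english')
    X = vectorizer.fit_transform(df_train.text)     # stays a scipy sparse matrix

    # Decision trees (and most sklearn estimators) accept the sparse matrix directly,
    # so the huge dense array from .toarray() is never created.
    # 'label' is a hypothetical column name for the humor target.
    clf = DecisionTreeClassifier().fit(X, df_train.label)

    # To examine the features of one sentence, transform just that sentence
    # and densify only that single row.
    row = vectorizer.transform([df_train.text.iloc[0]]).toarray()
    print(row.shape)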


      As @Erwan said, the use of bigrams is a major source of the huge vocabulary, so keeping ngram_range at its default of (1, 1) reduced the vocabulary size significantly.

      Further CountVectorizer parameters [1] that can reduce the vocabulary size:

      • min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.

      • max_features: If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

      • vocabulary: Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. (In short, this lets you explicitly specify the vocabulary.)

      There is also max_df, analogous to min_df, but it is probably less useful here: it may not be a good idea to ignore highly frequent words.

      Just for comparison, setting max_features to 2500 and training a multinomial naive Bayes classifier gave 85% accuracy; a rough sketch of that setup follows.
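      This is a sketch only: it assumes df_train is loaded as in the question and has a hypothetical label column holding the humor target, and the exact accuracy will depend on the data and the split.

      import pandas as pd
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.metrics import accuracy_score

      df_train = pd.read_csv('data/train.csv')  # as in the question

      # Unigrams only, vocabulary capped at the 2500 most frequent terms.
      vectorizer = CountVectorizer(max_features=2500, token_pattern=r'\b\w+\b')
      X = vectorizer.fit_transform(df_train.text)   # sparse matrix, shape (160000, 2500)
      y = df_train.label                            # hypothetical label column

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
      clf = MultinomialNB().fit(X_tr, y_tr)         # MultinomialNB accepts sparse counts
      print(accuracy_score(y_te, clf.predict(X_te)))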


      [1] CountVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


        The first thing you can do is to use min_df with a value of at least 2 or 3 instead of 1. This will greatly reduce the size of the vocabulary, because there are a lot of words which appear rarely (due to Zipf's law). It won't affect performance negatively, because these words are almost never useful; in fact, it often increases performance thanks to the reduced noise.

        The use of bigrams is also a huge source of increased vocabulary size, of course. I think you could use unigrams only, or use a much higher min_df. To some extent the context of a sentence can still be taken into account by the different features. It's true that this is not as precise, but too much precision can cause overfitting.
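        A quick way to see the effect is to compare vocabulary sizes; here is a sketch reusing the question's setup (the exact counts will depend on the data):

        import pandas as pd
        from sklearn.feature_extraction.text import CountVectorizer

        df_train = pd.read_csv('data/train.csv')
        texts = list(df_train.text)

        # Setup from the question: unigrams + bigrams, min_df=1.
        v1 = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1).fit(texts)

        # Unigrams only, dropping terms seen in fewer than 3 documents.
        v2 = CountVectorizer(token_pattern=r'\b\w+\b', min_df=3).fit(texts)

        # The second vocabulary should be far smaller than the first (693699 in the question).
        print(len(v1.vocabulary_), len(v2.vocabulary_))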

