I am trying to use `CountVectorizer` to obtain numerical word representations of my data, which is essentially a list of 160000 English sentences:
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('data/train.csv')
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
X = vectorizer.fit_transform(list(df_train.text))
```
Printing `X` then gives:
```
>>> X
<160000x693699 sparse matrix of type '<class 'numpy.int64'>'
    with 3721191 stored elements in Compressed Sparse Row format>
```
But converting the whole matrix to a dense array, to get the numerical word representation of all the data, gives:
```
>>> X.toarray()
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11636/854451212.py in <module>
----> 1 X.toarray()

c:\users\crrma\.virtualenvs\humor-detection-2-8vpiokuk\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
   1037         if out is None and order is None:
   1038             order = self._swap('cf')[0]
-> 1039         out = self._process_toarray_args(order, out)
   1040         if not (out.flags.c_contiguous or out.flags.f_contiguous):
   1041             raise ValueError('Output array must be C or F contiguous')

c:\users\crrma\.virtualenvs\humor-detection-2-8vpiokuk\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
   1200             return out
   1201         else:
-> 1202             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1203
   1204

MemoryError: Unable to allocate 827. GiB for an array with shape (160000, 693699) and data type int64
```
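For scale, the 827 GiB in the error follows directly from the matrix shape and dtype; a quick sanity check:

```python
# Sanity check: a dense int64 array of shape (160000, 693699) needs
# rows * cols * 8 bytes, which matches the 827 GiB in the MemoryError.
rows, cols = 160000, 693699
print(rows * cols * 8 / 1024**3)  # ~826.9 GiB
# The sparse matrix only stores the 3721191 non-zero counts, so it fits easily.
```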
The example on the linked scikit-learn doc page uses only five sentences, so for them `X.toarray()` returns the array of numerical word representations without trouble. But since my dataset contains 160000 sentences, the fit results (as the error message shows) in a vocabulary of size 693699, containing every unique unigram and bigram due to the `ngram_range` parameter passed to `CountVectorizer`, and hence the insufficient-memory error.
Q1. How can I fix this? I am thinking of simply discarding `X` and transforming the data in mini-batches instead, as shown below. Is this correct?
```
>>> X_batch = list(df_train[:10].text)  # do this for 160000 / batch_size batches
>>> X_batch_encoding = vectorizer.transform(X_batch).toarray()
>>> X_batch_encoding
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
>>> X_batch_encoding[0].shape
(693699,)
```
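For completeness, a minimal sketch of how I imagine the full mini-batch loop would look (`batch_size` is a hypothetical value I would tune to the available RAM, since each dense batch is still `batch_size x 693699` int64 values):

```python
# Hypothetical mini-batch loop: transform and densify one small batch
# at a time instead of calling X.toarray() on the whole matrix.
batch_size = 32  # hypothetical; tune so batch_size x 693699 int64 fits in RAM
texts = list(df_train.text)

for start in range(0, len(texts), batch_size):
    batch = texts[start:start + batch_size]
    X_batch_encoding = vectorizer.transform(batch).toarray()
    # ... feed X_batch_encoding to the model for this batch ...
```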
Q2. I am planning to train a neural network and a decision tree on this encoding for humor detection. But I guess it won't be a great idea to represent a single sentence with a vector of length 693699, right? If so, what should I do instead? Should I use only unigrams when fitting `CountVectorizer` (even though, unlike bigrams, they will not capture even minimal word context)?
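One alternative I am considering (not sure if it is appropriate) is keeping bigrams but capping the vocabulary via `CountVectorizer`'s own `min_df` / `max_features` options; the values below are only illustrative guesses:

```python
# Sketch: keep unigrams + bigrams but cap the vocabulary size.
# min_df and max_features values are illustrative, not tuned.
vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    token_pattern=r'\b\w+\b',
    min_df=5,            # ignore n-grams appearing in fewer than 5 sentences
    max_features=50000,  # keep only the 50000 most frequent n-grams
)
X = vectorizer.fit_transform(df_train.text)  # stays sparse; no toarray() needed
```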
PS: I am creating a baseline for humor detection, and I am required to use `CountVectorizer`.