
I'm trying to pre-compute similarity scores for a 350k-record dataset, but the calculation is very slow because of the number of categories, the time-similarity calculation, and the string-similarity processing. I've already implemented several optimizations: GPU acceleration, a lower-bound threshold, computing only the upper half of the similarity matrix, and float32 data types. Despite this, the process is still estimated to take days to complete on a single A100 machine. Any suggestions on how to improve performance further so it finishes in hours rather than days?
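For reference, the kind of chunked, upper-triangle, float32 pass I mean looks roughly like this (a simplified sketch, not my actual code: cosine similarity and PyTorch stand in for the real combined scoring, and the `chunk_size` / `threshold` values are illustrative):

```python
import numpy as np
import torch


def chunked_pairwise_similarity(features: np.ndarray,
                                chunk_size: int = 4096,
                                threshold: float = 0.5):
    """Yield (row_idx, col_idx, score) arrays for pairs above `threshold`.

    Processes the matrix in row chunks, keeps only the strict upper
    triangle, and stays in float32 to limit GPU memory use.
    """
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x = torch.from_numpy(features.astype(np.float32)).to(device)
    x = torch.nn.functional.normalize(x, dim=1)   # unit rows -> dot product == cosine
    n = x.shape[0]

    for start in range(0, n, chunk_size):
        rows = x[start:start + chunk_size]
        cols = x[start:]                          # only columns >= start (upper half)
        block = rows @ cols.T                     # (chunk, n - start) block of scores
        block = torch.triu(block, diagonal=1)     # drop the diagonal and lower part
        keep = block >= threshold                 # assumes threshold > 0
        r_idx, c_idx = keep.nonzero(as_tuple=True)
        yield (start + r_idx.cpu().numpy(),
               start + c_idx.cpu().numpy(),
               block[r_idx, c_idx].cpu().numpy())
```

Even with this structure, the upper triangle of a 350k × 350k matrix is roughly 6 × 10^10 pairs, which is why the total runtime is still measured in days.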

Part of the code:

```python
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer


def calculate_time_similarity(date1: str, date2: str, time_decay_factor: float = 0.1) -> float:
    """
    Calculate similarity between two dates with exponential decay.

    Args:
        date1: First release date (format: YYYY/MM/DD)
        date2: Second release date (format: YYYY/MM/DD)
        time_decay_factor: Controls how quickly similarity decays with time difference

    Returns:
        Similarity score between 0 and 1
    """
    try:
        d1 = datetime.strptime(date1, '%Y/%m/%d')
        d2 = datetime.strptime(date2, '%Y/%m/%d')

        # Calculate time difference in years
        time_diff = abs((d1 - d2).days) / 365.0

        # Exponential decay formula: exp(-λt)
        similarity = np.exp(-time_decay_factor * time_diff)

        return similarity
    except (ValueError, TypeError):
        # Return 0 similarity if dates are invalid
        return 0.0


def prepare_text_features(df: pd.DataFrame) -> np.ndarray:
    """
    Process text columns using TF-IDF vectorization.

    Args:
        df: DataFrame containing text columns

    Returns:
        Combined text feature matrix
    """
    text_columns = ['description', 'title', 'series']
    text_features = {}

    for col in text_columns:
        # Initialize TF-IDF vectorizer with Japanese tokenizer
        tfidf = TfidfVectorizer(
            tokenizer=tokenize_japanese,
            max_features=1000,  # Limit features to prevent dimensionality explosion
            min_df=2,           # Minimum document frequency
            max_df=0.95         # Maximum document frequency
        )

        # Fill NA values with empty string
        text_series = df[col].fillna('')

        # Fit and transform the text data
        text_features[col] = tfidf.fit_transform(text_series)

    # Combine text features using horizontal stacking
    combined_text_features = np.hstack([
        matrix.toarray() for matrix in text_features.values()
    ])

    return combined_text_features


def prepare_categorical_features(df: pd.DataFrame) -> np.ndarray:
    """
    Prepare categorical features for content-based recommendation.

    Args:
        df: DataFrame containing categorical features

    Returns:
        Combined categorical feature matrix
    """
    feature_matrices = {}

    # Process list-based columns
    list_columns = ['hashtags', 'genre', 'performer']
    for col in list_columns:
        mlb = MultiLabelBinarizer(sparse_output=False)  # Changed to False for dense output
        # eval() turns the stringified list back into a Python list of cleaned items
        processed_col = df[col].apply(
            lambda x: [] if pd.isna(x) or x == []
            else [str(item).strip() for item in eval(x) if str(item).strip()]
        )
        feature_matrices[col] = mlb.fit_transform(processed_col)

    # Process single-value columns
    single_columns = ['director', 'label', 'maker', 'series']
    for col in single_columns:
        dummies = pd.get_dummies(df[col], prefix=col, dummy_na=True)
        feature_matrices[col] = dummies.values

    # Combine all categorical features
    combined_features = np.hstack([matrix for matrix in feature_matrices.values()])

    return combined_features
```
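
For illustration, a vectorised variant of `calculate_time_similarity` would look roughly like this (a sketch only; `dates_to_day_offsets`, `time_similarity_block`, and the `release_date` column name are hypothetical, not in my code). It parses every date once and scores a whole block of pairs with broadcasting instead of calling `strptime` inside the O(n²) pair loop:

```python
import numpy as np
import pandas as pd


def dates_to_day_offsets(dates: pd.Series) -> np.ndarray:
    """Parse every release date once; invalid dates become NaN."""
    parsed = pd.to_datetime(dates, format='%Y/%m/%d', errors='coerce')
    return (parsed - parsed.min()).dt.days.to_numpy(dtype=np.float32)


def time_similarity_block(days_a: np.ndarray, days_b: np.ndarray,
                          time_decay_factor: float = 0.1) -> np.ndarray:
    """exp(-lambda * |years|) for every pair between two blocks of items."""
    diff_years = np.abs(days_a[:, None] - days_b[None, :]) / 365.0
    sims = np.exp(-time_decay_factor * diff_years).astype(np.float32)
    return np.nan_to_num(sims, nan=0.0)  # invalid dates -> 0 similarity, as before


# Usage: convert the date column once, then score one row chunk against all items.
# days = dates_to_day_offsets(df['release_date'])        # column name illustrative
# block = time_similarity_block(days[i:i + chunk], days)
```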
