
Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the apache-spark programming model to Python.

3 votes · 1 answer · 71 views

Constant feature ignored by Spark LinearRegression?

I am running a linear regression model using PySpark and came across the following weird behavior: when I include a constant feature (representing an intercept term), it is completely ignored by Spark. I ...
asked by Achrbot
3 votes · 1 answer · 64 views

Using standardization and normalization in the same pipeline

I have a PySpark ML pipeline that uses PCA reduction and an ANN. My understanding is that PCA performs best when given standardized values, while NNs perform best when given normalized values. Does it ...
asked by Mike Pone
0 votes · 0 answers · 16 views

String to number in case of having millions of unique values

I am currently working on preprocessing a big-data dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions, and I have addresses of ...
asked by Asic
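For string keys with millions of unique values (such as blockchain addresses), a common alternative to building a full label-to-index map is feature hashing, which PySpark exposes via `FeatureHasher`. A minimal plain-Python sketch of the idea — the helper name and bucket count are illustrative, and collisions are an accepted trade-off:

```python
import hashlib

def hash_encode(value: str, num_buckets: int = 2**20) -> int:
    """Map an arbitrary string to a stable integer bucket.

    No dictionary of unique values is ever built; num_buckets trades
    memory for collision rate (hash collisions merge distinct strings).
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Hypothetical address; the encoding is deterministic across runs,
# unlike Python's built-in hash() with randomized seeding.
addr = "0xde0b295669a9fd93d5f28d9ec85e40f4cb697bae"
assert hash_encode(addr) == hash_encode(addr)
assert 0 <= hash_encode(addr) < 2**20
```

Because the mapping is stateless, it parallelizes trivially across Spark partitions, with no shared lookup table to broadcast.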
0 votes · 0 answers · 23 views

Copy column of PySpark Levenshtein distances into upper/lower triangular array without Python looping

I am following this example to generate 450+ million Levenshtein distances between 30,011 text labels of up to 20 characters. The problem is that the distance values are organized into a single ...
asked by user2153235
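Mapping a flat column of pairwise distances into a triangular layout without Python-level looping usually comes down to condensed-index arithmetic (the same convention SciPy's `squareform` uses). A small sketch, assuming pairs (i, j) with i < j are enumerated row by row:

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j), i < j, of n labels in the flat
    upper-triangular (condensed) distance array of length n*(n-1)//2."""
    if not 0 <= i < j < n:
        raise ValueError("expects 0 <= i < j < n")
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... entries; closed form below.
    return i * n - i * (i + 1) // 2 + (j - i - 1)

# With n = 4 labels there are 4*3//2 = 6 pairs, enumerated row by row:
n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
assert [condensed_index(i, j, n) for i, j in pairs] == list(range(6))
```

The same closed-form expression can be applied column-wise in Spark (e.g. as an expression over the i and j columns), so no row-by-row Python loop is needed.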
0 votes · 0 answers · 80 views

Avoid memory use by text labels in PySpark's Levenshtein distance calculation

I have a pandas Series of 30,011 text ID labels of up to 20 characters each. These are manually typed, so there are many typos and creative embellishments applied on an inconsistent basis. I plan to ...
asked by user2153235
0 votes · 0 answers · 37 views

How to determine the best number of cores and memory for Spark job

How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
asked by Keyser
1 vote · 1 answer · 48 views

Predicted output is only 0s

I am developing a neural network using the Home Credit Default Risk dataset. The prediction should be between 0.0 and 1.0, but my algorithm's outcome is just 0.0 for every row. My code ...
asked by Erevos
0 votes · 0 answers · 68 views

Why does the PySpark `BinaryClassificationEvaluator` metric `areaUnderROC` return slightly different values across multiple evaluations on the same dataset?

I am using BinaryClassificationEvaluator in PySpark to calculate AUC; however, I find that the returned AUC values across multiple evaluations on the same dataset are ...
asked by helloworld
0 votes · 0 answers · 15 views

Jar files downloading very slowly in Jupyter notebook on MacBook (M2 Pro)

Required jar files downloading from the Maven repository in a Jupyter notebook are very slow on a MacBook (M2 Pro). How can I increase the download speed?
asked by Tovlk
0 votes · 1 answer · 75 views

Any interface/library that can take Python ML code and run it on a Spark cluster without learning PySpark?

I have been working with Python for machine learning and have a fair amount of code written in Python using libraries such as scikit-learn, pandas, and numpy. Recently, I’ve been faced with larger ...
asked by Mohith7548
1 vote · 1 answer · 898 views

Convert a date string to UTC timezone in PySpark

My input is "20220212", and I should get output like "2022-02-12T00:00:00+00:00". I have written the following ...
asked by Monisha Sivanathan
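The conversion asked for above ("20220212" → "2022-02-12T00:00:00+00:00") can be sketched in plain Python first; the helper name is illustrative, and in PySpark the same transformation is typically done with `to_timestamp` on the pattern `yyyyMMdd` followed by date formatting:

```python
from datetime import datetime, timezone

def yyyymmdd_to_utc_iso(s: str) -> str:
    """Parse a compact yyyyMMdd date string and render it as UTC ISO-8601."""
    dt = datetime.strptime(s, "%Y%m%d").replace(tzinfo=timezone.utc)
    return dt.isoformat()

print(yyyymmdd_to_utc_iso("20220212"))  # → 2022-02-12T00:00:00+00:00
```

A function like this could also be registered as a UDF, though built-in Spark SQL date functions avoid the Python serialization overhead.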
1 vote · 1 answer · 177 views

Is it valid to use Spark's StandardScaler on sparse input?

While I know it's possible to use StandardScaler on a SparseVector column, I wonder now if this is a valid transformation. My reason is that the output (most likely) will not be sparse. For example, ...
asked by user12138762
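The concern in this question — that mean-centering a sparse vector fills in its zeros — is easy to demonstrate without Spark; it is also why Spark's StandardScaler defaults to `withMean=False`, scaling by the standard deviation only so sparsity is preserved. A plain-Python sketch of the effect of centering:

```python
# A sparse-ish column: mostly zeros, two nonzero entries.
col = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 3.0]
mean = sum(col) / len(col)  # 1.0

centered = [x - mean for x in col]

nnz_before = sum(1 for x in col if x != 0.0)       # 2 nonzeros
nnz_after = sum(1 for x in centered if x != 0.0)   # 8: every zero became -1.0

print(nnz_before, nnz_after)  # → 2 8
```

Scaling by the standard deviation alone multiplies each entry by a constant, so zeros stay zero and the sparse representation survives.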
1 vote · 0 answers · 54 views

Not able to read data from MongoDB for the below schema [closed]

I am trying to read a very complex JSON from MongoDB. I have tried multiple ways but had no luck. Sample schema below: ...
asked by sai
1 vote · 0 answers · 156 views

Why is a GPU-accelerated node much slower than a CPU node for training a random forest model on Databricks?

I have a dataset of about 5 million rows with 14 features and a binary target. I decided to train a PySpark random forest classifier on Databricks. The CPU cluster I created contains 2 c4.8xlarge workers ...
asked by Zhenyu Zhang
2 votes · 0 answers · 2k views

Proper way to store a dict as a column value in Parquet format using PySpark

I have a requirement to store a nested list of JSON objects in a column by doing a JOIN between two datasets related by a one-to-many relation. Example: Stack Overflow posts (each question can have one ...
asked by vangap
