Questions tagged [pyspark]
The Spark Python API (PySpark) exposes the apache-spark programming model to Python.
129 questions
3 votes
1 answer
71 views
Constant feature ignored by Spark LinearRegression?
I am running a linear regression model using PySpark and came across the following odd behavior: when I include a constant feature (representing an intercept term), it is completely ignored by Spark. I....
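A minimal sketch reproducing the likely setup, with toy data and column names as assumptions; Spark standardizes features by default, and a zero-variance column cannot be standardized, which is consistent with the constant being dropped:

```python
# Hypothetical toy data: a hand-made constant column alongside Spark's own
# intercept handling.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], ["x", "y"])

# A zero-variance column cannot be standardized, and LinearRegression
# standardizes features internally by default, so the constant's
# coefficient comes out 0.
df = df.withColumn("const", F.lit(1.0))
train = VectorAssembler(inputCols=["x", "const"],
                        outputCol="features").transform(df)

# Letting Spark fit the intercept itself (the default) sidesteps the issue.
lr = LinearRegression(featuresCol="features", labelCol="y", fitIntercept=True)
model = lr.fit(train)
print(model.coefficients, model.intercept)
```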
3 votes
1 answer
64 views
using Standardization and Normalization in the same pipeline
I have a PySpark ML pipeline that uses PCA reduction and an ANN. My understanding is that PCA performs best when given standardized values, while NNs perform best when given normalized values. Does it ...
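A hedged sketch of chaining both scalers in one Pipeline, standardizing before PCA and min-max normalizing the PCA output for the network; column names, k, and layer sizes are assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, MinMaxScaler, PCA
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Standardize the raw features for PCA.
std = StandardScaler(inputCol="features", outputCol="std_features",
                     withMean=True, withStd=True)
pca = PCA(k=10, inputCol="std_features", outputCol="pca_features")
# Min-max normalize the PCA output for the network.
norm = MinMaxScaler(inputCol="pca_features", outputCol="nn_features")
ann = MultilayerPerceptronClassifier(featuresCol="nn_features",
                                     labelCol="label",
                                     layers=[10, 8, 2])  # input size == k
pipeline = Pipeline(stages=[std, pca, norm, ann])
# model = pipeline.fit(train_df)  # train_df must carry "features"/"label"
```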
0 votes
0 answers
16 views
String to number in case of having millions of unique values
I am currently preprocessing a big-data dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions, and I have addresses of ...
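One hedged option when a StringIndexer label map would be huge is hashing the strings to integers with the built-in xxhash64 (collisions are possible but rare); the address values are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("0xabc...",), ("0xdef...",)], ["address"])

# xxhash64 maps each string to a 64-bit integer without building a
# dictionary of millions of unique values on the driver.
df = df.withColumn("address_id", F.xxhash64("address"))
df.show()
```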
0 votes
0 answers
23 views
Copy column of PySpark Levenshtein distances into upper/lower triangular array without Python looping
I am following this example to generate 450+ million Levenshtein distances between 30,011 text labels of up to 20 characters. The problem is that the distance values are organized into a single ...
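A hedged sketch that keeps the triangle logic inside Spark rather than looping in Python: give each label a contiguous 0-based index, self-join on i < j, and emit (i, j, dist) triples that map directly onto an upper-triangular array; the labels are toy data:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
labels = spark.createDataFrame([("alpha",), ("alphq",), ("beta",)], ["label"])

# row_number over a single window yields contiguous indices (fine at 30k
# rows, though the numbering step runs in one partition).
w = Window.orderBy("label")
idx = labels.withColumn("i", F.row_number().over(w) - 1)

pairs = (idx.alias("a").join(idx.alias("b"), F.col("a.i") < F.col("b.i"))
         .select(F.col("a.i").alias("i"), F.col("b.i").alias("j"),
                 F.levenshtein("a.label", "b.label").alias("dist")))
pairs.show()  # each (i, j) lands directly at triangle position [i][j]
```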
0 votes
0 answers
80 views
Avoid memory use by text labels in PySpark's Levenshtein distance calculation
I have a pandas Series of 30,011 text ID labels of up to 20 characters each. These are manually typed, so there are many typos and creative embellishments applied on an inconsistent basis. I plan to ...
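A hedged sketch of one way to bound the working set: carry compact integer IDs through the cross join and drop the label strings as soon as the distance is computed; the labels shown are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# In practice this would come from the pandas Series, e.g.
# spark.createDataFrame(series.to_frame("label")).
labels = spark.createDataFrame([("LBL-001",), ("LBL-0O1",)], ["label"])
ids = labels.withColumn("id", F.monotonically_increasing_id())

dists = (ids.alias("a").crossJoin(ids.alias("b"))
         .filter(F.col("a.id") < F.col("b.id"))
         .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"),
                 F.levenshtein("a.label", "b.label").alias("dist")))
# Labels are dropped right after the distance is computed; only the compact
# (id_a, id_b, dist) rows survive into downstream stages.
```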
0 votes
0 answers
37 views
How to determine the best number of cores and memory for Spark job
How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
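There is no exact formula, but a common rule of thumb is roughly 5 cores per executor, with executor memory plus overhead sized to fit the node; every number in this sketch is an assumption to tune against data volume and job frequency:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.cores", "5")         # ~5 avoids I/O contention
         .config("spark.executor.memory", "18g")      # per-executor heap
         .config("spark.executor.memoryOverhead", "2g")
         .config("spark.executor.instances", "10")    # scale with data volume
         .config("spark.sql.shuffle.partitions", "200")  # ~2-3x total cores
         .getOrCreate())
```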
1 vote
1 answer
48 views
Predicted output is only 0s
I am developing a neural network using the Home Credit Default Risk dataset. The prediction should be between 0.0 and 1.0, but my algorithm's output is just 0.0 for every row. My code ...
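One common culprit on this dataset, offered as a guess rather than a diagnosis, is class imbalance (TARGET is heavily skewed toward 0), which can push a model to predict all zeros; a hedged first check, with the file name and column following the Home Credit data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("application_train.csv", header=True, inferSchema=True)

df.groupBy("TARGET").count().show()   # inspect the imbalance first

# One common remedy: a per-row weight column, which several Spark ML
# classifiers accept via weightCol (e.g. LogisticRegression).
pos_rate = df.filter(F.col("TARGET") == 1).count() / df.count()
df = df.withColumn(
    "weight",
    F.when(F.col("TARGET") == 1, 1.0 - pos_rate).otherwise(pos_rate))
```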
0 votes
0 answers
68 views
Why does PySpark's `BinaryClassificationEvaluator` metric `areaUnderROC` return slightly different values across multiple evaluations on the same dataset?
I am using BinaryClassificationEvaluator in PySpark to calculate AUC; however, I find that the returned AUC values across multiple evaluations on the same dataset are ...
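A hedged sketch of the usual explanation: the evaluator is deterministic on a fixed input, but a lazily recomputed predictions DataFrame whose lineage contains randomness can present slightly different rows to each evaluate() call, so materializing it first pins the result:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0),
     (Vectors.dense([0.2]), 0.0), (Vectors.dense([0.9]), 1.0)],
    ["features", "label"])

model = LogisticRegression().fit(data)
predictions = model.transform(data).cache()
predictions.count()  # materialize so repeated evaluations see identical rows

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(evaluator.evaluate(predictions), evaluator.evaluate(predictions))
```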
0 votes
0 answers
15 views
JAR files download very slowly in a Jupyter notebook on a MacBook (M2 Pro)
The required JAR files downloaded from the Maven repository in a Jupyter notebook are very slow on a MacBook (M2 Pro). How can I increase the download speed?
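A hedged workaround: point Ivy at a closer or faster mirror via spark.jars.repositories, or pre-download the JARs and pass them with spark.jars; the package coordinates and mirror URL below are examples, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Example package; substitute whatever you are actually resolving.
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
         # Additional remote repository tried during Ivy resolution.
         .config("spark.jars.repositories",
                 "https://maven-central.storage-download.googleapis.com/maven2")
         .getOrCreate())
# Alternatively, download the JARs once and pass local paths via
# spark.jars so no network resolution happens at notebook startup.
```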
0 votes
1 answer
75 views
Any interface/library that can take Python ML code and run it on a Spark cluster without learning PySpark?
I have been working with Python for machine learning and have a fair amount of code written in Python using libraries such as scikit-learn, pandas, and numpy. Recently, I’ve been faced with larger ...
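Two low-friction routes worth knowing, sketched with hypothetical file and variable names: the pandas API on Spark (pyspark.pandas) keeps pandas-style code, and the joblibspark backend distributes scikit-learn model selection:

```python
# Route 1: pandas API on Spark -- pandas-like syntax, distributed execution.
import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")   # hypothetical file
pdf["z"] = pdf["x"] * 2         # familiar pandas idioms run on Spark

# Route 2: joblibspark parallelizes sklearn model selection across the
# cluster (training data must still fit on one node).
# Requires `pip install joblibspark`.
# from joblibspark import register_spark
# from joblib import parallel_backend
# register_spark()
# with parallel_backend("spark", n_jobs=8):
#     search.fit(X, y)   # search/X/y assumed from your existing sklearn code
```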
1 vote
1 answer
898 views
Convert a date string to UTC timezone in PySpark
My input is "20220212" and I should get output like "2022-02-12T00:00:00+00:00". I have written the following ...
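A minimal sketch using built-in functions; yyyyMMdd parses the input, and the xxx pattern letter renders the zero offset as +00:00 rather than Z:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  # parse/format in UTC

df = spark.createDataFrame([("20220212",)], ["d"])
df = (df.withColumn("ts", F.to_timestamp("d", "yyyyMMdd"))
        .withColumn("iso", F.date_format("ts", "yyyy-MM-dd'T'HH:mm:ssxxx")))
df.show(truncate=False)  # iso -> 2022-02-12T00:00:00+00:00
```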
1 vote
1 answer
177 views
Is it valid to use Spark's StandardScaler on sparse input?
While I know it's possible to use StandardScaler on a SparseVector column, I wonder now if this is a valid transformation. My reason is that the output (most likely) will not be sparse. For example, ...
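A hedged illustration of the trade-off: with withMean=False (the default) the scaler only divides by the standard deviation, so zeros stay zeros and the output remains sparse; withMean=True would subtract the mean and densify:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.sparse(4, [0], [2.0]),), (Vectors.sparse(4, [1], [3.0]),)],
    ["features"])

# Scaling by std alone is a linear map that preserves zeros.
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=False, withStd=True)
model = scaler.fit(df)
model.transform(df).show(truncate=False)  # outputs stay SparseVectors
```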
1 vote
0 answers
54 views
Not able to read data from MongoDB for the below schema [closed]
I am trying to read very complex JSON from MongoDB. I have tried multiple ways but no luck. Sample schema below: ...
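Since the schema sample is elided, only the general pattern can be sketched: for deeply nested documents, schema inference often fails, so pass an explicit StructType; the option names follow the mongo-spark-connector 10.x API, and the URI, database, and fields are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical nested schema standing in for the elided sample.
schema = StructType([
    StructField("_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("name", StringType()),
    ]))),
])

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017")
      .option("database", "db").option("collection", "coll")
      .schema(schema)   # explicit schema instead of inference
      .load())
```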
1 vote
0 answers
156 views
Why is GPU accelerated node much slower than CPU node for training a random forest model on databricks?
I have a dataset about 5 million rows with 14 features and a binary target. I decided to train a pyspark random forest classifier on Databricks. The CPU cluster I created contains 2 c4.8xlarge workers ...
2 votes
0 answers
2k views
Proper way to store a dict as a column value in Parquet format using PySpark
I have a requirement to store a nested list of JSON objects in a column by doing a JOIN between two datasets related by a one-to-many relation. Example: Stack Overflow posts (each question can have one ...
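A hedged sketch of the usual pattern: build the nested objects with struct() and aggregate them with collect_list() after the join, yielding an array<struct<...>> column that Parquet stores natively; all names and the output path are examples:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
questions = spark.createDataFrame([(1, "Q1")], ["qid", "title"])
answers = spark.createDataFrame([(1, "A1"), (1, "A2")], ["qid", "body"])

# One row per question, with its answers packed as an array of structs.
nested = (questions.join(answers, "qid", "left")
          .groupBy("qid", "title")
          .agg(F.collect_list(F.struct("body")).alias("answers")))

nested.write.mode("overwrite").parquet("/tmp/posts_nested")
```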