Questions tagged [pyspark]
The Spark Python API (PySpark) exposes the apache-spark programming model to Python.
129 questions
3 votes
1 answer
71 views
Constant feature ignored by Spark LinearRegression?
I am running a linear regression model using PySpark and came across the following odd behavior: when I include a constant feature (representing an intercept term), it is completely ignored by Spark. I....
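A minimal sketch reproducing the likely setup, with toy data and column names as assumptions; Spark standardizes features by default, and a zero-variance column cannot be standardized, which is consistent with the constant being dropped:

```python
# Hypothetical toy data: a hand-made constant column alongside Spark's own
# intercept handling.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], ["x", "y"])

# A zero-variance column cannot be standardized, and LinearRegression
# standardizes features internally by default, so the constant's
# coefficient comes out 0.
df = df.withColumn("const", F.lit(1.0))
train = VectorAssembler(inputCols=["x", "const"],
                        outputCol="features").transform(df)

# Letting Spark fit the intercept itself (the default) sidesteps the issue.
lr = LinearRegression(featuresCol="features", labelCol="y", fitIntercept=True)
model = lr.fit(train)
print(model.coefficients, model.intercept)
```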
3 votes
1 answer
64 views
using Standardization and Normalization in the same pipeline
I have a PySpark ML pipeline that uses PCA reduction and an ANN. My understanding is that PCA performs best when given standardized values, while NNs perform best when given normalized values. Does it ...
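A hedged sketch of chaining both scalers in one Pipeline, standardizing before PCA and min-max normalizing the PCA output for the network; column names, k, and layer sizes are assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, MinMaxScaler, PCA
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Standardize the raw features for PCA.
std = StandardScaler(inputCol="features", outputCol="std_features",
                     withMean=True, withStd=True)
pca = PCA(k=10, inputCol="std_features", outputCol="pca_features")
# Min-max normalize the PCA output for the network.
norm = MinMaxScaler(inputCol="pca_features", outputCol="nn_features")
ann = MultilayerPerceptronClassifier(featuresCol="nn_features",
                                     labelCol="label",
                                     layers=[10, 8, 2])  # input size == k
pipeline = Pipeline(stages=[std, pca, norm, ann])
# model = pipeline.fit(train_df)  # train_df must carry "features"/"label"
```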
0 votes
0 answers
16 views
String to number in case of having millions of unique values
I am currently preprocessing a big-data dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions, and I have addresses of ...
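One hedged option when a StringIndexer label map would be huge is hashing the strings to integers with the built-in xxhash64 (collisions are possible but rare); the address values are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("0xabc...",), ("0xdef...",)], ["address"])

# xxhash64 maps each string to a 64-bit integer without building a
# dictionary of millions of unique values on the driver.
df = df.withColumn("address_id", F.xxhash64("address"))
df.show()
```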
0 votes
0 answers
23 views
Copy column of PySpark Levenshtein distances into upper/lower triangular array without Python looping
I am following this example to generate 450+ million Levenshtein distances between 30,011 text labels of up to 20 characters. The problem is that the distance values are organized into a single ...
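A hedged sketch that keeps the triangle logic inside Spark rather than looping in Python: give each label a contiguous 0-based index, self-join on i < j, and emit (i, j, dist) triples that map directly onto an upper-triangular array; the labels are toy data:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
labels = spark.createDataFrame([("alpha",), ("alphq",), ("beta",)], ["label"])

# row_number over a single window yields contiguous indices (fine at 30k
# rows, though the numbering step runs in one partition).
w = Window.orderBy("label")
idx = labels.withColumn("i", F.row_number().over(w) - 1)

pairs = (idx.alias("a").join(idx.alias("b"), F.col("a.i") < F.col("b.i"))
         .select(F.col("a.i").alias("i"), F.col("b.i").alias("j"),
                 F.levenshtein("a.label", "b.label").alias("dist")))
pairs.show()  # each (i, j) lands directly at triangle position [i][j]
```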
0 votes
0 answers
80 views
Avoid memory use by text labels in PySpark's Levenshtein distance calculation
I have a pandas Series of 30,011 text ID labels of up to 20 characters each. These are manually typed, so there are many typos and creative embellishments applied on an inconsistent basis. I plan to ...
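A hedged sketch of one way to bound the working set: carry compact integer IDs through the cross join and drop the label strings as soon as the distance is computed; the labels shown are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# In practice this would come from the pandas Series, e.g.
# spark.createDataFrame(series.to_frame("label")).
labels = spark.createDataFrame([("LBL-001",), ("LBL-0O1",)], ["label"])
ids = labels.withColumn("id", F.monotonically_increasing_id())

dists = (ids.alias("a").crossJoin(ids.alias("b"))
         .filter(F.col("a.id") < F.col("b.id"))
         .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"),
                 F.levenshtein("a.label", "b.label").alias("dist")))
# Labels are dropped right after the distance is computed; only the compact
# (id_a, id_b, dist) rows survive into downstream stages.
```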
0 votes
0 answers
37 views
How to determine the best number of cores and memory for Spark job
How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
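There is no exact formula, but a common rule of thumb is roughly 5 cores per executor, with executor memory plus overhead sized to fit the node; every number in this sketch is an assumption to tune against data volume and job frequency:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.cores", "5")         # ~5 avoids I/O contention
         .config("spark.executor.memory", "18g")      # per-executor heap
         .config("spark.executor.memoryOverhead", "2g")
         .config("spark.executor.instances", "10")    # scale with data volume
         .config("spark.sql.shuffle.partitions", "200")  # ~2-3x total cores
         .getOrCreate())
```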
1 vote
1 answer
48 views
Predicted output is only 0s
I am developing a neural network using the Home Credit Default Risk dataset. The prediction should be between 0.0 and 1.0, but my algorithm's output is just 0.0 for every row. My code ...
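One common culprit on this dataset, offered as a guess rather than a diagnosis, is class imbalance (TARGET is heavily skewed toward 0), which can push a model to predict all zeros; a hedged first check, with the file name and column following the Home Credit data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("application_train.csv", header=True, inferSchema=True)

df.groupBy("TARGET").count().show()   # inspect the imbalance first

# One common remedy: a per-row weight column, which several Spark ML
# classifiers accept via weightCol (e.g. LogisticRegression).
pos_rate = df.filter(F.col("TARGET") == 1).count() / df.count()
df = df.withColumn(
    "weight",
    F.when(F.col("TARGET") == 1, 1.0 - pos_rate).otherwise(pos_rate))
```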
0 votes
0 answers
68 views
Why does PySpark's `BinaryClassificationEvaluator` metric `areaUnderROC` return slightly different values across multiple evaluations on the same dataset?
I am using BinaryClassificationEvaluator in PySpark to calculate AUC; however, I find that the returned AUC values across multiple evaluations on the same dataset are ...
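A hedged sketch of the usual explanation: the evaluator is deterministic on a fixed input, but a lazily recomputed predictions DataFrame whose lineage contains randomness can present slightly different rows to each evaluate() call, so materializing it first pins the result:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0), (Vectors.dense([1.0]), 1.0),
     (Vectors.dense([0.2]), 0.0), (Vectors.dense([0.9]), 1.0)],
    ["features", "label"])

model = LogisticRegression().fit(data)
predictions = model.transform(data).cache()
predictions.count()  # materialize so repeated evaluations see identical rows

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(evaluator.evaluate(predictions), evaluator.evaluate(predictions))
```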
0 votes
0 answers
15 views
JAR files download very slowly in a Jupyter notebook on a MacBook (M2 Pro)
The required JAR files downloaded from the Maven repository in a Jupyter notebook are very slow on a MacBook (M2 Pro). How can I increase the download speed?
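A hedged workaround: point Ivy at a closer or faster mirror via spark.jars.repositories, or pre-download the JARs and pass them with spark.jars; the package coordinates and mirror URL below are examples, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Example package; substitute whatever you are actually resolving.
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.12:10.2.1")
         # Additional remote repository tried during Ivy resolution.
         .config("spark.jars.repositories",
                 "https://maven-central.storage-download.googleapis.com/maven2")
         .getOrCreate())
# Alternatively, download the JARs once and pass local paths via
# spark.jars so no network resolution happens at notebook startup.
```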
0 votes
1 answer
75 views
Any interface/library that can take Python ML code and run it on a Spark cluster without learning PySpark?
I have been working with Python for machine learning and have a fair amount of code written in Python using libraries such as scikit-learn, pandas, and numpy. Recently, I’ve been faced with larger ...
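Two low-friction routes worth knowing, sketched with hypothetical file and variable names: the pandas API on Spark (pyspark.pandas) keeps pandas-style code, and the joblibspark backend distributes scikit-learn model selection:

```python
# Route 1: pandas API on Spark -- pandas-like syntax, distributed execution.
import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")   # hypothetical file
pdf["z"] = pdf["x"] * 2         # familiar pandas idioms run on Spark

# Route 2: joblibspark parallelizes sklearn model selection across the
# cluster (training data must still fit on one node).
# Requires `pip install joblibspark`.
# from joblibspark import register_spark
# from joblib import parallel_backend
# register_spark()
# with parallel_backend("spark", n_jobs=8):
#     search.fit(X, y)   # search/X/y assumed from your existing sklearn code
```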
1 vote
1 answer
898 views
Convert a date string to UTC timezone in PySpark
My input is "20220212" and I should get output like "2022-02-12T00:00:00+00:00". I have written the following ...
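A minimal sketch using built-in functions; yyyyMMdd parses the input, and the xxx pattern letter renders the zero offset as +00:00 rather than Z:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")  # parse/format in UTC

df = spark.createDataFrame([("20220212",)], ["d"])
df = (df.withColumn("ts", F.to_timestamp("d", "yyyyMMdd"))
        .withColumn("iso", F.date_format("ts", "yyyy-MM-dd'T'HH:mm:ssxxx")))
df.show(truncate=False)  # iso -> 2022-02-12T00:00:00+00:00
```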
1 vote
1 answer
177 views
Is it valid to use Spark's StandardScaler on sparse input?
While I know it's possible to use StandardScaler on a SparseVector column, I wonder now if this is a valid transformation. My reason is that the output (most likely) will not be sparse. For example, ...
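A hedged illustration of the trade-off: with withMean=False (the default) the scaler only divides by the standard deviation, so zeros stay zeros and the output remains sparse; withMean=True would subtract the mean and densify:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.sparse(4, [0], [2.0]),), (Vectors.sparse(4, [1], [3.0]),)],
    ["features"])

# Scaling by std alone is a linear map that preserves zeros.
scaler = StandardScaler(inputCol="features", outputCol="scaled",
                        withMean=False, withStd=True)
model = scaler.fit(df)
model.transform(df).show(truncate=False)  # outputs stay SparseVectors
```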
1 vote
0 answers
54 views
Not able to read data from MongoDB for the below schema [closed]
I am trying to read very complex JSON from MongoDB. I have tried multiple ways but no luck. Sample schema below: ...
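Since the schema sample is elided, only the general pattern can be sketched: for deeply nested documents, schema inference often fails, so pass an explicit StructType; the option names follow the mongo-spark-connector 10.x API, and the URI, database, and fields are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical nested schema standing in for the elided sample.
schema = StructType([
    StructField("_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("name", StringType()),
    ]))),
])

spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017")
      .option("database", "db").option("collection", "coll")
      .schema(schema)   # explicit schema instead of inference
      .load())
```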
1 vote
0 answers
156 views
Why is GPU accelerated node much slower than CPU node for training a random forest model on databricks?
I have a dataset about 5 million rows with 14 features and a binary target. I decided to train a pyspark random forest classifier on Databricks. The CPU cluster I created contains 2 c4.8xlarge workers ...
2 votes
0 answers
2k views
Proper way to store a dict as a column value in Parquet format using PySpark
I have a requirement to store a nested list of JSON objects in a column by doing a JOIN between two datasets related by a one-to-many relation. Example: Stack Overflow posts (each question can have one ...
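A hedged sketch of the usual pattern: build the nested objects with struct() and aggregate them with collect_list() after the join, yielding an array<struct<...>> column that Parquet stores natively; all names and the output path are examples:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
questions = spark.createDataFrame([(1, "Q1")], ["qid", "title"])
answers = spark.createDataFrame([(1, "A1"), (1, "A2")], ["qid", "body"])

# One row per question, with its answers packed as an array of structs.
nested = (questions.join(answers, "qid", "left")
          .groupBy("qid", "title")
          .agg(F.collect_list(F.struct("body")).alias("answers")))

nested.write.mode("overwrite").parquet("/tmp/posts_nested")
```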