Questions tagged [bigdata]

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

454 questions

1 vote

1 answer

31 views

Calculating LOF for big data

I have a big dataset (hundreds of millions of records, dozens of GB in size) and I would like to compute LOF for anomaly detection (testing different methods for academic purposes) ...

Asic

asked Jan 6 at 14:29

0 votes

0 answers

16 views

String to number in case of having millions of unique values

I am currently preprocessing a big dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions and I have addresses of ...

Asic

asked Dec 17, 2024 at 16:03

0 votes

0 answers

14 views

Stuck on loading parquet files recursively of varying size with Spark

I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying size. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...

Ícaro Lorran

asked Dec 11, 2024 at 9:12

0 votes

1 answer

24 views

How to process big data in my case?

I have 54 data files generated by a simulation. Each file has 10 million rows, and each file is several GB in size. I need to read each file, compute their autocorrelation, and fit the curve. What is ...

user366312

asked Jun 10, 2024 at 22:02

0 votes

0 answers

37 views

How to determine the best number of cores and memory for Spark job

How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...

Keyser

asked Jun 7, 2024 at 5:28

0 votes

0 answers

113 views

Fuzzy Name Matching with Machine Learning. Input data encoding

I have a huge dataset of last names, first names, and dates of birth of Indian residents, and I need to match the records by similarity. The matching is fuzzy, the data looks like this (names are fictitious for the ...

ккк ккк

asked May 14, 2024 at 7:09

2 votes

2 answers

422 views

How to deal with high data volumes? (Tools, techniques, concepts, etc.)

I have some doubts about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and ...

tms

asked Nov 15, 2023 at 15:34

1 vote

0 answers

21 views

How to find own way in data field? [closed]

I need help finding my path in the data field. I just completed my MSc in Big Data at university. During my studies, I gained approximately Junior level Python and SQL programming skills, worked with ...

Gleb

asked Jun 23, 2023 at 10:21

0 votes

2 answers

55 views

Data imputation for heavily missing features

I am currently working on the dataset IEEE-CIS Fraud Detection, provided via Kaggle, with around 350 features, with around 600k instances. However, some features are missing large amounts of values, ...

Hai Nguyen

asked Apr 9, 2023 at 19:40

0 votes

1 answer

40 views

Can I update the source of Data found in a Data Lake or Data Blob

Is it possible to update the source of data found in a Data Lake or Data Blob? What about while using HDInsight or Azure Databricks?

JF0001

asked Dec 8, 2022 at 3:21

0 votes

1 answer

163 views

Gensim: create a dictionary from a large corpus without loading it in RAM?

The topic modelling library Gensim offers the ability to stream a large document instead of storing it in memory. Streaming is possible for the stage of converting the corpus to BOW, but the ...

Erwan

asked Nov 26, 2022 at 19:33

0 votes

2 answers

54 views

What kinds of ML models should I use when the outcome variable does not vary with time but varies only across individuals and groups?

I am trying to predict individuals’ income in 2018 using 18 years’ worth of data for people who were born in 1978, 1979, and 1980 on many variables such as family income, location, family members’ ...

Aman Desai

asked Nov 7, 2022 at 20:12

1 vote

1 answer

41 views

What is the name of my problem - distribution of counts of elements having certain attribute

I have the following problem: There is a large set of records. Each record in the set has an attribute. For some values of the attribute, there is only one record, for other values there are many ...

danatel

asked Oct 25, 2022 at 12:54

2 votes

3 answers

90 views

Is big data a fallacy if most phenomena can be mostly described by few variables?

Is big data a fallacy if most phenomena can be mostly described by few variables? This has confused me. Surely there are big data sets, but there are also cases when the set of significant or ...

mavavilj

asked Oct 25, 2022 at 8:49

0 votes

2 answers

71 views

How to manage large datasets (approx 95GB)

I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...

apvn

asked Oct 12, 2022 at 13:44
