Questions tagged [bigdata]

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

1 vote
1 answer
31 views

Calculating LOF for big data

I have a big dataset (hundreds of millions of records, dozens of GB in size) and I would like to run LOF for anomaly detection (testing different methods for academic purposes) ...
Asic's user avatar
0 votes
0 answers
16 views

String to number in case of having millions of unique values

I am currently working on preprocessing a big dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions and I have addresses of ...
Asic's user avatar
0 votes
0 answers
14 views

Stuck on loading parquet files recursively of varying size with Spark

I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying sizes. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
Ícaro Lorran's user avatar
0 votes
1 answer
24 views

How to process big data in my case?

I have 54 data files generated by a simulation. Each file has 10 million rows, and each file is several GB in size. I need to read each file, compute their autocorrelation, and fit the curve. What is ...
user366312's user avatar
0 votes
0 answers
37 views

How to determine the best number of cores and memory for Spark job

How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
Keyser's user avatar
0 votes
0 answers
113 views

Fuzzy Name Matching with Machine Learning. Input data encoding

I have a huge dataset: Last name, first name, date of birth of Indian residents and I need to match them for similarity. The matching is fuzzy, the data looks like this (names are fictitious for the ...
ккк ккк's user avatar
2 votes
2 answers
422 views

How to deal with high data volumes? (Tools, techniques, concepts, etc.)

I have some doubts about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and ...
tms's user avatar
1 vote
0 answers
21 views

How to find my own way in the data field? [closed]

I need help finding my path in the data field. I just completed my MSc in Big Data at university. During my studies, I gained approximately Junior level Python and SQL programming skills, worked with ...
Gleb's user avatar
0 votes
2 answers
55 views

Data imputation for heavily missing features

I am currently working on the IEEE-CIS Fraud Detection dataset, provided via Kaggle, with around 350 features and around 600k instances. However, some features are missing large amounts of values, ...
Hai Nguyen's user avatar
0 votes
1 answer
40 views

Can I update the source of Data found in a Data Lake or Data Blob

Is it possible to update the source of data found in a Data Lake or Data Blob? What about while using HDInsight or Azure Databricks?
JF0001's user avatar
0 votes
1 answer
163 views

Gensim: create a dictionary from a large corpus without loading it in RAM?

The topic modelling library Gensim offers the ability to stream a large document instead of storing it in memory. Streaming is possible for the stage of converting the corpus to BOW, but the ...
Erwan's user avatar
0 votes
2 answers
54 views

What kinds of ML models should I use when the outcome variable does not vary with time but only varies across individuals and groups?

I am trying to predict individuals’ income in 2018 using 18 years' worth of data for people who were born in 1978, 1979, and 1980 on many variables such as family income, location, family members’ ...
Aman Desai's user avatar
1 vote
1 answer
41 views

What is the name of my problem - distribution of counts of elements having certain attribute

I have the following problem: There is a large set of records. Each record in the set has an attribute. For some values of the attribute, there is only one record, for other values there are many ...
danatel's user avatar
2 votes
3 answers
90 views

Is big data a fallacy if most phenomena can be mostly described by few variables?

Is big data a fallacy if most phenomena can be mostly described by few variables? This has confused me. Surely there are big data sets, but there are also cases when the set of significant or ...
mavavilj's user avatar
0 votes
2 answers
71 views

How to manage large datasets (approx 95GB)

I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...
apvn's user avatar
