Questions tagged [bigdata]
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.
454 questions
1 vote
1 answer
31 views
Calculating LOF for big data
I have a big dataset (hundreds of millions of records, dozens of GB) and I would like to compute the Local Outlier Factor (LOF) for anomaly detection (testing different methods for academic purposes) ...
0 votes
0 answers
16 views
String to number in case of having millions of unique values
I am currently preprocessing a large dataset for ML purposes and am struggling with encoding strings as numbers. I have a dataset of blockchain transactions and I have addresses of ...
0 votes
0 answers
14 views
Stuck on loading parquet files recursively of varying size with Spark
I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying size. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
0 votes
1 answer
24 views
How to process big data in my case?
I have 54 data files generated by a simulation. Each file has 10 million rows and is several GB in size. I need to read each file, compute its autocorrelation, and fit the curve. What is ...
0 votes
0 answers
37 views
How to determine the best number of cores and memory for Spark job
How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
0 votes
0 answers
113 views
Fuzzy Name Matching with Machine Learning. Input data encoding
I have a huge dataset of last names, first names, and dates of birth of Indian residents, and I need to match them for similarity. The matching is fuzzy; the data looks like this (names are fictitious for the ...
2 votes
2 answers
422 views
How to deal with high data volumes? (Tools, techniques, concepts, etc.)
I have some doubts about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and ...
1 vote
0 answers
21 views
How to find own way in data field? [closed]
I need help finding my path in the data field. I just completed my MSc in Big Data at university. During my studies, I gained approximately Junior level Python and SQL programming skills, worked with ...
0 votes
2 answers
55 views
Data imputation for heavily missing features
I am currently working on the IEEE-CIS Fraud Detection dataset, provided via Kaggle, with around 350 features and around 600k instances. However, some features are missing large amounts of values, ...
0 votes
1 answer
40 views
Can I update the source of data found in a Data Lake or Data Blob?
Is it possible to update the source of data found in a Data Lake or Data Blob? What about while using HDInsight or Azure Databricks?
0 votes
1 answer
163 views
Gensim: create a dictionary from a large corpus without loading it in RAM?
The topic modelling library Gensim offers the ability to stream a large document instead of storing it in memory. Streaming is possible for the stage of converting the corpus to BOW, but the ...
0 votes
2 answers
54 views
What kinds of ML models should I use when the outcome variable does not vary with time but only varies across individuals and groups?
I am trying to predict individuals’ income in 2018 using 18 years' worth of data for people who were born in 1978, 1979, and 1980 on many variables such as family income, location, family members’ ...
1 vote
1 answer
41 views
What is the name of my problem - distribution of counts of elements having certain attribute
I have the following problem: There is a large set of records. Each record in the set has an attribute. For some values of the attribute, there is only one record, for other values there are many ...
2 votes
3 answers
90 views
Is big data a fallacy if most phenomena can be mostly described by few variables?
Is big data a fallacy if most phenomena can be mostly described by few variables? This has confused me. Surely there are big data sets, but there are also cases when the set of significant or ...
0 votes
2 answers
71 views
How to manage large datasets (approx 95GB)
I was planning some data analysis on a dataset I'll be using for some projects. The dataset in question is ZINC20. Now, I don't need the whole thing so I was going to write some functions that would ...