Questions tagged [sampling]
The sampling tag has no summary.
180 questions
2 votes
0 answers
56 views
How do I train a model on data where there should be a statistical difference but it can't find it?
I'm trying to create a predictive model for a dataset with continuous input variables and a binary/probability output. The inputs are sensors (up to 400 columns, but some are very irrelevant) which are ...
0 votes
0 answers
8 views
Importance of resampling when establishing a cutoff for categorical data
I am reading Feature Engineering and Selection by Max Kuhn and Kjell Johnson, and on page 97, section 5.2, it has the following (my question concerns the last sentence): 'Although near-zero variance ...
1 vote
0 answers
29 views
Sampling multiple masked tokens through Metropolis–Hastings
I'm trying to replicate the finding of the publication "Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis-Hastings" for obtaining the joint distribution ...
1 vote
0 answers
7 views
Optimizing Sampling Strategy to Enhance Uniformity Under Conditional Constraints
I am facing a challenge in a project that involves sampling from a design space defined by 10 variables. I use Latin Hypercube Sampling (LHS) and/or Sobol sequences, and initially, the samples are ...
4 votes
1 answer
49 views
Algorithm for picking N random, uniformly distributed samples in an irregular polygon?
Say I want to pick a fixed number of samples from a large 2D dataset, such that they are relatively evenly distributed over the whole sample area. Imagine places in a country - so the border of the data is ...
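One common way to read the title is "draw N points uniformly at random inside a polygon". A minimal sketch of that reading uses rejection sampling: draw points uniformly in the bounding box and keep those that pass a ray-casting point-in-polygon test. The function names and the L-shaped polygon below are hypothetical, chosen only for illustration. This does not address thinning an existing dataset, which the excerpt also mentions.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge can only be crossed if its endpoints lie on opposite sides of the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_in_polygon(polygon, n, seed=None):
    """Rejection sampling: uniform in the bounding box, keep points inside the polygon."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    samples = []
    while len(samples) < n:
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            samples.append((x, y))
    return samples

# Hypothetical L-shaped region used only as an example.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
points = sample_in_polygon(l_shape, 100, seed=0)
```

Rejection sampling is simple but wasteful when the polygon fills only a small part of its bounding box; in that case triangulating the polygon and sampling per triangle is usually a better fit.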
1 vote
1 answer
227 views
Top_p parameter in LangChain
I am trying to understand the top_p parameter in LangChain (nucleus sampling) but I can't seem to grasp it. Based on this we sort the probabilities and select a ...
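For readers with the same question, the general idea of nucleus sampling can be sketched as follows: sort the token probabilities in descending order, keep the smallest prefix whose cumulative probability reaches top_p, renormalize, and sample from what is left. This is an illustration of the technique only, not LangChain's or any model provider's code; the function name `top_p_filter` and the four-token distribution are hypothetical.

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Zero out tokens outside the nucleus and renormalize the rest."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest number of tokens whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    cutoff = min(cutoff, len(probs))
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical next-token distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.9)

# Sample one token index from the truncated distribution.
rng = np.random.default_rng(0)
token = rng.choice(len(filtered), p=filtered)
```

With top_p=0.9 the first two tokens only cover 0.8 of the mass, so the third is also kept and the least likely token is dropped before sampling.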
1 vote
1 answer
160 views
Correct way to take a subset of a dataset?
I am attempting a binary classification problem (using Weka). My dataset has 100,000 rows, 14 attributes (1 output variable). It already takes too long just to open the dataset in Excel, so I just know ...
1 vote
1 answer
3k views
Why is 0.7, in general, the default value of temperature for LLMs?
I have recently read through a lot of documentation and articles about Large Language Models (LLMs), and I have come to the conclusion that 0.7 is, most of the time, the default value for the ...
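As background for this question, temperature is usually applied by dividing the logits by the temperature before the softmax. The sketch below shows that effect only; it does not explain why any particular default is chosen. The function name and the logits are hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                    # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
p_default = softmax_with_temperature(logits, 1.0)   # plain softmax
p_cool = softmax_with_temperature(logits, 0.7)      # sharper distribution
p_hot = softmax_with_temperature(logits, 2.0)       # flatter distribution
```

Values below 1 concentrate probability on the most likely tokens, and values above 1 spread it out.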
0 votes
1 answer
49 views
How to evaluate a model on our data when the model is imported from a library and thus not trained by us?
The company I work for has deployed a trained rule-based sentiment analyzer, VADER, to make predictions on customers' attitudes. We import the model from the nltk library directly, so we didn't train ...
1 vote
0 answers
28 views
Calculating an integral with as few grid points as possible
Suppose I have a function $f\colon [0,1] \to \mathbb{R}$ which may be continuous (it's at least in $L^1$). I have a sample of $N$ points $\{x_i\}$ taken from the domain $[0,1]$ randomly from some ...
0 votes
1 answer
58 views
Question about collapsing a variable and oversampling minority classes
I have imbalanced data consisting of nine classes, and I am planning to collapse them into two classes. I performed stratified (proportionate) sampling between test, validation, and training sets ...
1 vote
0 answers
13 views
Group or find associations and orderings for elements that appear in different samples (analyzing examples of input files for undocumented code)
I'm trying to understand and use a physics simulation code that was written decades ago. It uses input files that have their origins in stacks of punch cards. In other words, each line is a ...
0 votes
1 answer
168 views
Is Logistic Regression possible using a Convenience Sample?
I've collected some survey data on homeless individuals, covering their drug use, education level, age, gender, etc. I hope to run a logistic regression to see how impactful homelessness (+other ...
0 votes
1 answer
243 views
Understanding bootstrapping in bias-variance decomposition
I was going through an article on the bias-variance tradeoff, and it makes use of the bias_variance_decomp function from the mlxtend library. ...
0 votes
1 answer
73 views
Determining the information loss due to undersampling
I have an image dataset that I need to segment into directories (train, validation and test) using ImageDataGenerator in TensorFlow/Keras. The dataset is highly imbalanced: For this I have decided to ...