
Questions tagged [sampling]

2 votes
0 answers
56 views

How do I train a model on data where there should be a statistical difference but it can't find it?

I'm trying to create a predictive model for a dataset with continuous input variables and a binary/probability output. The inputs are sensors (up to 400 columns, though some are irrelevant) which are ...
user46124
0 votes
0 answers
8 views

Importance of resampling when establishing a cutoff for categorical data

I am reading Feature Engineering and Selection by Max Kuhn and Kjell Johnson, and on page 97, section 5.2, it says the following (my question concerns the last sentence): 'Although near-zero variance ...
horned-sphere
1 vote
0 answers
29 views

Sampling multiple masked tokens through Metropolis–Hastings

I'm trying to replicate the finding of the publication "Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis-Hastings" for obtaining the joint distribution ...
Chris
1 vote
0 answers
7 views

Optimizing Sampling Strategy to Enhance Uniformity Under Conditional Constraints

I am facing a challenge in a project that involves sampling from a design space defined by 10 variables. I use Latin Hypercube Sampling (LHS) and/or Sobol sequences, and initially, the samples are ...
Chris
4 votes
1 answer
49 views

Algorithm for picking N random, uniformly distributed samples in an irregular polygon?

Say I want to pick a fixed number of samples from a large 2D dataset, such that they are relatively evenly distributed over the whole sample area. Imagine places in a country - so the border of the data is ...
barryhunter
1 vote
1 answer
227 views

Top_p parameter in langchain

I am trying to understand the top_p parameter in langchain (nucleus sampling) but I can't seem to grasp it. Based on this we sort the probabilities and select a ...
Labyrinthian
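For readers unfamiliar with the term, nucleus (top_p) sampling as it is commonly described can be sketched in a few lines. This is an illustration only and is not taken from langchain or from any model provider's code: the function names `top_p_filter` and `sample_top_p` and the toy distribution are hypothetical, and the sketch assumes the cutoff keeps the smallest set of highest-probability tokens whose cumulative probability reaches top_p.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize (hypothetical helper)."""
    # Sort tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}

def sample_top_p(probs, top_p, rng=random):
    """Draw one token from the renormalized nucleus (hypothetical helper)."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution, invented for illustration.
toy = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "emu": 0.05}
nucleus = top_p_filter(toy, 0.75)
print(nucleus)
print(sample_top_p(toy, 0.75))
```

With this toy distribution and top_p = 0.75, only "cat" and "dog" survive the cutoff, so lower values of top_p restrict sampling to fewer, more probable tokens.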
1 vote
1 answer
160 views

Correct way to take a subset of a dataset?

I am attempting a binary classification problem (using Weka). My dataset has 100,000 rows, 14 attributes (1 output variable). It already takes too long just to open the dataset in Excel so I just know ...
FlexMcMurphy
1 vote
1 answer
3k views

Why is 0.7, in general, the default value of temperature for LLMs?

I have recently read through a lot of documentation and articles about Large Language Models (LLMs), and I have come to the conclusion that 0.7 is, most of the time, the default value for the ...
jmpion
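As background on what the temperature parameter does, here is a minimal sketch of temperature-scaled softmax. It does not explain why 0.7 is a common default; it only shows how dividing the logits by the temperature sharpens or flattens the resulting distribution. The function name `softmax_with_temperature` and the example logits are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature
    (hypothetical helper; temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits, invented for illustration.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring token, while temperatures above 1 spread it more evenly across the alternatives.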
0 votes
1 answer
49 views

How to evaluate a model on our data when the model is imported from a library and thus not trained by us?

The company I work for has deployed a trained rule-based sentiment analyzer, VADER, to make predictions about customers' attitudes. We import the model directly from the nltk library, so we didn't train ...
Shelby
1 vote
0 answers
28 views

Calculating an integral with as few grid points as possible

Suppose I have a function $f\colon [0,1] \to \mathbb{R}$ which is maybe continuous (it's at least in $L^1$). I have a sample of $N$ points $\{x_i\}$ taken from the domain $[0,1]$ randomly from some ...
math_guy
0 votes
1 answer
58 views

Question about collapsing variable and oversampling minority classes

I have imbalanced data consisting of nine classes, and I am planning to collapse them into two classes. I performed stratified (proportionate) sampling between test, validation, and training sets ...
RyRy the Fly Guy
1 vote
0 answers
13 views

Group or find associations and orderings for elements that appear in different samples (analyzing examples of input files for undocumented code)

I'm trying to understand and use a physics simulation code that was written decades ago. It uses input files that have their origins in stacks of punch cards. In other words each line is a ...
uhoh
0 votes
1 answer
168 views

Is Logistic Regression possible using a Convenience Sample?

I've collected some survey data on homeless individuals, surveying their drug use, education level, age, gender etc. I hope to run a logistic regression to see how impactful homelessness (+other ...
JS Holding
0 votes
1 answer
243 views

Understanding bootstrapping in bias variance decomposition

I was going through an article on the bias-variance tradeoff, and it makes use of the bias_variance_decomp function from the mlxtend library. ...
Mahesha999
0 votes
1 answer
73 views

Determining the information loss due to undersampling

I have an image dataset that I need to segment into directories (train, validation and test) using ImageDataGenerator in TensorFlow/Keras. The dataset is highly imbalanced: For this I have decided to ...
Harsh Khare
