Skip to content

Latest commit

 

History

History
128 lines (98 loc) · 5.98 KB

datasets-viewer-sql-console.md

File metadata and controls

128 lines (98 loc) · 5.98 KB

SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by DuckDB WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio.

To learn more about the SQL Console, see the SQL Console blog post.

Through the SQL Console, you can:

  • Run DuckDB SQL queries on the dataset (checkout SQL Snippets for useful queries)
  • Share results of the query with others via a link (check out this example)
  • Download the results of the query to a Parquet or CSV file
  • Embed the results of the query in your own webpage using an iframe
  • Query datasets with natural language
You can also use the DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.

Examples

Filtering

The SQL Console makes filtering datasets really easy. For example, if you want to filter the SkunkworksAI/reasoning-0.01 dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:

Here's the SQL to sort by length of the reasoning

SELECT*FROM train WHERE LENGTH(reasoning_chains) >10;

Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB histogram function, we can plot a histogram of a column's values.

For example, to plot a histogram of the Rating column in the Lichess/chess-puzzles dataset, you can use the following query:

Learn more about the `histogram` function and parameters here.

from histogram(train, Rating)

Regex Matching

One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the regexp function to match patterns in your data.

Using the regexp_matches function, we can filter the GeneralReasoning/GeneralThought-195k dataset for instructions that contain markdown code blocks.

Learn more about the DuckDB regex functions here.

SELECT*FROM train WHERE regexp_matches(model_answer, '```') LIMIT10;

Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.

Learn more about leakage detection here.

WITH overlapping_rows AS ( SELECT COALESCE( (SELECTCOUNT(*) AS overlap_count FROM train INTERSECT SELECTCOUNT(*) AS overlap_count FROM test), 0 ) AS overlap_count ), total_unique_rows AS ( SELECTCOUNT(*) AS total_count FROM ( SELECT*FROM train UNIONSELECT*FROM test ) combined ) SELECT overlap_count, total_count, CASE WHEN total_count >0 THEN (overlap_count *100.0/ total_count) ELSE 0 END AS overlap_percentage FROM overlapping_rows, total_unique_rows;
close