
Analyzing Hugging Face Datasets with Databend

Introduction

Hugging Face is currently the most popular AI community in the world, fostering innovation and collaboration among data scientists and businesses around models, datasets, and applications. It hosts a diverse range of datasets that serve not only as samples for learning and practice, but also as source data for enterprise models.

To maximize the utility of these datasets, the raw data often needs to be cleansed and archived in a data lake that provides a unified access point. Databend is designed around ETL/ELT-centric workflows, allowing data to be cleansed, transformed, and merged as it is loaded. It supports a wide range of structured and semi-structured data types, enabling direct querying and analysis of raw data in multiple formats. The enterprise edition further supports advanced features such as virtual columns and computed columns, ensuring that the data is always ready for use.

In this article, we will demonstrate how to effortlessly access datasets hosted on Hugging Face using Databend and perform simple yet efficient analysis and processing with SQL. Additionally, the examples in this post include a white-box model implemented in SQL, showcasing how to conduct category predictions within the data warehouse and validate the model's accuracy.

Accessing Hugging Face Datasets

Hugging Face offers a publicly accessible REST API and automatically converts uploaded dataset files to Parquet, making it easy to integrate with databases and data analysis tools.

On a dataset's page, the Dataset card gives an intuitive overview and preview of the data, while the Files and versions tab shows the directory structure and revision history.


Databend provides several ways to integrate with Hugging Face. In the following sections, we will use fka/awesome-chatgpt-prompts, a popular GPT prompts dataset containing 153 high-quality prompts and their corresponding scenarios, as an example of how to access Hugging Face datasets with Databend.

External Stage/Location

Many databases now support direct access to remote data files via a remote URL. This mode is applicable to Hugging Face, as well as other file services that enable HTTPS access. However, direct access via remote URLs has some drawbacks, such as unexpected responses, incorrect escaping and encoding/decoding, and the need for complex JSON parsing to locate files.

In Databend, you can query datasets on Hugging Face directly, or mount the Hugging Face filesystem as an external Location or Stage, which avoids these access issues and makes the data easy to query.

For Hugging Face, the externalLocation syntax is as follows:

externalLocation ::=
    "hf://<repo_id>[<path>]"
    CONNECTION = (
        <connection_parameters>
    )

URI format: hf://{repo_id}/path/to/file, where repo_id takes a form like fka/awesome-chatgpt-prompts. Supported connection parameters include the following (a usage sketch follows the list):

  • repo_type: The type of Hugging Face repository; defaults to dataset. Available options are dataset and model.
  • revision: The revision in Hugging Face; defaults to main. It can be a branch, tag, or commit in the repository.
  • token: The Hugging Face API token.
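
For example, a stage that mounts a private dataset at a pinned revision might look like the minimal sketch below. The repository name and token are placeholders, not values from this article:

CREATE STAGE IF NOT EXISTS stage_hf_example
    URL = 'hf://my-org/my-private-dataset/'  -- hypothetical private dataset repository
    CONNECTION = (
        repo_type = 'dataset'                -- default; use 'model' for model repositories
        revision = 'main'                    -- a branch, tag, or commit
        token = '<your_hf_api_token>'        -- placeholder API token
    );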

Querying the raw CSV data file as an external Location

The following example shows how to use Databend to directly query the raw CSV data file of fka/awesome-chatgpt-prompts and list the first 5 rows of the first column in the CSV file.

SELECT $1 FROM 'hf://fka/awesome-chatgpt-prompts/' (PATTERN => '.*csv', FILE_FORMAT => 'CSV') LIMIT 5;


Querying the transformed Parquet data file as an external Stage

As mentioned earlier, Hugging Face provides Parquet files that have been converted, and Databend also supports mounting the Hugging Face filesystem as an external Stage. The following example demonstrates its usage:

Creating a Stage

For this example, the converted files are located under the /default/train/ path on the refs%2Fconvert%2Fparquet branch (the URL-encoded form of refs/convert/parquet), and we can mount this path directly.

CREATE STAGE IF NOT EXISTS stage_huggingface_fka_prompts
    URL = 'hf://fka/awesome-chatgpt-prompts/default/train/'
    CONNECTION = (revision = 'refs%2Fconvert%2Fparquet');

Listing Corresponding Files

We can use PATTERN to list filenames in the Stage that match a specific pattern, which is convenient for querying and analysis.

LIST @stage_huggingface_fka_prompts PATTERN = '.*parquet';


Querying Data

Since we already know the columns from the earlier look at the CSV file, we can select specific columns when querying the Parquet file. Because the full prompt text is inconvenient to display in a terminal, it is truncated to 80 characters here.

SELECT act, left(prompt, 80) FROM @stage_huggingface_fka_prompts/0000.parquet LIMIT 5;


Data Science with SQL

In the previous section, we discussed how to run simple queries against remote datasets hosted on Hugging Face. In this section, we will walk through a concrete example of doing data science with SQL.

For ease of understanding, we use the Iris classification dataset, a common starting point for beginners in data science. The Iris dataset appears in R.A. Fisher's classic 1936 paper The Use of Multiple Measurements in Taxonomic Problems and can also be found in the UCI Machine Learning Repository, making it well suited to demonstrating a simple classification model.

The Hugging Face dataset used in this article is located at https://huggingface.co/datasets/scikit-learn/iris .


Mounting Data

We will mount the dataset as an external Stage in Databend:

CREATE STAGE IF NOT EXISTS stage_iris
    URL = 'hf://scikit-learn/iris/default/train/'
    CONNECTION = (revision = 'refs%2Fconvert%2Fparquet');

LIST @stage_iris PATTERN = '.*parquet';

SELECT * FROM @stage_iris/0000.parquet LIMIT 5;


Data Cleaning

The data in the external Stage is still raw data. To bring it up to the standards required for use, you can clean it during import. For the Iris dataset, typical cleaning operations include:

  • Removing the id column.
  • Converting the species column to categorical labels, i.e., integer form.
  • Converting feature columns from float64 to float32.
  • Renaming feature columns to snake case.

In Databend, data cleaning can be done during loading using the COPY INTO statement.

Creating the Corresponding Table

The following statement creates a table in Databend that matches the cleaned schema.

CREATE TABLE iris (
    sepal_length FLOAT,
    sepal_width FLOAT,
    petal_length FLOAT,
    petal_width FLOAT,
    species INT
);

Using COPY INTO to Import Data

The following statement cleans and imports the data in a single command:

COPY INTO iris
FROM (
    SELECT
        to_float32(t.sepallengthcm),
        to_float32(t.sepalwidthcm),
        to_float32(t.petallengthcm),
        to_float32(t.petalwidthcm),
        CASE
            WHEN species = 'Iris-setosa' THEN 1
            WHEN species = 'Iris-versicolor' THEN 2
            WHEN species = 'Iris-virginica' THEN 3
            ELSE NULL
        END AS species
    FROM @stage_iris t
)
FILE_FORMAT = (TYPE = PARQUET)
PATTERN = '.*parquet';
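
Because any unexpected species value falls into the ELSE NULL branch above, an optional sanity check (a small extra step, not part of the original walkthrough) is to confirm that no rows were left unmapped after the import:

SELECT COUNT(*) AS unmapped_rows
FROM iris
WHERE species IS NULL;  -- should return 0 if every label was recognized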


Basic Statistics

The following statement counts how many samples belong to each species in the cleaned data:

SELECT species, COUNT(*) AS count
FROM iris
GROUP BY species;


Basic Analysis

By executing the following statement, we can get the average sepal length for each type of iris:

SELECT species, AVG(sepal_length) as avg_sepal_length
FROM iris
GROUP BY species;
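
To get a feel for where classification thresholds on these features might come from, the following exploratory query (an extra sketch, not part of the original analysis) summarizes the range of petal_width and sepal_width for each species:

SELECT
    species,
    MIN(petal_width) AS min_petal_width,
    MAX(petal_width) AS max_petal_width,
    MIN(sepal_width) AS min_sepal_width,
    MAX(sepal_width) AS max_sepal_width
FROM iris
GROUP BY species;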


Using SQL to Predict Iris Categories

Mocking up a simple white-box model in SQL gives an intuitive feel for the model's prediction criteria and avoids interacting with an external model service.

The example below is for demonstration purposes only and does not follow a rigorous training, validation, and deployment data science workflow.

Let's assume we have received criteria for categorizing irises by their features from the work of some data scientists. Now, let's "create a model in Databend" using the following CASE WHEN statement:

SELECT
    sepal_length,
    sepal_width,
    petal_length,
    petal_width,
    species,
    CASE
        WHEN petal_width < 0.75 THEN 1
        WHEN petal_width >= 0.75 AND petal_width < 1.35 THEN 2
        WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width < 2.65 THEN 3
        WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width >= 2.65 THEN 2
        WHEN petal_width >= 0.75 AND petal_width >= 1.75 THEN 3
    END AS prediction
FROM iris;

The above SQL will append a prediction column to the selected data, showing the predicted category.


Since the Iris dataset already contains the true species labels, we can go a step further and compute the model's accuracy on the dataset with SQL, based on the prediction results:

SELECT
    COUNT(*) AS total_count,
    SUM(CASE WHEN subquery.species = subquery.prediction THEN 1 ELSE 0 END) AS correct_count,
    SUM(CASE WHEN subquery.species <> subquery.prediction THEN 1 ELSE 0 END) AS incorrect_count,
    SUM(CASE WHEN subquery.species = subquery.prediction THEN 1 ELSE 0 END) / COUNT(*) AS accuracy
FROM (
    SELECT
        sepal_length,
        sepal_width,
        petal_length,
        petal_width,
        species,
        CASE
            WHEN petal_width < 0.75 THEN 1
            WHEN petal_width >= 0.75 AND petal_width < 1.35 THEN 2
            WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width < 2.65 THEN 3
            WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width >= 2.65 THEN 2
            WHEN petal_width >= 0.75 AND petal_width >= 1.75 THEN 3
        END AS prediction
    FROM iris
) AS subquery;
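
Since the prediction logic appears twice, one possible refinement (a sketch, not from the original post) is to wrap the CASE expression in a view so that both the prediction and the accuracy queries can reuse it:

-- Wrap the white-box model in a view for reuse.
CREATE VIEW iris_prediction AS
SELECT
    *,
    CASE
        WHEN petal_width < 0.75 THEN 1
        WHEN petal_width >= 0.75 AND petal_width < 1.35 THEN 2
        WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width < 2.65 THEN 3
        WHEN petal_width >= 0.75 AND petal_width < 1.75 AND petal_width >= 1.35 AND sepal_width >= 2.65 THEN 2
        WHEN petal_width >= 0.75 AND petal_width >= 1.75 THEN 3
    END AS prediction
FROM iris;

-- The accuracy query then only needs to compare species with prediction.
SELECT
    COUNT(*) AS total_count,
    SUM(CASE WHEN species = prediction THEN 1 ELSE 0 END) / COUNT(*) AS accuracy
FROM iris_prediction;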


Summary

In this article, we demonstrated how to directly access and query Hugging Face datasets using Databend. We also showed how to preprocess data using SQL to conform to data standards. Additionally, by utilizing the CASE WHEN statement, we can easily express simple models, complete prediction tasks on the Iris dataset within SQL, and perform accuracy statistics.

All SQL scripts in the post were executed successfully in Databend Cloud. If you're interested in exploring more, just jump in and give it a shot.