PySpark

PySpark is the Python interface for Apache Spark. It supports large-scale data processing and real-time analysis with Python in a distributed environment.

For a detailed guide on how to analyze datasets on the Hub with PySpark, see this blog post.

To start working with Parquet files in PySpark, first add the file to the Spark context. The following example reads a single Parquet file:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("WineReviews").getOrCreate()

# Add the Parquet file to the Spark context
spark.sparkContext.addFile("https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet")

# Read the Parquet file into a DataFrame
df = spark.read.parquet(SparkFiles.get("0.parquet"))
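The name passed to `SparkFiles.get` is the file's base name. In the example above that is the last path segment of the URL given to `addFile`. A minimal standard-library sketch of deriving that name from a URL, assuming Spark stores the file under that last segment as the example relies on; the helper `local_name` is hypothetical and not part of PySpark:

```python
from posixpath import basename
from urllib.parse import urlparse


def local_name(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return basename(urlparse(url).path)


url = "https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet/default/train/0.parquet"
name = local_name(url)
print(name)  # 0.parquet
```

With a Spark session available, `SparkFiles.get(local_name(url))` would then resolve the path without hard-coding the file name.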

If your dataset is sharded into multiple Parquet files, you need to add each file to the Spark context individually. Here is how:

import requests

# Fetch the URLs of the Parquet files for the train split
r = requests.get('https://huggingface.co/api/datasets/james-burton/wine_reviews/parquet')
train_parquet_files = r.json()['default']['train']

# Add each Parquet file to the Spark context
for url in train_parquet_files:
    spark.sparkContext.addFile(url)

# Read all Parquet files into a single DataFrame
df = spark.read.parquet(SparkFiles.getRootDirectory() + "/*.parquet")
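The example above indexes the JSON listing by configuration name (`default`) and then by split name (`train`). A small sketch of a helper that does the same lookup and fails with a clearer message when the configuration or split is missing; the function `select_parquet_urls` and the sample listing with `example.com` URLs are hypothetical, made up only to mirror the nesting of `r.json()`:

```python
def select_parquet_urls(listing: dict, config: str = "default", split: str = "train") -> list:
    """Return the Parquet URLs listed for one config and split."""
    try:
        urls = listing[config][split]
    except KeyError as exc:
        raise KeyError(
            f"no Parquet files for config={config!r}, split={split!r}"
        ) from exc
    return list(urls)


# Made-up listing with the same nesting as r.json() in the example above
sample_listing = {
    "default": {
        "train": [
            "https://example.com/default/train/0.parquet",
            "https://example.com/default/train/1.parquet",
        ],
        "test": ["https://example.com/default/test/0.parquet"],
    }
}

train_urls = select_parquet_urls(sample_listing)
print(len(train_urls))  # 2
```

In the loop above, `train_parquet_files` could be replaced by `select_parquet_urls(r.json())` to get the same list of URLs.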

Once the data is loaded into a PySpark DataFrame, you can perform a variety of operations to explore and analyze it:

print(f"Shape of the dataset: {df.count()}, {len(df.columns)}")

# Display first 10 rows
df.show(n=10)

# Get a statistical summary of the data
df.describe().show()

# Print the schema of the DataFrame
df.printSchema()
