Commit 0abb79f6 authored by Jonathan Mace's avatar Jonathan Mace
Browse files


parent 635fbcbc
......@@ -4,15 +4,27 @@ To compile, invoke
mvn clean package
For convenience, set TPCDS_WORKLOAD_GEN to the directory where this git repository is checked out, eg:
export TPCDS_WORKLOAD_GEN=~/tpcds
To generate data with spark
bin/spark-submit --master yarn --class ~/tpcds/tpcds-workload-gen/target/spark-workloadgen-4.0-jar-with-dependencies.jar
bin/spark-submit --class ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-4.0-jar-with-dependencies.jar
To run:
bin/spark-submit --master yarn --class ~/tpcds/tpcds-workload-gen/target/spark-workloadgen-4.0-jar-with-dependencies.jar
bin/spark-submit --class ${TPCDS_WORKLOAD_GEN}/target/spark-workloadgen-4.0-jar-with-dependencies.jar
To configure the TPC-DS data set, there are a variety of configuration options. Most of these are inherited from Databricks spark-sql-perf, which we use to generate the TPC-DS data.
The options of interest are as follows:
Benchmark config:
- scaleFactor specifies the dataset size. A scale factor of n generates approximately n GB of data. Most data formats compress this quite effectively, so on disk the data will appear smaller (eg, Parque or Orc can compress by a factor of approximately 4).
- dataLocation specifics the location of the dataset. Typically this will be in HDFS, and you can specify HDFS file locations as normal (eg, hdfs://<hostname>:<port>/<path>)
- dataFormat specifies the format to store the data. "parquet" and "orc" are good choices with high compression; "text" is also supported.
The full (default) configuration options are as follows:
tpcds {
scaleFactor = 1
......@@ -23,4 +35,10 @@ Benchmark config:
useDoubleForDecimal = false
clusterByPartitionColumns = false
filterOutNullPartitionValues = false
\ No newline at end of file
We have provided a couple of useful command line utilities, which are generated into the folder `target/appassembler/bin`:
- list-queries lists the available queries. It takes zero or one arguments; with zero arguments, it lists the available benchmarks; with 1 argument, it either lists a benchmark, or prints a query. Queries are broken down into benchmarks. Since multiple people have implemented variants of the original TPC-DS queries, we have included multiple of these variants here. The impala-tpcds-modified-queries are a set of 20 selected queries that several work has used for benchmarking previously with Spark.
- dsdgen is a wrapper around the dsdgen utility that TPC provides. This package comes with precompiled dsdgen binaries for Linux and Mac, which we use for data generation.
\ No newline at end of file
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment