ODPS is the former name of the service now known as MaxCompute. This document uses MaxCompute for consistency, but some technical identifiers and configuration keys still use ODPS.
Spark on MaxCompute is a computing service provided by Alibaba Cloud. It is compatible with open-source Spark and provides a Spark computing framework on top of unified computing resources and a dataset permission system, which allows you to submit and run Spark jobs in your preferred development style. Spark on MaxCompute can fulfill diverse data processing and analysis needs.
This repo implements some common Spark operations, such as RDD, DataFrame, Spark SQL, and MLlib, as well as some MaxCompute-specific integrations: the big data platform itself (comparable to BigQuery or Redshift), object storage, and a unified orchestrator (similar in spirit to Airflow).
To develop Spark on MaxCompute projects, a local development environment needs to be set up. The easiest way to do this is via Alibaba Cloud's IntelliJ IDEA plugin. However, those without IntelliJ IDEA can still set up their own development environment, as the following sections show.
macOS or Linux users can add the following variables to the ~/.bash_profile file:
export SPARK_HOME=/path/to/extracted/spark/client/from/step/1/above
export PATH=$SPARK_HOME/bin:$PATH
Make sure to run source ~/.bash_profile from your terminal to load the environment variables.
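The two export lines simply prepend the client's bin directory to the shell's lookup path. A self-contained sketch (using a scratch directory as a stand-in for the real extracted client) shows the effect:

```shell
# Scratch directory standing in for the extracted Spark client
# (illustration only; use your real extraction path in practice).
SPARK_HOME="$(mktemp -d)"
mkdir -p "$SPARK_HOME/bin"
printf '#!/bin/sh\necho spark-submit stub\n' > "$SPARK_HOME/bin/spark-submit"
chmod +x "$SPARK_HOME/bin/spark-submit"

# Same PATH prepend as in ~/.bash_profile.
export PATH="$SPARK_HOME/bin:$PATH"
command -v spark-submit   # now resolves inside $SPARK_HOME/bin
```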
Go to the $SPARK_HOME/conf directory. There, a spark-defaults.conf.template file can be found. Copy this file and rename the copy to spark-defaults.conf. The Spark on MaxCompute client is configured in this file.
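The copy step looks like this (shown in a scratch directory for illustration; in a real setup run it inside $SPARK_HOME/conf, where the template already exists):

```shell
cd "$(mktemp -d)"                    # stand-in for $SPARK_HOME/conf
touch spark-defaults.conf.template   # ships with the real client
cp spark-defaults.conf.template spark-defaults.conf
ls spark-defaults.conf               # the file the client will read
```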
# spark-defaults.conf
# Enter the MaxCompute project name and account information.
spark.hadoop.odps.project.name = XXX # maxcompute project name
spark.hadoop.odps.access.id = XXX # alibaba cloud account access id
spark.hadoop.odps.access.key = XXX # alibaba cloud account access key
# Retain the following default settings.
spark.hadoop.odps.end.point = http://service.cn.maxcompute.aliyun.com/api # Find the correct endpoint for your maxcompute project region from: https://www.alibabacloud.com/help/doc-detail/34951.htm
spark.hadoop.odps.runtime.end.point = http://service.cn.maxcompute.aliyun-inc.com/api # Generally same as above
spark.sql.catalogImplementation=odps
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.cupid.webproxy.endpoint = http://service.cn.maxcompute.aliyun-inc.com/api
spark.hadoop.odps.moye.trackurl.host = http://jobview.odps.aliyun.com
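The file format is simple: one property per line, with the key separated from the value by whitespace or =, and # starting a comment line (the inline # notes in the block above are annotations for this document, not part of real values). A hypothetical Python helper, not part of the client, makes the layout explicit:

```python
import re

def parse_defaults(text):
    """Parse spark-defaults.conf-style text into a dict,
    skipping blank lines and full-line # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split the key from the value at the first run of '=' or whitespace.
        key, value = re.split(r"[=\s]+", line, maxsplit=1)
        props[key] = value
    return props

sample = """
# account information
spark.hadoop.odps.project.name = my_project
spark.sql.catalogImplementation=odps
"""
print(parse_defaults(sample))
# → {'spark.hadoop.odps.project.name': 'my_project', 'spark.sql.catalogImplementation': 'odps'}
```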
For some functions, additional configuration might be needed. Refer to the official Spark on MaxCompute configuration documentation for more detail.
The simplest way to get the code is to clone this repo:
git clone https://github.com/iahsanujunda/maxcompute-spark.git
from the terminal.
Thanks to Maven, the whole build process is streamlined. Simply run:
mvn clean package
This will resolve all dependencies, run the tests, and package an executable .jar file.
For running jobs from the local development environment, MaxCompute provides two running modes: local mode and cluster mode.
In this mode, the Spark on MaxCompute client runs on the local machine but uses the Tunnel service to read and write data in MaxCompute. Take note of the local[N] part: N indicates the number of CPU cores used by the client.
To execute, run:
$SPARK_HOME/bin/spark-submit --master local[4] \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar
In cluster mode, the Spark program runs on the MaxCompute clusters. Note that this means resource files need to be uploaded to the MaxCompute clusters, so this mode might take longer to execute than local mode, depending on the internet connection. However, this mode reflects the actual environment the code will face in production.
To execute, run:
$SPARK_HOME/bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
${path to project directory}/target/maxcompute-spark-1.0-SNAPSHOT.jar
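The two invocations above differ only in the --master flag. A hypothetical wrapper script (the name, argument convention, and echo-only behavior are assumptions for illustration) makes the switch explicit:

```shell
# submit.sh (hypothetical): pick the master by argument; echoes the
# command instead of executing it so the choice is easy to inspect.
MODE="${1:-local}"
if [ "$MODE" = "cluster" ]; then
  MASTER="yarn-cluster"
else
  MASTER="local[4]"
fi
echo "spark-submit --master $MASTER" \
     "--class com.aliyun.odps.spark.examples.SparkPi" \
     "target/maxcompute-spark-1.0-SNAPSHOT.jar"
```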