Running spark lama app on Spark YARN
Next, it will be shown how to run the
examples/spark/tabular-preset-automl.py script for execution on local Hadoop YARN.
Local deployment of Hadoop YARN is done using the docker-hadoop project from the https://github.com/big-data-europe/docker-hadoop repository. It consists of the following services: datanode, historyserver, namenode, nodemanager, resourcemanager. The files
docker-hadoop/docker-compose.yml have been modified and a description of the new service
docker-hadoop/spark-submit has been added. Required tools to get started to work with docker-hadoop project: Docker, Docker Compose and GNU Make.
1. First, let’s go to the LightAutoML project directory
Make sure that in the
dist directory there is a wheel assembly and in the
jars directory there is a jar file.
dist directory does not exist, or if there are no files in it, then you need to build lama dist files.
If there are no jar file(s) in
jars directory, then you need to build lama jar file(s).
2. Distribute lama wheel to nodemanager
Copy lama wheel file from
We copy the lama wheel assembly to the nodemanager Docker file, because later it will be needed in the nodemanager service to execute the pipelines that we will send to spark.
cp dist/LightAutoML-0.3.0-py3-none-any.whl docker-hadoop/nodemanager/LightAutoML-0.3.0-py3-none-any.whl
3. Go to
docker-compose.yml file and configure services.
volumes setting to mount directory with datasets to
hadoop.env file and configure hadoop settings.
Pay attention to the highlighted settings. They need to be set in accordance with the resources of your computers.
6. Build image for
The following command will build the
nodemanager image according to
docker-hadoop/nodemanager/Dockerfile. Python 3.9 and the installation of the lama wheel package has been added to this Dockerfile.
7. Build image for
spark-submit container will be used to submit our applications for execution.
8. Start Hadoop YARN services
or same in detached mode:
docker-compose up -d
Check that all services have started:
resourcemanager is services of Hadoop.
datanode is parts of HDFS.
historyserver is parts of YARN. For more information see the documentation at https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html and https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.
spark-submit is service to submitting our applications to Hadoop YARN for execution (see step 9).
If one of the services did not up, then you need to look at its logs. For example
docker-compose logs -f resourcemanager
9. Send job to cluster via
docker exec -ti spark-submit bash -c "./bin/slamactl.sh submit-job-yarn dist/LightAutoML-0.3.0.tar.gz,examples/spark/examples_utils.py examples/spark/tabular-preset-automl.py"
10. Monitoring application execution
To monitor application execution, you can use the hadoop web interface (http://localhost:8088), which displays the status of the application, resources and application logs.
Let’s see the information about the application and its logs.
11. Spark WebUI
When the application is running, you can go to the hadoop web interface and get a link to the Spark WebUI.
12. HDFS Web UI
HDFS Web UI is available at http://localhost:9870. Here you can browse your files in HDFS http://localhost:9870/explorer.html. HDFS stores trained pipelines and Spark application files.