Installing Spark Locally
Docker build
- Java

```bash
sudo apt install openjdk-17-jdk
```

- Scala / Spark: install, then set the environment variables

Installation (the `closer.lua` mirror link returns an HTML mirror page, so download from the archive directly; note also that it is the extracted directory, not the `.tgz`, that moves to `/opt/spark`):

```bash
wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3-scala2.13.tgz
tar xvf spark-3.4.2-bin-hadoop3-scala2.13.tgz
sudo mv ./spark-3.4.2-bin-hadoop3-scala2.13 /opt/spark
```

Environment variables

```bash
vim ~/.bashrc
# add the two lines below
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
# then reload
source ~/.bashrc
```
Build the image from the Dockerfile shipped with the distribution (run from `$SPARK_HOME`):

```bash
cd $SPARK_HOME
docker build -t spark:3.4.2 -f kubernetes/dockerfiles/spark/Dockerfile .
```
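As a quick sanity check that the tarball install and the PATH wiring work, a minimal sketch (both commands ship with the Spark distribution):

```bash
# Should print the Spark 3.4.2 / Scala 2.13 version banner
spark-submit --version

# Optional: an interactive shell against an in-process local master
spark-shell --master "local[2]"
```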
Custom Dockerfile
First attempt: I pulled the image referenced in the official Spark docs and built on top of it.
docker-compose.yaml

```yaml
version: "3.5"
services:
  spark-master:
    image: spark:3.4.2
    ports:
      - "9090:8080"
      - "7077:7077"
    volumes:
      - ./jars:/opt/jars
      - ./data:/opt/data
    environment:
      - SPARK_MASTER_HOST="spark://spark-master:7077"
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_PUBLIC_DNS={MASTER_IP}
      - SPARK_LOCAL_IP=172.16.11.59
      - SPARK_LOG_DIR=/opt/spark/logs
```
spark-defaults.conf

```
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
```
spark-env.sh

```sh
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_LOG_DIR=/opt/spark/logs
SPARK_WORKER_WEBUI_PORT=8080
SPARK_WORKER_PORT=7000
SPARK_MASTER_HOST="spark://spark-master:7077"
```

`SPARK_LOCAL_IP` sets the IP address Spark binds to on this node; `SPARK_PUBLIC_DNS` sets the public DNS name of the driver program.
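Both files live in `$SPARK_HOME/conf` and are created from the templates shipped with the distribution; a minimal sketch (the template names are the stock Spark ones):

```bash
cd $SPARK_HOME/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
```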
> [!fail]
> This Compose file has no spark-worker, and it sets environment variables that aren't actually needed. Once the values are set in the Compose `environment`, there is no need to also configure spark-defaults.conf or spark-env.sh.
# ❌ Standalone
```bash
./sbin/start-worker.sh spark://bami-cluster2.novalocal:7077
```

`start-worker.sh` takes the master's `spark://` RPC URL (spark://bami-cluster2.novalocal:7077), not the web UI address (133.186.217.113:8080).

Reference: https://taaewoo.tistory.com/18?category=887744

Connecting works, but this isn't the setup I wanted.
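For reference, the complete standalone flow with the scripts in `$SPARK_HOME/sbin` (a sketch; the master hostname is the one from above):

```bash
# On the master node; the web UI comes up on port 8080
$SPARK_HOME/sbin/start-master.sh

# On each worker node, registering against the master's RPC endpoint
$SPARK_HOME/sbin/start-worker.sh spark://bami-cluster2.novalocal:7077

# Tear down in reverse order
$SPARK_HOME/sbin/stop-worker.sh
$SPARK_HOME/sbin/stop-master.sh
```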
docker compose by bitnami
```yaml
# Copyright VMware, Inc.
# SPDX-License-Identifier: APACHE-2.0
version: '2'
services:
  spark:
    image: docker.io/bitnami/spark:3.4.2
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.4.2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
```
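Bringing the cluster up and adding workers, as a sketch (`--scale` is standard Compose behavior; each extra worker registers itself at spark://spark:7077):

```bash
# Master plus one worker in the background
docker compose up -d

# Add capacity: run three worker replicas
docker compose up -d --scale spark-worker=3

# Master web UI: http://localhost:8080
```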
Final Dockerfile
To use Kafka, do I need to build a separate image? I wasn't sure, but I ended up baking the Kafka JARs into one:
```dockerfile
FROM bitnami/spark:3.4.2

USER root

# Install necessary packages
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        vim \
        curl && \
    rm -rf /var/lib/apt/lists/*

USER 1001

# Download Kafka client JAR
RUN curl -o /opt/bitnami/spark/jars/kafka-clients-3.4.1.jar https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.4.1/kafka-clients-3.4.1.jar

# Download Spark Kafka connector JAR
RUN curl -o /opt/bitnami/spark/jars/spark-token-provider-kafka-0-10_2.13-3.4.2.jar https://repo1.maven.org/maven2/org/apache/spark/spark-token-provider-kafka-0-10_2.13/3.4.2/spark-token-provider-kafka-0-10_2.13-3.4.2.jar
```
```bash
docker build --no-cache -t seunghyejeong/spark:1.0 .
docker push seunghyejeong/spark:1.0
```
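A quick check that the Kafka JARs actually landed in the image, as a sketch (`/opt/bitnami/spark/jars` is the path used in the Dockerfile above):

```bash
docker run --rm seunghyejeong/spark:1.0 ls /opt/bitnami/spark/jars | grep -i kafka
```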
docker-compose.yaml

```yaml
# Copyright VMware, Inc.
# SPDX-License-Identifier: APACHE-2.0
version: '2'
services:
  spark:
    image: seunghyejeong/spark:1.0
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker:
    image: seunghyejeong/spark:1.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
```
Key values used below:

- Kafka bootstrap server: 133.186.217.113:19092
- Spark master URL: spark://spark:7077
- JARs involved: spark-sql_2.12-3.4.2.jar, spark-sql-kafka-0-10_2.13-3.4.2.jar (note: this image is the Scala 2.13 build, and `_2.12` artifacts don't load alongside `_2.13` ones, so the `_2.13` variants are the ones to use)
```bash
./spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.4.2 consume_topic.py
```
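Since the job targets the Compose cluster, the submit can also be run inside the master container; a sketch assuming the service name `spark` from the Compose file and the bitnami binary path:

```bash
# consume_topic.py must exist inside the container (e.g. via a volume mount)
docker compose exec spark /opt/bitnami/spark/bin/spark-submit \
  --master spark://spark:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.4.2 \
  consume_topic.py
```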
```python
# consume_topic.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

# Read stream: subscribe to the Kafka topic from the earliest offset
log = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "133.186.217.113:19092") \
    .option("subscribe", "test1") \
    .option("startingOffsets", "earliest") \
    .load()

# Write stream - console (Kafka values arrive as bytes, so cast to string)
query = log.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Write stream - Parquet files under /test (HDFS when that is the default FS)
query2 = log.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("parquet") \
    .outputMode("append") \
    .option("checkpointLocation", "/check") \
    .option("path", "/test") \
    .start()

# Both queries are already running; block the driver until they stop
query.awaitTermination()
query2.awaitTermination()
```
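To see records flow through both sinks, push a few test messages into the topic; a sketch with the Kafka console producer (the `/opt/kafka` path is a hypothetical install location; the broker address is the one above):

```bash
# Each line typed becomes one record on topic test1
/opt/kafka/bin/kafka-console-producer.sh \
  --bootstrap-server 133.186.217.113:19092 \
  --topic test1
```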