# Apache Spark

> Install the Ryft Spark plugin to capture real-time execution plans and job history, for app visibility and usage-based optimization.

The Ryft Spark plugin listens to Spark events in real time. These events describe the execution of Spark jobs in detail, and the plugin writes them as logs to a dedicated S3 bucket.

### Spark Plugin Configuration

1. Create an S3 bucket in your account to store Spark event logs, or contact your Ryft representative if you prefer to use a Ryft-managed bucket. We recommend a retention policy of at least 7 days.

   Note: Verify that your Spark execution role has sufficient permissions to write to the chosen bucket.

2. Add the Spark plugin dependency to the Spark application. How you do this depends on the deployment.

   Spark 3.5 (Scala 2.13):

   ```bash spark-submit {2}
   spark-submit \
     --packages io.ryft:spark-plugin-3.5_2.13:0.3.6
   ```

   ```json Spark EMR {6}
   "configurationOverrides": {
     "applicationConfiguration": [
       {
         "classification": "spark-defaults",
         "properties": {
           "spark.jars.packages": "io.ryft:spark-plugin-3.5_2.13:0.3.6"
         }
       }
     ]
   }
   ```

   ```bash AWS Glue Spark Job {3}
   aws glue start-job-run \
     --job-name \
     --arguments '{"--extra-jars":"s3:///jars/spark-plugin-3.5_2.13-0.3.6.jar"}'
   ```

   Spark 3.3 (Scala 2.12):

   ```bash spark-submit {2}
   spark-submit \
     --packages io.ryft:spark-plugin-3.3_2.12:0.3.6
   ```

   ```json Spark EMR {6}
   "configurationOverrides": {
     "applicationConfiguration": [
       {
         "classification": "spark-defaults",
         "properties": {
           "spark.jars.packages": "io.ryft:spark-plugin-3.3_2.12:0.3.6"
         }
       }
     ]
   }
   ```

   ```bash AWS Glue Spark Job {3}
   aws glue start-job-run \
     --job-name \
     --arguments '{"--extra-jars":"s3:///jars/spark-plugin-3.3_2.12-0.3.6.jar"}'
   ```

3. Register the Ryft plugin and set the `spark.eventLog.ryft.dir` config to the bucket you chose in step 1.

   Spark 3.5 (Scala 2.13):

   ```bash spark-submit {3-4}
   spark-submit \
     --packages io.ryft:spark-plugin-3.5_2.13:0.3.6 \
     --conf spark.eventLog.ryft.dir=s3:// \
     --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin
   ```

   ```json Spark EMR {7-8}
   "configurationOverrides": {
     "applicationConfiguration": [
       {
         "classification": "spark-defaults",
         "properties": {
           "spark.jars.packages": "io.ryft:spark-plugin-3.5_2.13:0.3.6",
           "spark.eventLog.ryft.dir": "s3:///",
           "spark.plugins": "io.ryft.spark.RyftSparkEventLogPlugin"
         }
       }
     ]
   }
   ```

   ```bash AWS Glue Spark Job {3}
   aws glue start-job-run \
     --job-name \
     --arguments '{"--extra-jars":"s3:///jars/spark-plugin-3.5_2.13-0.3.6.jar","--conf":"spark.eventLog.ryft.dir=s3:///"}'
   ```

   Spark 3.3 (Scala 2.12):

   ```bash spark-submit {3-4}
   spark-submit \
     --packages io.ryft:spark-plugin-3.3_2.12:0.3.6 \
     --conf spark.eventLog.ryft.dir=s3:// \
     --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin
   ```

   ```json Spark EMR {7-8}
   "configurationOverrides": {
     "applicationConfiguration": [
       {
         "classification": "spark-defaults",
         "properties": {
           "spark.jars.packages": "io.ryft:spark-plugin-3.3_2.12:0.3.6",
           "spark.eventLog.ryft.dir": "s3:///",
           "spark.plugins": "io.ryft.spark.RyftSparkEventLogPlugin"
         }
       }
     ]
   }
   ```

   ```bash AWS Glue Spark Job {3}
   aws glue start-job-run \
     --job-name \
     --arguments '{"--extra-jars":"s3:///jars/spark-plugin-3.3_2.12-0.3.6.jar","--conf":"spark.eventLog.ryft.dir=s3:///"}'
   ```

## AWS Glue Spark Job Configuration

### Adding the Plugin to Your Spark Session

Configure your Spark session with the Ryft plugin by adding the following configuration:

```python AWS Glue Spark Job
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Register the Ryft plugin and point it at the event log bucket
spark = SparkSession.builder \
    .config("spark.plugins", "io.ryft.spark.RyftSparkEventLogPlugin") \
    .config("spark.eventLog.ryft.dir", "s3://") \
    .getOrCreate()

job = Job(GlueContext(spark.sparkContext))
job.init(args['JOB_NAME'], args)
```

Glue jobs support only a single SparkSession, so make sure only one is initialized. Initializing more than one SparkSession can prevent the plugin from being loaded.

### Uploading the Plugin JAR to S3

AWS Glue jobs require the plugin JAR to be available in S3. You can upload it directly from Maven Central using this command:

Spark 3.5 (Scala 2.13):

```bash Upload JAR to S3
curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/0.3.6/spark-plugin-3.5_2.13-0.3.6.jar | \
  aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.5_2.13-0.3.6.jar
```

Spark 3.3 (Scala 2.12):

```bash Upload JAR to S3
curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.3_2.12/0.3.6/spark-plugin-3.3_2.12-0.3.6.jar | \
  aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.3_2.12-0.3.6.jar
```

Replace `YOUR_BUCKET_NAME` with your actual bucket name. Ensure your Glue job has the necessary IAM permissions to read from this S3 location.

### Configuring the Extra JARs Parameter

The Glue job needs to include the plugin JAR using the `--extra-jars` parameter. This can be configured in several ways:

In the AWS Glue console:

1. Navigate to **AWS Glue** → **Jobs** → Select your job
2. Go to the **Job details** tab
3. Scroll to **Advanced properties**
4. In the **Job parameters** section, add:
   * **Key:** `--extra-jars`
   * **Value:** `s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar` for Spark 3.5, or `s3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar` for Spark 3.3

Start a job run with the extra JARs parameter:

Spark 3.5 (Scala 2.13):

```bash Start Job Run
aws glue start-job-run \
  --job-name \
  --arguments '{"--extra-jars":"s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar"}'
```

Spark 3.3 (Scala 2.12):

```bash Start Job Run
aws glue start-job-run \
  --job-name \
  --arguments '{"--extra-jars":"s3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar"}'
```

When creating a job programmatically:

Spark 3.5 (Scala 2.13):

```python Create Job with boto3
import boto3

glue_client = boto3.client('glue')

response = glue_client.create_job(
    Name='your-job-name',
    Role='your-glue-service-role',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/your-script.py'
    },
    DefaultArguments={
        '--extra-jars': 's3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar'
    }
)
```

Spark 3.3 (Scala 2.12):

```python Create Job with boto3
import boto3

glue_client = boto3.client('glue')

response = glue_client.create_job(
    Name='your-job-name',
    Role='your-glue-service-role',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-bucket/scripts/your-script.py'
    },
    DefaultArguments={
        '--extra-jars': 's3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar'
    }
)
```

Another way to include the plugin in Java or Scala Spark applications is to package it directly into your application's uberjar. This embeds the plugin as a dependency, so you no longer need to reference the plugin JAR separately at runtime.
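Once an uberjar has been built, a quick sanity check is to confirm that the plugin class actually ended up inside the archive. The sketch below is an illustration and not part of the Ryft tooling: it assumes the class is packaged under its original name `io.ryft.spark.RyftSparkEventLogPlugin` (that is, the shade configuration does not relocate it), and `jar_contains_class` is a hypothetical helper name.

```python
import zipfile


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the JAR (a zip archive) holds the compiled class.

    A fully qualified class name such as a.b.C is stored in a JAR
    as the entry a/b/C.class.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_uberjar.py target/your-app.jar
    if len(sys.argv) > 1:
        found = jar_contains_class(
            sys.argv[1], "io.ryft.spark.RyftSparkEventLogPlugin"
        )
        print("plugin class found" if found else "plugin class missing")
```

If the check reports the class as missing, the shade configuration is the first place to look.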
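## Add the plugin dependency

Add this to your `pom.xml`:

Spark 3.5 (Scala 2.13):

```xml pom.xml
<dependency>
  <groupId>io.ryft</groupId>
  <artifactId>spark-plugin-3.5_2.13</artifactId>
  <version>0.3.6</version>
  <scope>compile</scope>
</dependency>
```

Spark 3.3 (Scala 2.12):

```xml pom.xml
<dependency>
  <groupId>io.ryft</groupId>
  <artifactId>spark-plugin-3.3_2.12</artifactId>
  <version>0.3.6</version>
  <scope>compile</scope>
</dependency>
```

## Include in your uberjar

Configure the Maven Shade plugin to include the dependency:

Spark 3.5 (Scala 2.13):

```xml pom.xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.6.0</version>
  <configuration>
    <artifactSet>
      <includes>
        <include>io.ryft:spark-plugin-3.5_2.13</include>
      </includes>
    </artifactSet>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

Spark 3.3 (Scala 2.12):

```xml pom.xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.6.0</version>
  <configuration>
    <artifactSet>
      <includes>
        <include>io.ryft:spark-plugin-3.3_2.12</include>
      </includes>
    </artifactSet>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

This approach eliminates the need to configure external JARs in your Spark setup, which simplifies application bootstrap and gives you better control over version conflicts. However, it also means you'll need to rebuild your application whenever the plugin is updated.

## Supported Artifacts and Compatibility

## Choosing the Right Plugin Version

We publish the plugin for two Spark versions (3.3 and 3.5), each built for Scala 2.12 and 2.13. Use the table below to select the artifact that matches your environment.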
| Artifact                | Java Version | Spark Version | Scala Version | Iceberg Version |
| ----------------------- | ------------ | ------------- | ------------- | --------------- |
| `spark-plugin-3.3_2.12` | Java 8+      | Spark 3.3     | 2.12          | 1.2.0+          |
| `spark-plugin-3.3_2.13` | Java 8+      | Spark 3.3     | 2.13          | 1.2.0+          |
| `spark-plugin-3.5_2.12` | Java 17+     | Spark 3.5     | 2.12          | 1.7.1+          |
| `spark-plugin-3.5_2.13` | Java 17+     | Spark 3.5     | 2.13          | 1.7.1+          |

***

### 📌 Notes

**Java Compatibility**

* `3.3_x` plugins are compiled for Java 8 and run on Java 8+
* `3.5_x` plugins require Java 17+

**Iceberg Compatibility**

* Minimum supported version is **Iceberg 1.2.0** for `3.3_x` plugins and **Iceberg 1.7.1** for `3.5_x` plugins
* ⚠️ Using older versions is unsupported
* ✅ For best results, use the latest Iceberg version officially supported by your Spark distribution

**Scala Versions**

* Each plugin is published for **Scala 2.12 and 2.13** - match your Spark distribution’s Scala version

***

### ✅ Recommended Usage

* Use **`spark-plugin-3.3_x`** with **Spark 3.3** and **Iceberg 1.2.0+**
* Use **`spark-plugin-3.5_x`** with **Spark 3.5+** - this is the **preferred and actively maintained** version

### IAM Permissions

Ensure your Glue job's IAM role has permissions to access the S3 bucket containing the JAR file:

```json IAM Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket/jars/*"
    }
  ]
}
```

If you are using a Ryft-managed bucket, skip the remaining steps.

### Set up S3 to SQS notifications

1. Create a new SQS queue that will receive notifications about new files created in your S3 bucket.
2. Add the following policy to the queue access policy to enable receiving notifications:

   ```json
   {
     "Version": "2012-10-17",
     "Id": "S3Notifications",
     "Statement": [
       {
         "Sid": "S3Notifications-statement",
         "Effect": "Allow",
         "Principal": {
           "Service": "s3.amazonaws.com"
         },
         "Action": [
           "SQS:SendMessage"
         ],
         "Resource": "",
         "Condition": {
           "ArnLike": {
             "aws:SourceArn": "arn:aws:s3:*:*:"
           },
           "StringEquals": {
             "aws:SourceAccount": ""
           }
         }
       }
     ]
   }
   ```

3. [Configure S3 notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html) for new files created (choose "All object create events") in the event logs bucket, so that they are sent to the newly created SQS queue.

### **Add Ryft access policy to S3 and SQS**

Add the following access policy to the **Ryft-ControlPlaneRole** you already created. It allows Ryft to read the event logs from the bucket and the notifications from the queue.

1. IAM → Roles → Search for "**Ryft-ControlPlaneRole**" (or the name you used)
2. Add permissions → Create inline policy → Select the **JSON** tab
3. Add the following policy, filling in the bucket and queue parameters:

   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowSparkEventLogAccess",
         "Effect": "Allow",
         "Action": [
           "s3:ListBucket",
           "s3:GetObject"
         ],
         "Resource": [
           "arn:aws:s3:::/test/*",
           "arn:aws:s3:::"
         ]
       },
       {
         "Sid": "SparkEventLogsSqsAccess",
         "Effect": "Allow",
         "Action": [
           "sqs:ReceiveMessage",
           "sqs:DeleteMessage"
         ],
         "Resource": ""
       }
     ]
   }
   ```

You are done! Locate the URL of the queue you just created and provide it to Ryft. We will then finish setting up the integration. The URL should look similar to: `https://sqs.us-east-1.amazonaws.com//`
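
The queue policy and the Ryft role policy in this section share a few inputs: the bucket name, the queue ARN, and the account ID. As a rough sketch, both documents can be rendered from those values so the placeholders are filled in consistently. The function names, the `log_prefix` parameter, and every example value below are assumptions made for illustration; the statements themselves mirror the JSON shown in this section.

```python
import json


def queue_access_policy(queue_arn: str, bucket: str, account_id: str) -> dict:
    """Queue policy that lets S3 send object-created notifications to the queue."""
    return {
        "Version": "2012-10-17",
        "Id": "S3Notifications",
        "Statement": [
            {
                "Sid": "S3Notifications-statement",
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": ["SQS:SendMessage"],
                "Resource": queue_arn,
                "Condition": {
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{bucket}"},
                    "StringEquals": {"aws:SourceAccount": account_id},
                },
            }
        ],
    }


def ryft_role_policy(queue_arn: str, bucket: str, log_prefix: str) -> dict:
    """Inline policy for the Ryft role: read event logs, consume queue messages."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSparkEventLogAccess",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    # Objects under the log prefix, then the bucket itself
                    f"arn:aws:s3:::{bucket}/{log_prefix}*",
                    f"arn:aws:s3:::{bucket}",
                ],
            },
            {
                "Sid": "SparkEventLogsSqsAccess",
                "Effect": "Allow",
                "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
                "Resource": queue_arn,
            },
        ],
    }


if __name__ == "__main__":
    # All values here are made-up placeholders, not real resources.
    bucket = "my-spark-event-logs"
    account_id = "123456789012"
    queue_arn = f"arn:aws:sqs:us-east-1:{account_id}:spark-event-logs"

    print(json.dumps(queue_access_policy(queue_arn, bucket, account_id), indent=2))
    print(json.dumps(ryft_role_policy(queue_arn, bucket, "test/"), indent=2))
```

Passing the prefix as an argument keeps the object-level resource in line with wherever `spark.eventLog.ryft.dir` points; the policy shown above uses a `test/` prefix.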