# Apache Spark

> Install the Ryft Spark plugin to capture real-time execution plans and job history, for app visibility and usage-based optimization.

The Ryft Spark plugin listens to Spark events in real time and writes them as event logs to a dedicated S3 bucket. These events carry detailed information about the execution of your Spark jobs.

### Spark Plugin Configuration

1. Create an S3 bucket in your account that will store Spark event logs, or contact your Ryft representative if you prefer to use a Ryft-managed bucket.
   <Note>It's best to set a retention policy of at least 7 days.</Note>
   <Warning>Verify that your Spark execution role has sufficient permissions to write to the chosen bucket.</Warning>
2. Add the Spark plugin dependency to the Spark application. How you do this depends on the deployment:

<Tabs>
  <Tab title="Spark 3.5">
    <CodeGroup>
      ```bash spark-submit {2} theme={null}
      spark-submit \
          --packages io.ryft:spark-plugin-3.5_2.13:0.3.6
      ```

      ```json Spark EMR {6} theme={null}
      "configurationOverrides": {
        "applicationConfiguration": [
          {
            "classification": "spark-defaults",
            "properties": {
              "spark.jars.packages": "io.ryft:spark-plugin-3.5_2.13:0.3.6"
            }
          }
        ]
      }
      ```

      ```bash AWS Glue Spark Job {3} theme={null}
      aws glue start-job-run \
          --job-name <job> \
          --arguments '{"--extra-jars":"s3://<yourbucket>/jars/spark-plugin-3.5_2.13-0.3.6.jar"}'
      ```
    </CodeGroup>
  </Tab>

  <Tab title="Spark 3.3">
    <CodeGroup>
      ```bash spark-submit {2} theme={null}
      spark-submit \
          --packages io.ryft:spark-plugin-3.3_2.12:0.3.6
      ```

      ```json Spark EMR {6} theme={null}
      "configurationOverrides": {
        "applicationConfiguration": [
          {
            "classification": "spark-defaults",
            "properties": {
              "spark.jars.packages": "io.ryft:spark-plugin-3.3_2.12:0.3.6"
            }
          }
        ]
      }
      ```

      ```bash AWS Glue Spark Job {3} theme={null}
      aws glue start-job-run \
          --job-name <job> \
          --arguments '{"--extra-jars":"s3://<yourbucket>/jars/spark-plugin-3.3_2.12-0.3.6.jar"}'
      ```
    </CodeGroup>
  </Tab>
</Tabs>

3. Register the Ryft plugin and set the `spark.eventLog.ryft.dir` config to the bucket created in step 1:

<Tabs>
  <Tab title="Spark 3.5">
    <CodeGroup>
      ```bash spark-submit {3-4} theme={null}
      spark-submit \
          --packages io.ryft:spark-plugin-3.5_2.13:0.3.6 \
          --conf spark.eventLog.ryft.dir=s3://<yourbucket> \
          --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin <jar/py file>
      ```

      ```json Spark EMR {7-8} theme={null}
      "configurationOverrides": {
      "applicationConfiguration": [
          {
              "classification": "spark-defaults",
              "properties": {
                  "spark.jars.packages": "io.ryft:spark-plugin-3.5_2.13:0.3.6",
                  "spark.eventLog.ryft.dir": "s3://<target-bucket>/",
                  "spark.plugins": "io.ryft.spark.RyftSparkEventLogPlugin"
              }
          }
      ]
      }
      ```

      ```bash AWS Glue Spark Job {3} theme={null}
      aws glue start-job-run \
          --job-name <job> \
          --arguments '{"--extra-jars":"s3://<yourbucket>/jars/spark-plugin-3.5_2.13-0.3.6.jar","--conf":"spark.eventLog.ryft.dir=s3://<target-bucket>/"}'
      ```
    </CodeGroup>
  </Tab>

  <Tab title="Spark 3.3">
    <CodeGroup>
      ```bash spark-submit {3-4} theme={null}
      spark-submit \
          --packages io.ryft:spark-plugin-3.3_2.12:0.3.6 \
          --conf spark.eventLog.ryft.dir=s3://<yourbucket> \
          --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin <jar/py file>
      ```

      ```json Spark EMR {7-8} theme={null}
      "configurationOverrides": {
      "applicationConfiguration": [
          {
              "classification": "spark-defaults",
              "properties": {
                  "spark.jars.packages": "io.ryft:spark-plugin-3.3_2.12:0.3.6",
                  "spark.eventLog.ryft.dir": "s3://<target-bucket>/",
                  "spark.plugins": "io.ryft.spark.RyftSparkEventLogPlugin"
              }
          }
      ]
      }
      ```

      ```bash AWS Glue Spark Job {3} theme={null}
      aws glue start-job-run \
          --job-name <job> \
          --arguments '{"--extra-jars":"s3://<yourbucket>/jars/spark-plugin-3.3_2.12-0.3.6.jar","--conf":"spark.eventLog.ryft.dir=s3://<target-bucket>/"}'
      ```
    </CodeGroup>
  </Tab>
</Tabs>
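
The retention recommendation from step 1 can be expressed as an S3 lifecycle rule. The following is a minimal sketch, assuming you manage the bucket yourself and use boto3; the rule ID and function names are illustrative and not part of the Ryft plugin. Be aware that `put_bucket_lifecycle_configuration` replaces any lifecycle configuration already present on the bucket.

```python
from typing import Any, Dict


def build_retention_rule(days: int = 7, prefix: str = "") -> Dict[str, Any]:
    """Build an S3 lifecycle configuration that expires objects `days` days after creation."""
    if days < 7:
        raise ValueError("retention below 7 days is shorter than the recommended minimum")
    return {
        "Rules": [
            {
                "ID": f"expire-spark-event-logs-{days}d",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # an empty prefix covers the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_retention_rule(bucket: str, days: int = 7) -> None:
    """Apply the rule to `bucket`. Needs AWS credentials allowed to change lifecycle settings."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_retention_rule(days),
    )
```

Calling `apply_retention_rule("<yourbucket>", days=14)` would keep event logs for 14 days before S3 expires them.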

<Accordion title="⚙️ Configuring AWS Glue Jobs">
  ## AWS Glue Spark Job Configuration

  ### Adding the Plugin to Your Spark Session

  Configure your Spark session with the Ryft plugin by adding the following configuration:

  <Tabs>
    <Tab title="Spark 3.5">
      <CodeGroup>
        ```python AWS Glue Spark Job theme={null}
        import sys
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from pyspark.sql import SparkSession

        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        spark = SparkSession.builder \
            .config("spark.plugins", "io.ryft.spark.RyftSparkEventLogPlugin") \
            .config("spark.eventLog.ryft.dir", "s3://<your-bucket>") \
            .getOrCreate()

        job = Job(GlueContext(spark.sparkContext))
        job.init(args['JOB_NAME'], args)
        ```
      </CodeGroup>
    </Tab>

    <Tab title="Spark 3.3">
      <CodeGroup>
        ```python AWS Glue Spark Job theme={null}
        import sys
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from pyspark.sql import SparkSession

        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        spark = SparkSession.builder \
            .config("spark.plugins", "io.ryft.spark.RyftSparkEventLogPlugin") \
            .config("spark.eventLog.ryft.dir", "s3://<your-bucket>") \
            .getOrCreate()

        job = Job(GlueContext(spark.sparkContext))
        job.init(args['JOB_NAME'], args)
        ```
      </CodeGroup>
    </Tab>
  </Tabs>

  <Note>
    Glue jobs only support a single SparkSession - make sure only one is initialized. Initializing more than one SparkSession can prevent the plugin from being loaded.
  </Note>

  ### Uploading the Plugin JAR to S3

  AWS Glue jobs require the plugin JAR to be available in S3. You can upload it directly from Maven Central using this command:

  <Tabs>
    <Tab title="Spark 3.5">
      <CodeGroup>
        ```bash Upload JAR to S3 theme={null}
        curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/0.3.6/spark-plugin-3.5_2.13-0.3.6.jar | \
          aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.5_2.13-0.3.6.jar
        ```
      </CodeGroup>
    </Tab>

    <Tab title="Spark 3.3">
      <CodeGroup>
        ```bash Upload JAR to S3 theme={null}
        curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.3_2.12/0.3.6/spark-plugin-3.3_2.12-0.3.6.jar | \
          aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.3_2.12-0.3.6.jar
        ```
      </CodeGroup>
    </Tab>
  </Tabs>

  <Note>
    Replace `YOUR_BUCKET_NAME` with your actual bucket name. Ensure your Glue job has the necessary IAM permissions to read from this S3 location.
  </Note>

  ### Configuring the Extra JARs Parameter

  The Glue job needs to include the plugin JAR using the `--extra-jars` parameter. This can be configured in several ways:

  <Tabs>
    <Tab title="AWS Glue Console">
      <Tabs>
        <Tab title="Spark 3.5">
          1. Navigate to **AWS Glue** → **Jobs** → Select your job
          2. Go to the **Job details** tab
          3. Scroll to **Advanced properties**
          4. In the **Job parameters** section, add:
             * **Key:** `--extra-jars`
             * **Value:** `s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar`
        </Tab>

        <Tab title="Spark 3.3">
          1. Navigate to **AWS Glue** → **Jobs** → Select your job
          2. Go to the **Job details** tab
          3. Scroll to **Advanced properties**
          4. In the **Job parameters** section, add:
             * **Key:** `--extra-jars`
             * **Value:** `s3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar`
        </Tab>
      </Tabs>
    </Tab>

    <Tab title="AWS CLI">
      <Tabs>
        <Tab title="Spark 3.5">
          Start a job run with the extra JARs parameter:

          <CodeGroup>
            ```bash Start Job Run theme={null}
            aws glue start-job-run \
              --job-name <job-name> \
              --arguments '{"--extra-jars":"s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar"}'
            ```
          </CodeGroup>
        </Tab>

        <Tab title="Spark 3.3">
          Start a job run with the extra JARs parameter:

          <CodeGroup>
            ```bash Start Job Run theme={null}
            aws glue start-job-run \
              --job-name <job-name> \
              --arguments '{"--extra-jars":"s3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar"}'
            ```
          </CodeGroup>
        </Tab>
      </Tabs>
    </Tab>

    <Tab title="Programmatic">
      <Tabs>
        <Tab title="Spark 3.5">
          When creating a job programmatically:

          <CodeGroup>
            ```python Create Job with boto3 theme={null}
            import boto3

            glue_client = boto3.client('glue')

            response = glue_client.create_job(
                Name='your-job-name',
                Role='your-glue-service-role',
                Command={
                    'Name': 'glueetl',
                    'ScriptLocation': 's3://your-bucket/scripts/your-script.py'
                },
                DefaultArguments={
                    '--extra-jars': 's3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar'
                }
            )
            ```
          </CodeGroup>
        </Tab>

        <Tab title="Spark 3.3">
          When creating a job programmatically:

          <CodeGroup>
            ```python Create Job with boto3 theme={null}
            import boto3

            glue_client = boto3.client('glue')

            response = glue_client.create_job(
                Name='your-job-name',
                Role='your-glue-service-role',
                Command={
                    'Name': 'glueetl',
                    'ScriptLocation': 's3://your-bucket/scripts/your-script.py'
                },
                DefaultArguments={
                    '--extra-jars': 's3://your-bucket/jars/spark-plugin-3.3_2.12-0.3.6.jar'
                }
            )
            ```
          </CodeGroup>
        </Tab>
      </Tabs>
    </Tab>
  </Tabs>
</Accordion>
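
The `--arguments` value in the Glue CLI examples is a JSON map written by hand inside shell quotes, which is easy to get wrong. As a hedged sketch, the same map can be built in Python and passed to boto3's `start_job_run`. The function names, the bucket arguments, and the `/jars/` key layout are assumptions taken from the examples on this page, not a Ryft API.

```python
import json
from typing import Dict

PLUGIN_VERSION = "0.3.6"  # version used throughout this page


def glue_job_arguments(
    jar_bucket: str, log_bucket: str, spark: str = "3.5", scala: str = "2.13"
) -> Dict[str, str]:
    """Build the Glue job arguments that load the plugin JAR and set the event log directory."""
    jar_uri = f"s3://{jar_bucket}/jars/spark-plugin-{spark}_{scala}-{PLUGIN_VERSION}.jar"
    return {
        "--extra-jars": jar_uri,
        "--conf": f"spark.eventLog.ryft.dir=s3://{log_bucket}/",
    }


def start_glue_job(job_name: str, jar_bucket: str, log_bucket: str) -> str:
    """Start a job run with the arguments above and return the run ID."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    response = boto3.client("glue").start_job_run(
        JobName=job_name,
        Arguments=glue_job_arguments(jar_bucket, log_bucket),
    )
    return response["JobRunId"]


if __name__ == "__main__":
    # Print the JSON that would go into `aws glue start-job-run --arguments`.
    print(json.dumps(glue_job_arguments("my-jars-bucket", "my-logs-bucket")))
```

Building the map in code keeps the JAR name and the plugin version in one place, so a version bump touches a single constant.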

<Accordion title="📦 Using the plugin as a dependency">
  Another way to include the plugin in Java or Scala Spark applications is by packaging it directly into your application's uberjar. This embeds the plugin as a dependency, removing the need to reference the plugin JAR separately at runtime.

  ## Add the plugin dependency

  <Tabs>
    <Tab title="Spark 3.5">
      Add this to your `pom.xml`:

      <CodeGroup>
        ```xml pom.xml theme={null}
        <dependency>
            <groupId>io.ryft</groupId>
            <artifactId>spark-plugin-3.5_2.13</artifactId>
            <version>0.3.6</version>
            <scope>compile</scope>
        </dependency>
        ```
      </CodeGroup>
    </Tab>

    <Tab title="Spark 3.3">
      Add this to your `pom.xml`:

      <CodeGroup>
        ```xml pom.xml theme={null}
        <dependency>
            <groupId>io.ryft</groupId>
            <artifactId>spark-plugin-3.3_2.12</artifactId>
            <version>0.3.6</version>
            <scope>compile</scope>
        </dependency>
        ```
      </CodeGroup>
    </Tab>
  </Tabs>

  ## Include in your uberjar

  Configure the Maven Shade plugin to include the dependency:

  <Tabs>
    <Tab title="Spark 3.5">
      <CodeGroup>
        ```xml pom.xml theme={null}
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.6.0</version>
            <configuration>
                <artifactSet>
                    <includes>
                        <include>io.ryft:spark-plugin-3.5_2.13</include>
                    </includes>
                </artifactSet>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        ```
      </CodeGroup>
    </Tab>

    <Tab title="Spark 3.3">
      <CodeGroup>
        ```xml pom.xml theme={null}
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.6.0</version>
            <configuration>
                <artifactSet>
                    <includes>
                        <include>io.ryft:spark-plugin-3.3_2.12</include>
                    </includes>
                </artifactSet>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        ```
      </CodeGroup>
    </Tab>
  </Tabs>

  <Note>
    This approach eliminates the need to configure external JARs in your Spark setup, which simplifies application bootstrap and gives you better control over version conflicts. However, it also means you'll need to recompile your application whenever the plugin is updated.
  </Note>
</Accordion>

<Accordion title="🔧 Choosing the Right Plugin Version">
  ## Supported Artifacts and Compatibility

  We publish the plugin for two Spark versions (3.3 and 3.5), each built for Scala 2.12 and 2.13. Use the table below to select the artifact that matches your environment.

  | Artifact                | Java Version | Spark Version | Scala Versions | Iceberg Version |
  | ----------------------- | ------------ | ------------- | -------------- | --------------- |
  | `spark-plugin-3.3_2.12` | Java 8+      | Spark 3.3     | 2.12           | 1.2.0+          |
  | `spark-plugin-3.3_2.13` | Java 8+      | Spark 3.3     | 2.13           | 1.2.0+          |
  | `spark-plugin-3.5_2.12` | Java 17+     | Spark 3.5     | 2.12           | 1.7.1+          |
  | `spark-plugin-3.5_2.13` | Java 17+     | Spark 3.5     | 2.13           | 1.7.1+          |

  ***

  ### 📌 Notes

  **Java Compatibility**

  * `3.3_x` plugins are compiled for Java 8 and run on Java 8+
  * `3.5_x` plugins require Java 17+

  **Iceberg Compatibility**

  * Minimum supported version is **Iceberg 1.2.0** for `3.3_x` plugins and **Iceberg 1.7.1** for `3.5_x` plugins
  * ⚠️ Using older versions is unsupported
  * ✅ For best results, use the latest Iceberg version officially supported by your Spark distribution

  **Scala Versions**

  * Each plugin is published for **Scala 2.12 and 2.13** - match your Spark distribution’s Scala version

  ***

  ### ✅ Recommended Usage

  * Use **`spark-plugin-3.3_x`** with **Spark 3.3** and **Iceberg 1.2.0+**
  * Use **`spark-plugin-3.5_x`** with **Spark 3.5+** - this is the **preferred and actively maintained** version
</Accordion>
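
The artifact naming scheme described above (`spark-plugin-<spark>_<scala>`) can be turned into a small lookup. This is only a sketch: the function name is hypothetical, and it conservatively treats any Spark line other than 3.3 and 3.5 as unsupported, which may be stricter than what Ryft actually supports.

```python
SUPPORTED_SPARK = ("3.3", "3.5")
SUPPORTED_SCALA = ("2.12", "2.13")
PLUGIN_VERSION = "0.3.6"  # version used throughout this page


def select_artifact(spark_version: str, scala_version: str) -> str:
    """Return the Maven coordinate matching a Spark and Scala version."""
    # Reduce full versions such as "3.5.1" or "2.13.8" to their major.minor part.
    spark_minor = ".".join(spark_version.split(".")[:2])
    scala_minor = ".".join(scala_version.split(".")[:2])
    if spark_minor not in SUPPORTED_SPARK:
        raise ValueError(f"no plugin artifact listed for Spark {spark_version}")
    if scala_minor not in SUPPORTED_SCALA:
        raise ValueError(f"no plugin artifact listed for Scala {scala_version}")
    return f"io.ryft:spark-plugin-{spark_minor}_{scala_minor}:{PLUGIN_VERSION}"


if __name__ == "__main__":
    print(select_artifact("3.5.1", "2.13.8"))
```

In PySpark, `spark.version` reports the Spark version of a running session, and `spark-submit --version` prints the Scala version your distribution was built with.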

### IAM Permissions

If you run on AWS Glue, ensure the job's IAM role has permission to read the S3 location containing the plugin JAR:

<CodeGroup>
  ```json IAM Policy theme={null}
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject"
              ],
              "Resource": "arn:aws:s3:::your-bucket/jars/*"
          }
      ]
  }
  ```
</CodeGroup>
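
The execution role also needs to write event logs to the bucket from step 1. As a rough sketch, both requirements can be combined into one policy document generated in Python. The function name is hypothetical, and `s3:PutObject` is an assumed minimum for writing: the filesystem connector used by your Spark distribution may require additional actions.

```python
import json
from typing import Any, Dict


def execution_role_policy(jar_bucket: str, log_bucket: str) -> Dict[str, Any]:
    """Build an IAM policy that reads the plugin JAR and writes Spark event logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read access to the plugin JAR, as in the policy above.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{jar_bucket}/jars/*",
            },
            {
                # Write access for event logs. Assumption: PutObject is enough.
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{log_bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(execution_role_policy("your-bucket", "your-logs-bucket"), indent=4))
```

The printed JSON can be pasted into an inline policy on the role, after reviewing it against your own security requirements.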

<Tip>If you are using a Ryft-managed bucket, skip the next steps.</Tip>

<Accordion title="Customer managed bucket configuration">
  ### Set up S3 to SQS notifications

  1. Create a new SQS queue that will receive notifications on new files created in your S3 bucket.
  2. Add the following policy to the queue access policy to enable receiving notifications:

  ```json theme={null}
  {
      "Version": "2012-10-17",
      "Id": "S3Notifications",
      "Statement": [
          {
              "Sid": "S3Notifications-statement",
              "Effect": "Allow",
              "Principal": {
                  "Service": "s3.amazonaws.com"
              },
              "Action": [
                  "SQS:SendMessage"
              ],
              "Resource": "<SQS-queue-ARN>",
              "Condition": {
                  "ArnLike": {
                      "aws:SourceArn": "arn:aws:s3:*:*:<yourbucket>"
                  },
                  "StringEquals": {
                      "aws:SourceAccount": "<bucket-owner-account-id>"
                  }
              }
          }
      ]
  }
  ```

  3. [Configure S3 notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html) on the event logs bucket so that events for newly created files (choose "All object create events") are sent to the newly created SQS queue.

  ### Add Ryft access policy for S3 and SQS

  Add the following access policy to the **Ryft-ControlPlaneRole** you already created. It allows Ryft to read the event logs from the bucket and the notifications from the queue.

  1. IAM → Roles → Search for "**Ryft-ControlPlaneRole**" (or the name you used)
  2. Add permissions → Create inline policy → Select the **JSON** tab
  3. Add the following policy, filling in the bucket and queue parameters:

  ```json theme={null}
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "AllowSparkEventLogAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetObject"
              ],
              "Resource": [
                  "arn:aws:s3:::<yourbucket>/*",
                  "arn:aws:s3:::<yourbucket>"
              ]
          },
          {
              "Sid": "SparkEventLogsSqsAccess",
              "Effect": "Allow",
              "Action": [
                  "sqs:ReceiveMessage",
                  "sqs:DeleteMessage"
              ],
              "Resource": "<sqs-queue-arn>"
          }
      ]
  }
  ```
</Accordion>
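
For a customer-managed bucket, the S3 to SQS notification can also be configured programmatically instead of through the console. The sketch below assumes boto3 and an existing queue whose access policy already allows S3 to send messages; the configuration ID and function names are illustrative. Note that `put_bucket_notification_configuration` replaces the bucket's existing notification configuration, so merge in any rules you already have.

```python
from typing import Any, Dict


def build_queue_notification(queue_arn: str) -> Dict[str, Any]:
    """Build a notification configuration that sends all object-create events to SQS."""
    return {
        "QueueConfigurations": [
            {
                "Id": "spark-event-logs-to-sqs",  # hypothetical configuration name
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],  # "All object create events"
            }
        ]
    }


def enable_notifications(bucket: str, queue_arn: str) -> None:
    """Attach the configuration to `bucket`. Overwrites existing notification rules."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    boto3.client("s3").put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_queue_notification(queue_arn),
    )
```

S3 validates the destination when the configuration is saved, so the call fails if the queue policy does not yet grant `SQS:SendMessage` to `s3.amazonaws.com`.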

<Check>
  You are done! Locate the URL of the queue you just created and provide it to Ryft. We will then finish setting up the integration.
  The URL should look similar to: `https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>`
</Check>
