The Ryft Spark plugin listens to Spark events in real time, which provide detailed information about the execution of Spark jobs, and writes the event logs to a dedicated S3 bucket.

Spark Plugin Configuration

  1. Create an S3 bucket in your account that will store Spark event logs, or contact your Ryft representative if you prefer to use a Ryft-managed bucket.
    It’s best to set a retention policy of at least 7 days (a lifecycle-rule sketch follows these steps).
    Note: Verify that your Spark execution role has sufficient permissions to write to the chosen bucket.
  2. Add the Spark plugin dependency to the Spark application. This is done differently depending on the deployment; for spark-submit:
spark-submit \
    --packages io.ryft:spark-plugin-3.5_2.13:0.3.6
  3. Register the Ryft Plugin and set the spark.eventLog.ryft.dir config to the bucket created in step 1:
spark-submit \
    --packages io.ryft:spark-plugin-3.5_2.13:0.3.6 \
    --conf spark.eventLog.ryft.dir=s3://<yourbucket> \
    --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin <jar/py file>
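If you manage the event log bucket yourself, the 7-day retention from step 1 can be applied as an S3 lifecycle rule. The following is a minimal boto3 sketch, not a required step; the bucket name and rule ID are placeholders, and the empty prefix assumes event logs are written at the bucket root:
import boto3

s3 = boto3.client("s3")

# Expire Spark event logs 7 days after creation (the minimum retention
# suggested above). "<yourbucket>" is a placeholder.
s3.put_bucket_lifecycle_configuration(
    Bucket="<yourbucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-spark-event-logs",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)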

AWS Glue Spark Job Configuration

Adding the Plugin to Your Spark Session

Configure your Spark session with the Ryft plugin by adding the following configuration:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Register the Ryft plugin and point the event log directory at your bucket
spark = SparkSession.builder \
    .config("spark.plugins", "io.ryft.spark.RyftSparkEventLogPlugin") \
    .config("spark.eventLog.ryft.dir", "s3://<your-bucket>") \
    .getOrCreate()

job = Job(GlueContext(spark.sparkContext))
job.init(args['JOB_NAME'], args)
Glue jobs only support a single SparkSession - make sure only one is initialized. Initializing more than one SparkSession can prevent the plugin from being loaded; a defensive check is sketched below.
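If you want to guard against this, you can verify that no session exists before building yours. This is a minimal optional sketch, not something the plugin requires:
from pyspark.sql import SparkSession

# If a SparkSession already exists, getOrCreate() silently reuses it and the
# plugin configuration above would not be applied, so fail fast instead.
if SparkSession.getActiveSession() is not None:
    raise RuntimeError(
        "A SparkSession already exists; configure the Ryft plugin on the "
        "first and only session in this Glue job."
    )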

Uploading the Plugin JAR to S3

AWS Glue jobs require the plugin JAR to be available in S3. You can upload it directly from Maven Central using this command:
curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/0.3.6/spark-plugin-3.5_2.13-0.3.6.jar | \
  aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.5_2.13-0.3.6.jar
Replace YOUR_BUCKET_NAME with your actual bucket name. Ensure your Glue job has the necessary IAM permissions to read from this S3 location.
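If you prefer not to shell out to curl and the AWS CLI, the same transfer can be scripted. A minimal boto3 sketch, assuming the Maven Central URL shown above and a placeholder bucket name:
import urllib.request
import boto3

url = ("https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/"
       "0.3.6/spark-plugin-3.5_2.13-0.3.6.jar")

# Download the plugin JAR from Maven Central, then upload it to your bucket.
# Replace YOUR_BUCKET_NAME with your actual bucket name.
with urllib.request.urlopen(url) as response:
    jar_bytes = response.read()

boto3.client("s3").put_object(
    Bucket="YOUR_BUCKET_NAME",
    Key="jars/spark-plugin-3.5_2.13-0.3.6.jar",
    Body=jar_bytes,
)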

Configuring the Extra JARs Parameter

The Glue job needs to include the plugin JAR using the --extra-jars parameter. This can be configured in several ways; via the AWS console:
  1. Navigate to AWS Glue → Jobs → Select your job
  2. Go to the Job details tab
  3. Scroll to Advanced properties
  4. In the Job parameters section, add:
    • Key: --extra-jars
    • Value: s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar
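Alternatively, the parameter can be supplied per run through the Glue API; per-run arguments override the job's default parameters. A minimal boto3 sketch (the job name and JAR path are placeholders):
import boto3

glue = boto3.client("glue")

# Start a run of an existing Glue job, injecting the plugin JAR for this run.
# "my-glue-job" and the S3 path below are placeholders - use your own values.
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--extra-jars": "s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar",
    },
)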
Another way to include the plugin in Java or Scala Spark applications is by packaging it directly into your application’s uberjar. This embeds the plugin as a dependency, removing the need to reference the plugin JAR separately at runtime.

Add the plugin dependency

Add this to your pom.xml:
<dependency>
    <groupId>io.ryft</groupId>
    <artifactId>spark-plugin-3.5_2.13</artifactId>
    <version>0.3.6</version>
    <scope>compile</scope>
</dependency>

Include in your uberjar

Configure the Maven Shade plugin to include the dependency:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.6.0</version>
    <configuration>
        <artifactSet>
            <includes>
                <include>io.ryft:spark-plugin-3.5_2.13</include>
            </includes>
        </artifactSet>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
This approach eliminates the need to configure external JARs in your Spark setup, which simplifies application bootstrap and gives you better control over version conflicts. However, it also means you’ll need to recompile your application whenever the plugin is updated.

Supported Artifacts and Compatibility

Choosing the Right Plugin Version

We publish the plugin for two Spark lines (3.3 and 3.5), each built for Scala 2.12 and 2.13. Use the table below to select the artifact that matches your environment.

Artifact              | Java Version | Spark Version | Scala Version | Iceberg Version
----------------------|--------------|---------------|---------------|----------------
spark-plugin-3.3_2.12 | Java 8+      | Spark 3.3     | 2.12          | 1.2.0+
spark-plugin-3.3_2.13 | Java 8+      | Spark 3.3     | 2.13          | 1.2.0+
spark-plugin-3.5_2.12 | Java 17+     | Spark 3.5     | 2.12          | 1.7.1+
spark-plugin-3.5_2.13 | Java 17+     | Spark 3.5     | 2.13          | 1.7.1+

📌 Notes

Java Compatibility
  • 3.3_x plugins are compiled for Java 8 and run on Java 8+
  • 3.5_x plugins require Java 17+
Iceberg Compatibility
  • Minimum supported version is Iceberg 1.2.0
  • ⚠️ Using older versions is unsupported
  • ✅ For best results, use the latest Iceberg version officially supported by your Spark distribution
Scala Versions
  • Each plugin is published for Scala 2.12 and 2.13 - match your Spark distribution’s Scala version

Recommendations
  • Use spark-plugin-3.3_x with Spark 3.3 and Iceberg 1.2.0+
  • Use spark-plugin-3.5_x with Spark 3.5+ - this is the preferred and actively maintained version
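If you are unsure which Spark and Scala versions your environment runs, you can check from an existing session. A minimal PySpark sketch; note that the _jvm handle is a PySpark internal used here only for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# e.g. "3.5.1" -> pick a spark-plugin-3.5_x artifact
print("Spark version:", spark.version)

# e.g. "2.12.18" -> pick the _2.12 build (reads the JVM's Scala properties)
print("Scala version:", spark.sparkContext._jvm.scala.util.Properties.versionNumberString())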

IAM Permissions

Ensure your Glue job’s IAM role has permissions to access the S3 bucket containing the JAR file:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::your-bucket/jars/*"
        }
    ]
}

If you are using a Ryft-managed bucket, you can skip the following steps.

Setup S3 to SQS notifications

  1. Create a new SQS queue that will receive notifications on new files created in your S3 bucket.
  2. Add the following to the queue's access policy to allow S3 to send notifications to it:
{
    "Version": "2012-10-17",
    "Id": "S3Notifications",
    "Statement": [
        {
            "Sid": "S3Notifications-statement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": [
                "SQS:SendMessage"
            ],
            "Resource": "<SQS-queue-ARN>",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:*:*:<yourbucket>"
                },
                "StringEquals": {
                    "aws:SourceAccount": "<bucket-owner-account-id>"
                }
            }
        }
    ]
}
  3. Configure S3 event notifications on the event logs bucket for new object creation (choose “All object create events”), delivered to the newly created SQS queue. A boto3 sketch of these three steps follows below.
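The same setup can be scripted. A minimal boto3 sketch of the three steps above; the queue name, bucket name, and account ID are placeholders:
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

bucket = "<yourbucket>"
account_id = "<bucket-owner-account-id>"

# 1. Create the queue that will receive the S3 notifications (example queue name)
queue_url = sqs.create_queue(QueueName="ryft-spark-event-logs")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# 2. Allow S3 to send messages to the queue (same policy as shown above)
policy = {
    "Version": "2012-10-17",
    "Id": "S3Notifications",
    "Statement": [{
        "Sid": "S3Notifications-statement",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "SQS:SendMessage",
        "Resource": queue_arn,
        "Condition": {
            "ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{bucket}"},
            "StringEquals": {"aws:SourceAccount": account_id},
        },
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})

# 3. Notify the queue on every object created in the event logs bucket
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)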

Add Ryft access policy to S3 and SQS

Add the following access policy to the Ryft-ControlPlaneRole you already created, to allow reading notifications from the queue.
  1. IAM → Roles → Search for “Ryft-ControlPlaneRole” (or the name you used)
  2. Add permissions → Create inline policy → Select the JSON tab
  3. Add the following policy, filling in the bucket and queue parameters (a boto3 alternative is sketched after the policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSparkEventLogAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<yourbucket>/test/*",
                "arn:aws:s3:::<yourbucket>"
            ]
        },
        {
            "Sid": "SparkEventLogsSqsAccess",
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage"
            ],
            "Resource": "<sqs-queue-arn>"
        }
    ]
}
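If you prefer to attach this from code rather than the console, the inline policy can be added with a put_role_policy call. A minimal boto3 sketch; the role name, policy name, bucket, and queue ARN are placeholders and should match the values used above:
import json
import boto3

iam = boto3.client("iam")

# Policy document mirroring the JSON above (bucket and queue ARN are placeholders)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSparkEventLogAccess",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::<yourbucket>/*",
                "arn:aws:s3:::<yourbucket>",
            ],
        },
        {
            "Sid": "SparkEventLogsSqsAccess",
            "Effect": "Allow",
            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
            "Resource": "<sqs-queue-arn>",
        },
    ],
}

# Attach as an inline policy on the control-plane role; "Ryft-ControlPlaneRole"
# should match the role name you created earlier.
iam.put_role_policy(
    RoleName="Ryft-ControlPlaneRole",
    PolicyName="RyftSparkEventLogs",
    PolicyDocument=json.dumps(policy),
)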
You are done! Locate the URL of the queue you just created and provide it to Ryft; we will finish setting up the integration. The URL should look similar to: https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>