The Ryft Spark plugin listens to Spark events in real time, which provide detailed information about the execution of Spark jobs, and writes the event logs to a dedicated S3 bucket.

Spark Plugin Configuration

  1. Create an S3 bucket in your account that will store Spark event logs, or contact your Ryft representative if you prefer to use a Ryft-managed bucket.
    It’s best to set a retention policy of at least 7 days (a lifecycle-rule sketch follows these steps).
    Note: Verify that your Spark execution role has sufficient permissions to write to the chosen bucket.
  2. Add the Spark plugin dependency to the Spark application. This is done differently depending on the deployment; for spark-submit:
spark-submit \
    --packages io.ryft:spark-plugin-3.5_2.13:0.3.6
  3. Register the Ryft Plugin and set the spark.eventLog.ryft.dir config to the bucket created in step 1:
spark-submit \
    --packages io.ryft:spark-plugin-3.5_2.13:0.3.6 \
    --conf spark.eventLog.ryft.dir=s3://<yourbucket> \
    --conf spark.plugins=io.ryft.spark.RyftSparkEventLogPlugin <jar/py file>
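If you manage the event log bucket yourself, the 7-day retention from step 1 can be applied as an S3 lifecycle rule. The following is a minimal boto3 sketch, not a required step; the bucket name and rule ID are placeholders, and the empty prefix assumes event logs are written at the bucket root:
import boto3

s3 = boto3.client("s3")

# Expire Spark event logs 7 days after creation (the minimum retention
# suggested above). "<yourbucket>" is a placeholder.
s3.put_bucket_lifecycle_configuration(
    Bucket="<yourbucket>",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-spark-event-logs",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)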

AWS Glue Spark Job Configuration

Adding the Plugin to Your Spark Session

Configure your Spark session with the Ryft plugin by adding the following configuration:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SparkSession

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Register the Ryft plugin and point the event log directory at your bucket
spark = SparkSession.builder \
    .config("spark.plugins", "io.ryft.spark.RyftSparkEventLogPlugin") \
    .config("spark.eventLog.ryft.dir", "s3://<your-bucket>") \
    .getOrCreate()

job = Job(GlueContext(spark.sparkContext))
job.init(args['JOB_NAME'], args)
Glue jobs only support a single SparkSession - make sure only one is initialized. Initializing more than one SparkSession can prevent the plugin from being loaded; a defensive check is sketched below.
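If you want to guard against this, you can verify that no session exists before building yours. This is a minimal optional sketch, not something the plugin requires:
from pyspark.sql import SparkSession

# If a SparkSession already exists, getOrCreate() silently reuses it and the
# plugin configuration above would not be applied, so fail fast instead.
if SparkSession.getActiveSession() is not None:
    raise RuntimeError(
        "A SparkSession already exists; configure the Ryft plugin on the "
        "first and only session in this Glue job."
    )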

Uploading the Plugin JAR to S3

AWS Glue jobs require the plugin JAR to be available in S3. You can upload it directly from Maven Central using this command:
curl -L https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/0.3.6/spark-plugin-3.5_2.13-0.3.6.jar | \
  aws s3 cp - s3://YOUR_BUCKET_NAME/jars/spark-plugin-3.5_2.13-0.3.6.jar
Replace YOUR_BUCKET_NAME with your actual bucket name. Ensure your Glue job has the necessary IAM permissions to read from this S3 location.
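If you prefer not to shell out to curl and the AWS CLI, the same transfer can be scripted. A minimal boto3 sketch, assuming the Maven Central URL shown above and a placeholder bucket name:
import urllib.request
import boto3

url = ("https://repo1.maven.org/maven2/io/ryft/spark-plugin-3.5_2.13/"
       "0.3.6/spark-plugin-3.5_2.13-0.3.6.jar")

# Download the plugin JAR from Maven Central, then upload it to your bucket.
# Replace YOUR_BUCKET_NAME with your actual bucket name.
with urllib.request.urlopen(url) as response:
    jar_bytes = response.read()

boto3.client("s3").put_object(
    Bucket="YOUR_BUCKET_NAME",
    Key="jars/spark-plugin-3.5_2.13-0.3.6.jar",
    Body=jar_bytes,
)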

Configuring the Extra JARs Parameter

The Glue job needs to include the plugin JAR using the --extra-jars parameter. This can be configured in several ways; via the AWS console:
  1. Navigate to AWS Glue → Jobs → Select your job
  2. Go to the Job details tab
  3. Scroll to Advanced properties
  4. In the Job parameters section, add:
    • Key: --extra-jars
    • Value: s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar
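Alternatively, the parameter can be supplied per run through the Glue API; per-run arguments override the job's default parameters. A minimal boto3 sketch (the job name and JAR path are placeholders):
import boto3

glue = boto3.client("glue")

# Start a run of an existing Glue job, injecting the plugin JAR for this run.
# "my-glue-job" and the S3 path below are placeholders - use your own values.
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--extra-jars": "s3://your-bucket/jars/spark-plugin-3.5_2.13-0.3.6.jar",
    },
)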
Another way to include the plugin in Java or Scala Spark applications is by packaging it directly into your application’s uberjar. This embeds the plugin as a dependency, removing the need to reference the plugin JAR separately at runtime.

Add the plugin dependency

Add this to your pom.xml:
<dependency>
    <groupId>io.ryft</groupId>
    <artifactId>spark-plugin-3.5_2.13</artifactId>
    <version>0.3.6</version>
    <scope>compile</scope>
</dependency>

Include in your uberjar

Configure the Maven Shade plugin to include the dependency:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.6.0</version>
    <configuration>
        <artifactSet>
            <includes>
                <include>io.ryft:spark-plugin-3.5_2.13</include>
            </includes>
        </artifactSet>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
This approach eliminates the need to configure external JARs in your Spark setup, which simplifies application bootstrap and gives you better control over version conflicts. However, it also means you’ll need to recompile your application whenever the plugin is updated.

Supported Artifacts and Compatibility

Choosing the Right Plugin Version

We publish the plugin for two Spark lines (3.3 and 3.5), each built for Scala 2.12 and 2.13. Use the table below to select the artifact that matches your environment.

Artifact              | Java Version | Spark Version | Scala Version | Iceberg Version
----------------------|--------------|---------------|---------------|----------------
spark-plugin-3.3_2.12 | Java 8+      | Spark 3.3     | 2.12          | 1.2.0+
spark-plugin-3.3_2.13 | Java 8+      | Spark 3.3     | 2.13          | 1.2.0+
spark-plugin-3.5_2.12 | Java 17+     | Spark 3.5     | 2.12          | 1.7.1+
spark-plugin-3.5_2.13 | Java 17+     | Spark 3.5     | 2.13          | 1.7.1+

📌 Notes

Java Compatibility
  • 3.3_x plugins are compiled for Java 8 and run on Java 8+
  • 3.5_x plugins require Java 17+
Iceberg Compatibility
  • Minimum supported version is Iceberg 1.2.0
  • ⚠️ Using older versions is unsupported
  • ✅ For best results, use the latest Iceberg version officially supported by your Spark distribution
Scala Versions
  • Each plugin is published for Scala 2.12 and 2.13 - match your Spark distribution’s Scala version

Recommendations
  • Use spark-plugin-3.3_x with Spark 3.3 and Iceberg 1.2.0+
  • Use spark-plugin-3.5_x with Spark 3.5+ - this is the preferred and actively maintained version
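If you are unsure which Spark and Scala versions your environment runs, you can check from an existing session. A minimal PySpark sketch; note that the _jvm handle is a PySpark internal used here only for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# e.g. "3.5.1" -> pick a spark-plugin-3.5_x artifact
print("Spark version:", spark.version)

# e.g. "2.12.18" -> pick the _2.12 build (reads the JVM's Scala properties)
print("Scala version:", spark.sparkContext._jvm.scala.util.Properties.versionNumberString())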

IAM Permissions

Ensure your Glue job’s IAM role has permissions to access the S3 bucket containing the JAR file:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::your-bucket/jars/*"
        }
    ]
}

If you are using a Ryft-managed bucket, you can skip the following steps.

Setup S3 to SQS notifications

  1. Create a new SQS queue that will receive notifications on new files created in your S3 bucket.
  2. Add the following to the queue's access policy to allow S3 to send notifications to it:
{
    "Version": "2012-10-17",
    "Id": "S3Notifications",
    "Statement": [
        {
            "Sid": "S3Notifications-statement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": [
                "SQS:SendMessage"
            ],
            "Resource": "<SQS-queue-ARN>",
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:s3:*:*:<yourbucket>"
                },
                "StringEquals": {
                    "aws:SourceAccount": "<bucket-owner-account-id>"
                }
            }
        }
    ]
}
  3. Configure S3 event notifications on the event logs bucket for new object creation (choose “All object create events”), delivered to the newly created SQS queue. A boto3 sketch of these three steps follows below.
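The same setup can be scripted. A minimal boto3 sketch of the three steps above; the queue name, bucket name, and account ID are placeholders:
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

bucket = "<yourbucket>"
account_id = "<bucket-owner-account-id>"

# 1. Create the queue that will receive the S3 notifications (example queue name)
queue_url = sqs.create_queue(QueueName="ryft-spark-event-logs")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# 2. Allow S3 to send messages to the queue (same policy as shown above)
policy = {
    "Version": "2012-10-17",
    "Id": "S3Notifications",
    "Statement": [{
        "Sid": "S3Notifications-statement",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "SQS:SendMessage",
        "Resource": queue_arn,
        "Condition": {
            "ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{bucket}"},
            "StringEquals": {"aws:SourceAccount": account_id},
        },
    }],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)})

# 3. Notify the queue on every object created in the event logs bucket
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": queue_arn,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)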

Add Ryft access policy to S3 and SQS

Add the following access policy to the Ryft-ControlPlaneRole you already created, to allow reading notifications from the queue.
  1. IAM → Roles → Search for “Ryft-ControlPlaneRole” (or the name you used)
  2. Add permissions → Create inline policy → Select the JSON tab
  3. Add the following policy, filling in the bucket and queue parameters (a boto3 alternative is sketched after the policy):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSparkEventLogAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<yourbucket>/test/*",
                "arn:aws:s3:::<yourbucket>"
            ]
        },
        {
            "Sid": "SparkEventLogsSqsAccess",
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage"
            ],
            "Resource": "<sqs-queue-arn>"
        }
    ]
}
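If you prefer to attach this from code rather than the console, the inline policy can be added with a put_role_policy call. A minimal boto3 sketch; the role name, policy name, bucket, and queue ARN are placeholders and should match the values used above:
import json
import boto3

iam = boto3.client("iam")

# Policy document mirroring the JSON above (bucket and queue ARN are placeholders)
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSparkEventLogAccess",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::<yourbucket>/*",
                "arn:aws:s3:::<yourbucket>",
            ],
        },
        {
            "Sid": "SparkEventLogsSqsAccess",
            "Effect": "Allow",
            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage"],
            "Resource": "<sqs-queue-arn>",
        },
    ],
}

# Attach as an inline policy on the control-plane role; "Ryft-ControlPlaneRole"
# should match the role name you created earlier.
iam.put_role_policy(
    RoleName="Ryft-ControlPlaneRole",
    PolicyName="RyftSparkEventLogs",
    PolicyDocument=json.dumps(policy),
)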
You are done! Locate the URL of the queue you just created and provide it to Ryft; we will finish setting up the integration. The URL should look similar to: https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>