1 - About Lucille

Understanding what Lucille is, why you should use it, and how it works.

What is Lucille?

Lucille is a production-grade Search ETL solution designed to efficiently get data into Lucene-based search engines such as Elasticsearch, OpenSearch, and Solr, as well as vector databases such as Pinecone and Weaviate. Lucille enables complex processing of documents before they are indexed by the search engine, freeing up resources that the search engine can use to execute queries with greater speed.

Lucille is Java-based and open-source. Lucille supports batch, incremental, and streaming data ingestion architectures.

Why use Lucille?

Search ETL is a category of ETL problem where data must be extracted from a source system, transformed, and loaded into a search engine.

A Search ETL solution must speak the language of search: it must represent data in the form of search-engine-ready Documents, it must know how to enrich Documents to support common search use cases, and it must follow best practices for interacting with search engines including support for batching, routing, and versioning.

To be production-grade, a search ETL solution must be scalable, reliable, and easy to use. It should support parallel Document processing, it should be observable, it should be easy to configure, it should have extensive test coverage, and it should have been hardened through multiple challenging real-world deployments.

Lucille handles all of these things so you don’t have to. Lucille helps you get your data into Lucene-based search engines like Apache Solr, Elasticsearch, or OpenSearch as well as vector-based search engines like Pinecone and Weaviate, and it helps you keep that search engine content up-to-date as your backend data changes. Lucille does this in a way that scales as your data volume grows, and in a way that’s easy to evolve as your data transformation requirements become more complex. Lucille implements search best practices so you can stay focused on your data itself and what you want to do with it.

How Does Lucille Work?

The basic architectural ideas of Lucille are as follows:

  1. A Connector retrieves data from a source system.
  2. Worker(s) enrich the data.
  3. Indexer(s) index the data into a search engine.
  4. These three core components (Connectors, Workers, and Indexers) run concurrently and communicate with each other using a messaging framework.
    • The core components can function as threads inside a single JVM, allowing all of Lucille to run as one Java process and providing a simple, easy deployment model.
    • The core components can function as standalone Java processes communicating through an external Apache Kafka message broker, allowing for massive scale.
  5. Documents are enriched en route to the search engine using a Pipeline built from composable processing Stages. The pipeline is configuration-driven.
  6. Document lifecycle events (such as creation, indexing, and erroring out) are tracked so that the framework can determine when all the work in a batch ingest is complete.

More Information

The Lucille project is developed and maintained by KMW Technology (kmwllc.com). For more information regarding Lucille, please contact us.

2 - Getting Started

Understanding the basics to quickly get started.

Installation

See the installation guide to install prerequisites, clone the repository, and build Lucille.

Try it out

Lucille includes a few examples in the lucille-examples module to help you get started.

To see how to ingest the contents of a local CSV file into an instance of Apache Solr, refer to the simple-csv-solr-example.

To run this example, start an instance of Apache Solr on port 8983 and create a collection called quickstart. For more information about how to use Solr, see the Apache Solr Reference Guide.

Go to lucille-examples/lucille-simple-csv-solr-example in your working copy of Lucille and run:

mvn clean install

./scripts/run_ingest.sh

This script executes Lucille with a configuration file named simple-csv-solr-example.conf that tells Lucille to read a CSV of top songs and send each row as a document to Solr.

Run a commit with openSearcher=true on your quickstart collection to make the documents visible, as shown below. Then go to your Solr admin dashboard, execute a *:* query, and you should see the songs from the source file as Solr documents.
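
If Solr is running locally on port 8983 as described above, one way to issue that commit (assuming curl is available) is via Solr’s update endpoint:

curl "http://localhost:8983/solr/quickstart/update?commit=true&openSearcher=true"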

Quick Start Guide - Local Mode

Scope: The steps below run Lucille from a source build (built locally with Maven).

What is Local Mode?

Local mode runs all Lucille components (connector, pipeline, and indexer) inside a single JVM process that you start locally. Your configuration may still interact with external systems (e.g., S3, Solr, OpenSearch/Elasticsearch), but the Lucille runtime itself executes entirely within that single JVM.

Prepare a Configuration File

You’ll run Lucille by pointing it at a config file that declares your connectors, pipelines, and indexers. See the configuration docs for the full schema and supported components.

Run Lucille Locally

From the repository root, run the Runner with your config file:

java \
  -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
  -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
  com.kmwllc.lucille.core.Runner

What this Does

  • -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> tells Lucille where to find your configuration.
  • -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' loads Lucille and its dependencies.
  • com.kmwllc.lucille.core.Runner boots the Lucille engine in local mode and runs the configured pipeline to completion.

Trouble Running Lucille?

See the troubleshooting guide for common pitfalls.

Quick Start Guide - Distributed Mode

What is Distributed Mode?

Distributed mode allows you to scale Lucille to take advantage of available hardware by running each Lucille component in its own JVM and using Kafka for document transport and event tracking. You start:

  • A Runner (Publisher + Connectors) to publish documents onto Kafka.
  • One or more Workers to process the documents through a pipeline.
  • An Indexer to write the processed documents to your destination (Solr, OpenSearch, Elasticsearch, CSV, etc.).

This guide assumes Kafka and your destination system are already running and reachable. This guide focuses on running Lucille itself. For details on configuration structure and component options, see the corresponding docs.

Prepare a Configuration File

You’ll run Lucille by pointing it at a config file that declares your pipeline. See the configuration docs for the full schema and supported components.

Use a single config that defines your connector(s), your pipeline(s), your kafka configuration, and your indexer and its backend config (e.g., solr {}, opensearch {}, etc.).
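
As a minimal sketch, the kafka block typically points Lucille at Kafka client property files (the file paths below are placeholders; see the Config docs for the full set of kafka options):

kafka {
  consumerPropertyFile: "conf/kafka-consumer.properties"
  producerPropertyFile: "conf/kafka-producer.properties"
  adminPropertyFile: "conf/kafka-admin.properties"
}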

Start Components (Separate JVMs)

A) Start the Runner (publishes to Kafka)

The runner publishes documents to the Kafka source topic, listens for pipeline run events, logs run statistics, and waits for the run to complete.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Runner \
 -useKafka

B) Start one or more Workers

Each worker consumes documents from the Kafka source topic, processes each document through the configured pipeline, and writes the processed documents to the Kafka destination topic.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Worker \
 simple_pipeline

C) Start the Indexer

The indexer consumes documents from the Kafka destination topic and sends batches of processed documents to the configured search backend.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Indexer \
 simple_pipeline

What this Does

  • -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> tells Lucille where to find your configuration.
  • -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' loads Lucille and its dependencies.
  • com.kmwllc.lucille.core.Runner -useKafka starts the run and interacts with Kafka as described above.
  • com.kmwllc.lucille.core.Worker <pipelineName> processes documents through the configured pipeline as described above.
  • com.kmwllc.lucille.core.Indexer <pipelineName> writes processed documents to the configured backend as described above.

Trouble Running Lucille?

See the troubleshooting guide for common pitfalls.

Verifying Your Lucille Run

  • Logs: You should see Lucille start up, load your configuration, report component initialization, record counts, and completion status.

    During the run, you will see throughput and latency metrics like:

    25/10/31 13:40:21 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO WorkerPool: 27017 docs processed. One minute rate: 1787.10 docs/sec. Mean pipeline latency: 10.63 ms/doc.
    25/10/31 13:40:22 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO PublisherImpl: 37029 docs published. One minute rate: 3225.69 docs/sec. Mean connector latency: 0.00 ms/doc. Waiting on 21014 docs.
    25/10/31 13:40:22 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Indexer: 17016 docs indexed. One minute rate: 455.07 docs/sec. Mean backend latency: 6.90 ms/doc.
    

    At completion, Lucille prints a stage-by-stage performance summary and a final run result:

    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Stage: Stage test_source metrics. Docs processed: 200000. Mean latency: 0.0003 ms/doc. Children: 0. Errors: 0.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Stage: Stage test_summary metrics. Docs processed: 200000. Mean latency: 0.3532 ms/doc. Children: 0. Errors: 0.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Runner: 
    RUN SUMMARY: Success. 1/1 connectors complete. All published docs succeeded.
    connector1: complete. 200000 docs succeeded. 0 docs failed. 0 docs dropped. Time: 416.47 secs.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Runner: Run took 417.46 secs.
    
  • Output: View your target service (e.g., Elasticsearch) to verify your index.

2.1 - Installation

A guide to installing Lucille locally.

Prerequisites

To build and run Lucille from source, you need:

  • Java 17+ JDK (not just a JRE)
  • Maven (recent version)

Java Setup (JDK 17+ Required)

Important: Before running any Lucille commands, make sure JAVA_HOME points to a JDK 17+ (not just a JRE) and that $JAVA_HOME/bin is on your PATH (or %JAVA_HOME%\bin on Windows). Maven and the java launcher rely on this.

Verify Java

java -version

You should see version 17 (or newer). If it’s missing or older than 17, install a JDK 17+ using one of the options below.

Install Options

Package manager

  • macOS (Homebrew)
    brew install openjdk@17
    
  • Windows (Chocolatey)
    choco install microsoft-openjdk17
    

Vendor installer

  • Download a JDK 17+ installer from a vendor such as Oracle JDK.
  • Run the installer, then set JAVA_HOME as shown below.

Set JAVA_HOME and PATH

macOS

export JAVA_HOME="$(/usr/libexec/java_home -v 17)"
export PATH="$JAVA_HOME/bin:$PATH"

Windows

  • Open System Properties, Environment Variables.
  • Create/Edit JAVA_HOME and point it to your JDK folder.
  • Edit Path and add %JAVA_HOME%\bin above other Java entries.

Maven Setup

mvn -v

You should see a recent Maven version and your Java home. If mvn is not found, install Maven using one of the options below.

Install Options

Package manager

  • macOS (Homebrew)
    brew install maven
    
  • Windows (Chocolatey)
    choco install maven
    

Binary installer

  • Download the binary zip/tar for Apache Maven from the official website.
  • Add Maven’s bin/ to your PATH.

macOS

export PATH="<maven-dir>/bin:$PATH"

Windows

  • Open System Properties, Environment Variables.
  • Edit Path and add <maven-dir>/bin.

Clone the Repository

git clone https://github.com/kmwtechnology/lucille.git

Build Lucille

cd lucille
mvn clean install

This compiles all modules and produces build artifacts under each module’s target/ folder.

3 - Architecture

Understanding Lucille’s core components & topology.

3.1 - Topology

Lucille can be configured to best support your use case.

Local Development Modes

Alternatively referred to as standalone modes or single-node modes. In local mode, all Lucille components run as threads within a single Java process (JVM).
Lucille supports two types of local modes: Local and Kafka Local.

Local

A single Java process (JVM). Intra-component (thread) communication is via an in-memory queue.

An architecture diagram showing local mode.

Kafka Local

A single Java process (JVM), with the exception of an externalized instance of Kafka. Intra-component (thread) communication is via Kafka.

An architecture diagram showing kafka local mode.

Distributed Modes

Distributed modes are how Lucille scales. Kafka is utilized for message persistence/fault tolerance, and Lucille components run as multiple separate Java processes.

Fully Distributed

Best for batch ingest architecture.

Connector/Publisher, Workers, and Indexers are all separate processes, each running in its own JVM. Intra-component communication is via externalized Kafka topics/queues. Events are sent from the Workers and Indexers to a Kafka event topic and read by the Lucille Publisher.

An architecture diagram showing fully distributed mode.

Connector-less Distributed

Best for incremental ingest architecture.

Similar to Fully Distributed Mode, but as the name implies, there is no Lucille Connector or Publisher. Instead, this mode supports an arbitrary third-party “publisher” feeding documents onto a Kafka source queue, where the documents are then picked up by worker threads.

In this mode, events may be enabled and sent to a Kafka event queue. In order to send events, the third-party publisher would be responsible for creating a run ID and stamping documents with this ID.

An architecture diagram showing connectorless distributed mode.

Hybrid

Best for streaming update architecture.

Similar to Connector-less Distributed Mode. However, in hybrid mode worker threads read from a source Kafka topic and publish to an in-memory queue read by a worker-indexer (an indexer paired with a worker).

An architecture diagram showing hybrid mode.

3.2 - JavaDocs

Structured documentation describing classes, methods, and fields to help developers understand and use the code effectively.

Javadocs for the Lucille Project are available here.

3.3 - Components

A reference guide for understanding and using the core components of Lucille.

3.3.1 - Document

The basic unit of data that is sent through a Pipeline and eventually indexed into a search engine.

A Lucille Document is the basic unit of data that gets sent through a pipeline and indexed.

A Document is a set of named fields which may contain single values or lists of values, represented in JSON. Each Document has a unique id.

3.3.2 - Events

As Lucille runs, it generates Events that inform on success, failure, etc.

Lucille Events

As a document passes through the various stages of the Lucille ETL pipeline, and as it is handled by the Indexer, Event messages are generated.

Connectors listen for these events to ensure that all of the documents sent for processing are successfully processed and accounted for.

Errors during processing are reported and tracked via these event messages so the connector can report back the overall success or failure of the execution.

Event Topics

A Kafka Topic that contains event messages. The event messages are sent from stages in the Lucille Pipeline as well as from the indexer.

The event topic name is based on a run ID which is stamped on documents that are flowing through the system. Whenever events are enabled, documents need to have a run ID on them.

In the case where there is no Runner to create the run ID and no Lucille Publisher to stamp that run ID on the documents, the “third-party publisher” that is putting document JSON onto Kafka would need to include a run ID on those documents. It could choose its own run ID to use.

3.3.3 - Publisher

Provides a way to publish Documents for processing by the pipeline.

The Publisher provides a way to publish documents for processing by the pipeline. When published, a Lucille document becomes available for consumption by any active pipeline Worker.

The Publisher is aware of every document in the run that needs to be indexed, and determines when to end a run by reading all the specific document events.

Publisher also:

  • Is responsible for stamping a designated run_id on each published document and maintaining accounting details specific to a run.
  • Accepts incoming events related to published documents and their children (via in-memory queue or Kafka Event Topic).

3.3.4 - Runner

Component that manages a Lucille Run, end-to-end.

When you run Lucille at the command line in standalone mode, you are invoking the Runner. When invoked, the runner reads the configuration file and then begins a Lucille run by launching the appropriate connector(s) and publisher. The runner generates a runId per Lucille run and terminates based on messages sent by the Publisher.

What the Runner invokes can be thought of as an end-to-end Lucille Run.

A Lucille Run is a sequence of connectors to be run, one after the other. Each connector feeds a specific pipeline. A run can include multiple connectors feeding multiple pipelines.

3.3.5 - Worker

A thread that retrieves a published document and passes it through the pipeline, then sends completed documents to a destination queue.

3.3.6 - Config

The Config is a HOCON file where you define the settings for running Lucille.

Lucille Configuration

When you run Lucille, you provide a path to a file which provides configuration for your run. Configuration (Config) files use HOCON, a superset of JSON. This file defines all the components in your Lucille run.

Quick references

A complete config file must contain three elements (Connector(s), Pipeline(s), Indexer):

Connectors

Connectors read data from a source and emit it as a sequence of individual Documents, which will then be sent to a Pipeline for enrichment.

connectors should be populated with a list of Connector configurations.

See Connectors for more information about configuring Connectors.

Pipeline and Stages

A pipeline is a list of Stages that will be applied to incoming Documents, preparing them for indexing. As each Connector executes, the Documents it publishes can be processed by a Pipeline, made up of Stages.

pipelines should be populated with a list of Pipeline configurations. Each Pipeline needs two values: name, the name of the Pipeline, and stages, a list of the Stages to use. Multiple connectors may feed to the same Pipeline.

See Stages for more information about configuring Stages.

Indexer

An indexer sends processed Documents to a specific destination. Only one Indexer can be defined; all pipelines will feed to the same Indexer.

A full indexer configuration has two separate config blocks: first, the generic indexer configuration, and second, configuration for the specific indexer used in your run. For example, to use the SolrIndexer, you provide separate indexer and solr config blocks.

See Indexers for more information about configuring your Indexer.
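
Putting the three elements together, a minimal config might look like the sketch below. It reuses the SequenceConnector, the AddRandomBoolean Stage, and the CSV Indexer described elsewhere in these docs; the output path is a placeholder.

connectors: [
  {
    name: "connector1"
    class: "com.kmwllc.lucille.connector.SequenceConnector"
    pipeline: "pipeline1"
    numDocs: 100
  }
]

pipelines: [
  {
    name: "pipeline1"
    stages: [
      {
        name: "AddRandomBoolean-First"
        class: "com.kmwllc.lucille.stage.AddRandomBoolean"
        field_name: "rand_bool_1"
        percent_true: 65
      }
    ]
  }
]

indexer {
  type: "CSV"
}

csv {
  path: "output.csv"
}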

Other Run Configuration

In addition to those three elements, you can also configure other parts of a Lucille run.

  • publisher - Define the queueCapacity.
  • log
  • runner
  • zookeeper
  • kafka - Provide a consumerPropertyFile, producerPropertyFile, adminPropertyFile, and other configuration.
  • worker - Control how many threads you want, the maxRetries in Zookeeper, and more.

Validation

Lucille validates the Config you provide for Connectors, Stages, and Indexers. For example, in a Stage, if you provide a property the Stage does not use, an Exception will be thrown. An Exception will also be thrown if you do not provide a property required by the Stage.

If you want to validate your config file without starting an actual run, you can use our command-line validation tool. Just add -validate to the end of your command executing Lucille. Any errors in your config will be printed to the console, and no actual run will take place.
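
For example, reusing the local-mode command from the Getting Started guide:

java \
  -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
  -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
  com.kmwllc.lucille.core.Runner \
  -validate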

Config Validation

Lucille components (like Stages, Indexers, and Connectors) each take in a set of specific arguments to configure the component correctly. Sometimes, certain properties are required - like the pathsToStorage for your FileConnector traversal, or the path for your CSVIndexer. Other properties are optional / do not always have to be specified.

For these components, developers must declare a Spec which defines the properties that are required or optional. They must also declare what type each property is (number, boolean, string, etc.). For example, the SequenceConnector requires you to specify the numDocs you want to create, and optionally, the number you want IDs to startWith. So, the Spec looks like this:

public static final Spec SPEC = SpecBuilder.connector()
      .requiredNumber("numDocs")
      .optionalNumber("startWith")
      .build();

Declaring a Spec

Lucille is designed to access Specs reflectively. If you build a Stage/Indexer/Connector/File Handler, you need to declare a public static Spec named SPEC (exactly). Failure to do so will not result in a compile-time error. However, you will not be able to instantiate your component - even in unit tests - as the reflective access (which takes place in the super / abstract class) will always fail.

When you declare the public static Spec SPEC, you’ll want to call the appropriate SpecBuilder method which provides appropriate default arguments for your component. For example, if you are building a Stage, you should call SpecBuilder.stage(), which allows the config to include name, class, conditions, and conditionPolicy.

Lists and Objects

Validating a list / object is a bit tricky. When you declare a required / optional list or object in a Config, you can either provide:

  1. A TypeReference describing what the unwrapped List/Object should deserialize/cast to.
  2. A Spec, for a list, or a named Spec (created via SpecBuilder.parent()), for an object, describing the valid properties. (Use a Spec for a list when you need a List<Config> with specific structure. For example, Stage conditions.)

Parent / Child Validation

Some configs include properties which are objects, containing even more properties. For example, in the FileConnector, you can specify fileOptions - which includes a variety of additional arguments, like getFileContent, handleArchivedFiles, and more. This is defined in a parent Spec, created via SpecBuilder.parent(), which has a name (the key the config is held under) and has its own required/optional properties. The fileOptions parent Spec is:

SpecBuilder.parent("fileOptions")
  .optionalBoolean("getFileContent", "handleArchivedFiles", "handleCompressedFiles")
  .optionalString("moveToAfterProcessing", "moveToErrorFolder").build();

A parent Spec can be either required or optional. When the parent is present, its properties will be validated against this parent Spec.

There will be times that you can’t know what the field names would be in advance. For example, a field mapping of some kind. In this case, you should pass in a TypeReference describing what type the unwrapped ConfigObject should deserialize/cast to. For example, if you want a field mapping of Strings to Strings, you’d pass in new TypeReference<Map<String, String>>(){}. In general, you should use a Spec when you know field names, and a TypeReference when you don’t.

Why Validate?

Obviously, you will get an error if you call config.getString("field") when "field" is not specified. However, additional validation on Configs is still useful/necessary for two primary reasons:

  1. Command-line utility

We want to allow the command-line validation utility to provide a comprehensive list of a Config’s errors. As such, Lucille has to validate the config before a Stage/Connector/Indexer begins accessing properties and potentially throwing a ConfigException.

  2. Prevent typos from ruining your pipeline

A mistyped field name could have massive ripple effects throughout your pipeline. As such, each Stage/Connector/Indexer needs to have a specific set of legal Config properties, so Exceptions can be raised for unknown or unrecognized properties.

3.3.7 - Indexer

An Indexer sends processed Documents to a specific destination.

Indexers

An Indexer is a thread that retrieves processed Documents from the end of a Pipeline and sends them in batches to a specific destination. For users of Lucille, this destination will most commonly be a search engine.

Only one Indexer can be defined in a Lucille run. All pipelines will feed to the same Indexer.

Indexer configuration has two parts:

  • the generic indexer configuration

  • configuration for the specific implementation you are using. For example, if you are using Solr, you’d provide solr config, or elastic for Elasticsearch, csv for CSV, etc.

Here’s what using the SolrIndexer might look like:

# Generic indexer config
indexer {
  type: "solr"
  ignoreFields: ["city_temp"]
  batchSize: 100
}
# Specific implementation (Solr) config
solr {
  useCloudClient: true
  url: "localhost:9200"
  defaultCollection: "test_index"
}

At a minimum, indexer must contain either type or class. type is shorthand for an indexer provided by lucille-core - it can be "Solr", "OpenSearch", "ElasticSearch", or "CSV". indexer can contain a variety of additional properties as well. Some Indexers do not support certain properties, however. For example, OpenSearchIndexer and ElasticsearchIndexer do not support indexer.indexOverrideField.

The lucille-core module contains a number of commonly used indexers. Additional indexers with a large number of dependencies are provided as optional plugin modules.

Lucille Indexers (Core)

Lucille Indexers (Plugins)

3.3.8 - Stages

A Stage performs a specific transformation on a Document.

Lucille Stages

Stages are the building blocks of a Lucille pipeline. Each Stage performs a specific transformation on a Document.

Lucille Stages should have JavaDocs that describe their purpose and the parameters acceptable in their Config. On this site, you’ll find more in-depth documentation for some more advanced / complex Lucille Stages.

To configure a stage, you have to provide its class (under class) in its config. You can also specify a name for the Stage, as well as conditions and a conditionPolicy (described below).

You’ll also provide any parameters needed by the Stage. For example, the AddRandomBoolean Stage accepts two optional parameters - field_name and percent_true. So, an AddRandomBoolean Config would look something like this:

{
  name: "AddRandomBoolean-First"
  class: "com.kmwllc.lucille.stage.AddRandomBoolean"
  field_name: "rand_bool_1"
  percent_true: 65
}

Conditions

For any Stage, you can specify “conditions” in its Config, controlling when the Stage will process a Document. Each condition has a required parameter, fields, and two optional parameters, operator and values.

  • fields is a list of field names that will determine whether the Stage applies to a Document.

  • values is a list of values that the conditional fields will be searched for. (If not specified, only the existence of fields is checked.)

  • operator is either "must" or "must_not" (defaults to "must").

In the root of the Stage’s Config, you can also specify a conditionPolicy - either "any" or "all", specifying whether any or all of your conditions must be met for the Stage to process a Document. (Defaults to "any".)

Let’s say we are running the Print Stage, but we only want it to execute on a Document where city = Boston or city = New York. Our Config for this Stage would look something like this:

{
  name: "print-1"
  class: "com.kmwllc.lucille.stage.Print"
  conditions: [
    {
      fields: ["city"]
      values: ["Boston", "New York"]
    }
  ]
}

3.3.8.1 - EmbeddedPython

Run a document through a Java embedded Graal Python environment.

Why Use It?

EmbeddedPython executes per-document Python code inside the Lucille JVM using GraalPy. Instead of returning a JSON object, your script mutates the current document directly through a Python-friendly proxy bound as doc (and the raw Java document as rawDoc). This avoids ports, subprocesses, venvs, and per-document JSON round trips.

When To Use It

Use EmbeddedPython when one or more of the following apply:

  • You want minimal operational overhead (no ports, subprocess lifecycle, venv creation, or pip installs).
  • You do not need external Python libraries or native dependencies that require a real Python environment.
  • You only need lightweight field enrichment/transformation.

When To Use ExternalPython Instead

Avoid EmbeddedPython and use ExternalPython when you need one or more of the following:

  • Real Python compatibility (including packages with native dependencies).
  • Dependency management via a requirements.txt installed into a managed venv.
  • Process isolation apart from the JVM.

Example

Input Document

{
  "id": "doc-1",
  "title": "Hello",
  "author": "Test",
  "views": 123
}

Python Script

doc["title"] = doc["title"].upper()

Output Document

{
  "id": "doc-1",
  "title": "HELLO",
  "author": "Test",
  "views": 123
}

Config Parameters

{
 name: "EmbeddedPython-Example"
 class: "com.kmwllc.lucille.stage.EmbeddedPython"

 # Specify exactly one of the following:
 script_path: "/path/to/my_script.py"
 script: "doc['title'] = doc['title'].upper()"
}

3.3.8.2 - ExternalPython

Run a document through an external Py4J Python environment.

Why Use It?

ExternalPython delegates per-document processing to an external Python process using Py4J. Lucille serializes the Document into a request, calls a Python function, receives a JSON response, and applies that response back onto the document.

When To Use It

Use ExternalPython when you need one or more of the following:

  • Real Python compatibility (including packages with native dependencies).
  • Dependency management via a requirements.txt installed into a managed venv.
  • Process isolation apart from the JVM.

When To Use EmbeddedPython Instead

Avoid ExternalPython and use EmbeddedPython when one or more of the following apply:

  • You want minimal operational overhead (no ports, subprocess lifecycle, venv creation, or pip installs).
  • You do not need external Python libraries or native dependencies that require a real Python environment.
  • You only need lightweight field enrichment/transformation.

Restrictions

Your Python file must be located in one of the following directories, relative to the current working directory from which Lucille is run:

  • ./python
  • ./src/main/resources
  • ./src/test/resources
  • ./src/test/resources/ExternalPythonTest (for testing)

Example

Input Document

{
  "id": "doc-1",
  "title": "Hello",
  "author": "Test",
  "views": 123
}

Python Script

def process_document(doc):
    title = doc["title"]
  
    return {
        "title": title.upper()
    }

Python Returns

{
  "title": "HELLO"
}

Output Document

{
  "id": "doc-1",
  "title": "HELLO"
}

Config Parameters

{
  name: "ExternalPython-Example"
  class: "com.kmwllc.lucille.stage.ExternalPython"

  scriptPath: "/path/to/my_script.py"

  # Optional
  pythonExecutable: "python3"
  requirementsPath: "/path/to/requirements.txt"
  functionName: "process_document"
  port: 25333
}

Example (NumPy)

Input Document

{
  "id": "doc-2",
  "values": [1, 2, 3, 4, 5]
}

Python Script

import numpy as np

def process_document(doc):
    arr = np.array(doc["values"], dtype=float)
  
    return {
        "values": doc["values"],
        "mean": float(np.mean(arr)),
        "stddev": float(np.std(arr))
    }

Output Document

{
  "id": "doc-2",
  "values": [1, 2, 3, 4, 5],
  "mean": 3.0,
  "stddev": 1.41
}

requirements.txt

numpy

Config Parameters

{
  name: "ExternalPython-Numpy"
  class: "com.kmwllc.lucille.stage.ExternalPython"

  scriptPath: "/path/to/my_numpy_script.py"
  requirementsPath: "/path/to/requirements.txt"
  
  # Optional
  pythonExecutable: "python3"
  functionName: "process_document"
  port: 25333
}

3.3.8.3 - PromptOllama

Connect to Ollama Server and send a Document to an LLM for enrichment.

What if you could just, actually, put an LLM on everything?

Ollama

Ollama allows you to run a variety of Large Language Models (LLMs) with minimal setup. You can also create custom models using Modelfiles and system prompts.

The PromptOllama Stage allows you to connect to a running instance of Ollama Server, which communicates with an LLM through a simple API. The Stage sends part (or all) of a Document to the LLM for generic enrichment. You’ll want to create a custom model (with a Modelfile) or provide a System Prompt in the Stage Config that is tailored to your pipeline.

We strongly recommend having the LLM output only a JSON object, for two main reasons: first, LLMs tend to follow their instructions more reliably when told to produce only JSON; second, Lucille can then parse the JSON response and fully integrate it into your Document.

Example

Let’s say you are working with Documents which represent emails, and you want to monitor them for potential signs of fraud. Lucille doesn’t have a DetectFraud Stage (at time of writing), but you can use PromptOllama to add this information with an LLM.

  • Modelfile: Let’s say you created a custom model, fraud_detector, in your instance of Ollama Server. As part of the modelfile, you instruct the model to check the contents for fraud and output a JSON object containing just a boolean value (under fraud). Your Stage would be configured like so:
{
  name: "Ollama-Fraud"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:9200"
  modelName: "fraud_detector"
  fields: ["email_text"]
}
  • System Prompt: You can also just reference a specific LLM directly, and provide a system prompt in the Stage configuration.
{
  name: "Ollama-Fraud"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:9200"
  modelName: "gemma3"
  systemPrompt: "You are to read the text inside \"email_text\" and output a JSON object containing only one field, fraud, a boolean, representing whether the text contains evidence of fraud or not."
  fields: "email_text"
}

Regardless of the approach you choose, the LLM will receive a request that looks like this:

{
  "email_text": "Let's be sure to juice the numbers in our next quarterly earnings report."
}

(Since fields: ["email_text"], any other fields on this Document are not part of the request.)

And the response from the LLM should look like this:

{
  "fraud": true
}

Lucille will then add all key-value pairs in this response JSON into your Document. So, the Document will become:

{
  "id": "emails.csv-85",
  "run-id": "f9538992-5900-459a-90ce-2e8e1a85695c",
  "email_text": "Let's be sure to juice the numbers in our next quarterly earnings report.",
  "fraud": true
}

As you can see, PromptOllama is very versatile, and can be used to enrich your Documents in a lot of ways.

3.3.8.4 - QueryOpensearch

Execute an OpenSearch Template using information from a Document, and add the response to it.

OpenSearch Templates

You can use templates in OpenSearch to repeatedly run a certain query using different parameters. For example, if we have an index full of parks, and we want to search for a certain park, we might use a template like this:

{
  "source": {
    "query": {
      "match_phrase": {
        "park_name": "{{park_to_search}}"
      }
    }
  }
}

In OpenSearch, you could then call this template (providing it park_to_search) instead of writing out the full query each time you want to search.

Templates can also have default values. For example, if you want park_to_search to default to “Central Park” when a value is not provided, it would be written as: "park_name": "{{park_to_search}}{{^park_to_search}}Central Park{{/park_to_search}}"

QueryOpensearch Stage

The QueryOpensearch Stage executes a search template, using certain fields from a Document as the parameters, and adds OpenSearch’s response to the Document. You’ll specify either templateName, the name of a search template you’ve saved, or searchTemplate, the template you want to execute, in your Config.

You’ll also need to specify the names of parameters in your search template. These will need to match the names of fields on your Documents. If your names don’t match, you can use the RenameFields Stage first.

In particular, you have to specify which parameters are required and which are optional. If a required name in requiredParamNames is missing from a Document, an Exception will be thrown, and the template will not be executed. If an optional name in optionalParamNames is missing, it (naturally) won’t be part of the template execution, so OpenSearch will use the default value.

If a parameter without a default value is missing, OpenSearch doesn’t throw an Exception - it just returns an empty response with zero hits. So, it is very important that requiredParamNames and optionalParamNames are defined very carefully!
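
As a sketch, a config for the parks example above might look like the following. The class name assumes the usual com.kmwllc.lucille.stage naming, park_search is assumed to be a template you have already saved, and connection settings for your OpenSearch instance are omitted:

{
  name: "query-parks"
  class: "com.kmwllc.lucille.stage.QueryOpensearch"
  templateName: "park_search"
  requiredParamNames: ["park_to_search"]
}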

3.3.9 - File Handlers

File Handlers extract Lucille Documents from individual files, like CSV or JSON files, which themselves contain data which can be transformed into Lucille Documents.

File Handlers accept an InputStream for processing, and return the Documents they extract in an Iterator. The provided InputStream and any other underlying resources are closed when the Iterator returns false for hasNext(). As such, when working directly with these File Handlers, it is important to exhaust the Iterators they return.

File Handlers (Core)

Custom File Handlers

Developers can implement and use custom File Handlers as needed. Extend BaseFileHandler to get started. To use a custom FileHandler, you have to reference its class in its Config. This is not needed when using the File Handlers provided by Lucille. You can override the File Handlers provided by Lucille, as well - just include the class you want to use in the Config.

3.3.10 - Connectors

A component that retrieves data from a source system and packages the data into Documents in preparation for transformation.

Lucille Connectors

Lucille Connectors are components that retrieve data from a source system, package the data into “Documents”, and publish them to a pipeline.

To configure a Connector, you have to provide its class (under class) in its config. You also need to specify a name for the Connector. Optionally, you can specify the pipeline, a docIdPrefix, and whether the Connector requires a Publisher to collapse.

You’ll also provide any parameters needed by the Connector. For example, the SequenceConnector requires one parameter, numDocs, and accepts an optional parameter, startWith. So, a SequenceConnector Config would look something like this:

{
  name: "Sequence-Connector-1"
  class: "com.kmwllc.lucille.connector.SequenceConnector"
  docIdPrefix: "sequence-connector-1-"
  pipeline: "pipeline1"
  numDocs: 500
  startWith: 50
}

The lucille-core module contains a number of commonly used connectors. Additional connectors with a large number of dependencies are provided as optional plugin modules.

Lucille Connectors (Core)

The following connectors are deprecated. Use FileConnector instead, along with a corresponding FileHandler.

Lucille Connectors (Plugins)

3.3.10.1 - RSS Connector

A Connector that publishes Documents representing items found in an RSS feed.

The RSSConnector

The RSSConnector allows you to publish Documents representing the items found in an RSS feed of your choice. Each Document will (optionally) contain fields from the RSS items, like the author, description, title, etc. By default, the Document IDs will be the item’s guid, which should be a unique identifier for the RSS item.

You can configure the RSSConnector to only publish recent RSS items, based on the pubDate found on the items. Also, it can run incrementally, refreshing the RSS feed after a certain amount of time until you manually stop it. The RSSConnector will avoid publishing Documents for the same RSS item more than once.

The Documents published may have any of the following fields, depending on how the RSS feed is structured:

  • author (String)
  • categories (List<String>)
  • comments (List<String>)
  • content (String)
  • description (String)
  • enclosures (List<JsonNode>). Each JsonNode contains:
    • type (String)
    • url (String)
    • May contain length (Long)
  • guid (String)
  • isPermaLink (Boolean)
  • link (String)
  • title (String)
  • pubDate (Instant)

3.3.10.2 - File Connector

A Connector that, given a path to S3, Azure, Google Cloud, or the local file system, traverses the content at the given path and publishes Lucille documents representing its findings.

Source Code

The file connector traverses a file system and publishes Lucille documents representing its findings. In your configuration, specify pathsToStorage, representing the path(s) you want to traverse. Each path can be a path on the local file system or a URI for a supported cloud provider.

Working with Cloud Storage

When you provide the FileConnector with URIs to cloud storage, you also need to supply the appropriate configuration for each cloud provider used. For each provider, you’ll need to provide a form of authentication; you can optionally specify the maximum number of files (maxNumOfPages) that Lucille will load into memory for a given request. An example S3 configuration follows the list below.

  • Azure: Specify the needed options in azure in your Config. You must provide connectionString, or you must provide accountName and accountKey.
  • Google: Specify the needed options in gcp in your Config. You must provide pathToServiceKey.
  • S3: Specify the needed options in s3 in your Config. You must provide accessKeyId, secretAccessKey, and region. For URIs, pathsToStorage must be percent-encoded for special characters (e.g., s3://test/folder%20with%20spaces).
  • For each of these providers, in their configuration, you can optionally include maxNumOfPages as well.
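
A hedged sketch of a FileConnector config that traverses an S3 path (the FileConnector class name, bucket, and credential values are placeholders; consult the FileConnector docs for the exact placement of the s3 block):

{
  name: "s3-connector"
  class: "com.kmwllc.lucille.connector.FileConnector"
  pipeline: "pipeline1"
  pathsToStorage: ["s3://my-bucket/data/"]
  s3 {
    accessKeyId: "MY_ACCESS_KEY_ID"
    secretAccessKey: "MY_SECRET_ACCESS_KEY"
    region: "us-east-1"
    maxNumOfPages: 100
  }
}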

Applying FileHandlers

Some of the files that your FileConnector encounters will, themselves, contain data that you want to extract more documents from! For example, the FileConnector may encounter a .csv file, where each row itself represents a Document to be published. This is where FileHandlers come in - they will individually process these files and create more Lucille documents from their data. See File Handlers for more.

In order to use File Handlers, you need to specify the appropriate configuration within your Config - specifically, each File Handler you want to use will be a map within the fileOptions you specify. You can use csv, json, or xml. See the documentation for each File Handler to see what arguments are needed / accepted.

File Options

File options determine how you handle and process files you encounter during a traversal. Some commonly used options include:

  • getFileContent: Whether, during traversal, the FileConnector should add an array of bytes representing the file’s contents to the Lucille document it publishes. This will slow down traversal significantly and is resource intensive. On the cloud, this will download the file contents.
  • handleArchivedFiles/handleCompressedFiles: Whether you want to handle archive or compressed files, respectively, during your traversal. For cloud files, this will download the file’s contents.
  • moveToAfterProcessing: A path to move files to after processing.
  • moveToErrorFolder: A path to move files to if an error occurs.

Filter Options

Filter options determine which files will or won’t be processed & published in your traversal. All filter options are optional. If you specify multiple filter options, files must comply with all of them to be processed & published. A short example follows the list below.

  • includes: A list of patterns for the only file names that you want to include in your traversal.
  • excludes: A list of patterns for file names that you want to exclude from your traversal.
  • lastModifiedCutoff: Filter out files that haven’t been modified recently. For example, specify "1h", and only files modified within the last hour will be processed & published.
  • lastPublishedCutoff: Filter out files that were recently published by Lucille. For example, specify "1h", and only files published by Lucille more than an hour ago (or never published) will be processed & published. Requires you to provide state configuration; otherwise, it will not be enforced!
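
A small sketch of filter options (the filterOptions key and the regex-style include pattern are assumptions based on the FilterOptions reference in the State section below):

filterOptions {
  includes: [".*\\.csv$"]
  lastModifiedCutoff: "1h"
}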

State

The File Connector can keep track of when files were last known to be published by Lucille. This allows you to use FilterOptions.lastPublishedCutoff and avoid repeatedly publishing the same files in a short period of time.

In order to use state with the File Connector, you’ll need to configure a connection to a JDBC-compatible database. The database can be embedded, or it can be remote.

It’s important to note that File Connector state is designed to be efficient and lightweight. As such, keep a few points in mind:

  1. Files that were recently moved or renamed will not have lastPublishedCutoff applied.
  2. In your File Connector configuration, it is important that you consistently capitalize directory names in your pathsToStorage if you are using state.
  3. Each database table should be used for only one connector configuration.

3.3.11 - Pipeline

The end-to-end sequence of stages that transform Documents.

A Pipeline refers to the complete sequence of stages that can be configured to transform documents with Lucille.

In the most common architecture, a Pipeline is fed by a Connector and emits documents to an Indexer.

4 - Contribution Guidelines

How to develop new features and coding standards.

4.1 - Setup & Standards

Coding standards for Lucille and how to set up for local development.

Local Developer Setup

Prerequisite(s):

  • IntelliJ application installed on machine
  • Java project

Setting up Google Code Formatting Scheme

  • Make sure that Intellij is open
  • Go to the following link: styleguide/intellij-java-google-style.xml at gh-pages · google/styleguide
  • Download the .xml file
  • Open the file in an editor of your choice
  • Navigate to the <option …> tag with the name ‘Right Margin’ and edit the value to be 132 (the default is 100)
  • Save the file
  • In IntelliJ IDEA, navigate to Settings/Preferences → Editor → Code Style → Java
  • Click on the gear icon in the right panel and drill down to Import Scheme and then IntelliJ IDEA Code Style XML
  • In the file explorer that opens, navigate to where you stored the .xml file you downloaded
  • After selecting the file, you should see a pop-up allowing you to name the scheme; choose a name and click ‘OK’
  • Click ‘Apply’ in the Settings panel
  • Restart the IDE; you can then use the ‘Reformat Code’ option to apply the scheme to your code

Excluding Non-Java Files

Assuming that we don’t want to auto-format non-Java files via a directory-level ‘Reformat Code’ option, we need to exclude all other file types from being reformatted:

  • Navigate to Settings | Preferences in Intellij IDEA

  • Navigate to Editor → Code Style

  • Click on the tab on the right window labeled ‘Formatter’

  • In the ‘Do Not Format’ text box, paste the following and click ‘Apply'

    *.{yml,xml,md,json,yaml,jsonl,sql}

  • A restart of Intellij may be required to see changes

This method may prove too complicated, especially when new file types are added to the codebase. Therefore, consider the following simpler method instead:

  • When clicking on ‘Reformat Code’ at the directory level, a window will pop up
  • Under the filter sections in the window, select the ‘File Mask(s)’ option and set the value to ‘*.java’
  • This will include only .java files in your reformatting

Eclipse Users

Eclipse import conf .xml files

The linked post details how Eclipse users can use the same .xml file for their code formatting in the Eclipse IDE.

4.2 - Developing New Components

The basics of how to develop Connectors, Stages, and Indexers for Lucille.

Introduction

This guide covers the basics of how to develop new components for Lucille, along with an understanding of required and optional components. After reading this, you should be able to start development with a good foundation in how testing works, how configuration is handled, and what features the base classes afford.

Prerequisites

  • An up-to-date version of Lucille that has been appropriately installed

  • Understanding of Java programming language

  • Understanding of Lucille

Project and Package Layout

You can add Stages, Connectors, and Indexers to Lucille in many ways. All approaches require you to reference the component’s fully qualified class name (class = "...") in the config.

  1. Contribute to lucille-core (PR to core)

    • Stages: lucille/lucille-core/src/main/java/com/kmwllc/lucille/stage/
    • Connectors: lucille/lucille-core/src/main/java/com/kmwllc/lucille/connector/
    • Indexers: lucille/lucille-core/src/main/java/com/kmwllc/lucille/indexer/
    • Tests/resources mirror these under src/test/java and src/test/resources.
  2. Create a Lucille plugin (PR to lucille repo under lucille-plugins)

    • Example layout:

      lucille-plugins/
        my-plugin/
          src/main/java/com/kmwllc/lucille/my-plugin/stage/...
          src/main/java/com/name/lucille/my-plugin/connector/...
          src/main/java/com/name/lucille/my-plugin/indexer/...
          src/test/java/...
          src/test/resources/...
          pom.xml

  3. Use your own local code

    • Put classes anywhere in your own package; e.g., com.name.ingest.MyStage.
    • Requirement: Ensure the compiled JAR is on the classpath when running Lucille and reference the fully qualified class name in your config, as shown below.
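
For example, appending a hypothetical my-components.jar to the classpath of the local-mode command from the Getting Started guide:

java \
  -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
  -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*:/path/to/my-components.jar' \
  com.kmwllc.lucille.core.Runner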

Developing Stages

Stage Skeleton

Every Stage must expose a static SPEC that declares its config schema. Use SpecBuilder to define required/optional fields, lists, parents, and types. The base class consumes this to validate user config at load time. See the configuration docs for information on specs.

Every stage must follow the Javadoc Standards.

package com.kmwllc.lucille.stage;

import com.kmwllc.lucille.core.Stage;
import com.kmwllc.lucille.core.StageException;
import com.kmwllc.lucille.core.spec.Spec;
import com.kmwllc.lucille.core.spec.SpecBuilder;
import com.kmwllc.lucille.core.ConfigUtils;
import com.typesafe.config.Config;
import java.util.Iterator;
import com.kmwllc.lucille.core.Document;

/**
 * One‑line summary.
 * <p>
 * Config Parameters -
 * <ul>
 *   <li>foo (String, Required) : Description.</li>
 *   <li>bar (Integer, Optional) : Description. Defaults to 10.</li>
 * </ul>
 */
public class ExampleStage extends Stage {
  
  public static final Spec SPEC = SpecBuilder.stage()
      .requiredString("foo")
      .optionalNumber("bar")
      .build();
  
  private final String foo;
  private final int bar;

  public ExampleStage(Config config) throws StageException {
    super(config);
    this.foo = config.getString("foo");
    this.bar = ConfigUtils.getOrDefault(config, "bar", 10);
  }

  @Override
  public Iterator<Document> processDocument(Document doc) throws StageException {
    // mutate doc as needed
    doc.setField("out", foo + ":" + bar);
    // return null unless emitting child docs
    return null;
  }
}

Lifecycle Methods

  • start() for allocating resources and precomputing data structures.
  • processDocument(Document doc) for transforming the current document and (optionally) returning child docs.
  • stop() for releasing resources on shutdown.

Reading & Writing Fields

Lucille’s Document API supports single-valued and multi-valued fields with strong typing and convenience updaters.

Supported types: String, Boolean, Integer, Double, Float, Long, Instant, byte[], JsonNode, Timestamp, Date.

Getting Values

  • Single Value: getString(name), getInt(name), etc.
  • Lists: getStringList(name), getIntList(name), etc.
  • Nested JSON: getNestedJson("a.b[2].c") or getNestedJson(List<Segment>).

Writing Values

  • Overwrite (single-valued): setField(name, value) replaces any existing values and makes the field single valued.
  • Append (multi-valued): addToField(name, value) converts to a list if needed and appends.
  • Create or append: setOrAdd(name, value) creates as single-valued if missing, otherwise appends.

Updating Values

Use update(name, mode, values...):

  • OVERWRITE: first value overwrites, the rest append
  • APPEND: all values append
  • SKIP: no‑op if the field already exists

Nested JSON (Objects & Arrays)

  • Set: setNestedJson("a.b[2].c", jsonNode) or setNestedJson(List<Segment>, jsonNode).
  • Remove: removeNestedJson("a.b[2].c") removes the last segment from its parent.
  • Segments: Document.Segment.parse("a.b[2].c") and Document.Segment.stringify(segments) help convert between string paths and structured paths.
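
The snippet below pulls a few of these calls together. It is a minimal sketch based only on the method names above; check the Javadocs for exact signatures.

import com.kmwllc.lucille.core.Document;
import java.util.List;

public class DocumentApiExample {

  public static void main(String[] args) {
    Document doc = Document.create("doc-1");

    doc.setField("title", "hello");   // single-valued; replaces any existing values
    doc.addToField("tags", "alpha");  // converts to a list if needed and appends
    doc.addToField("tags", "beta");
    doc.setOrAdd("tags", "gamma");    // "tags" already exists, so this appends

    String title = doc.getString("title");          // "hello"
    List<String> tags = doc.getStringList("tags");  // ["alpha", "beta", "gamma"]

    System.out.println(title + " " + tags);
  }
}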

Unit Testing

See the Testing Standards.

Developing Connectors

Connector Skeleton

Every Connector must expose a static SPEC that declares its config schema. Use SpecBuilder to define required/optional fields, lists, parents, and types. The base class consumes this to validate user config at load time. See the configuration docs for information on specs.

Every connector must follow the Javadoc Standards.

package com.kmwllc.lucille.connector;

import com.kmwllc.lucille.core.ConnectorException;
import com.kmwllc.lucille.core.Document;
import com.kmwllc.lucille.core.Publisher;
import com.kmwllc.lucille.core.spec.Spec;
import com.kmwllc.lucille.core.spec.SpecBuilder;
import com.typesafe.config.Config;

/**
* One-line summary of what this Connector reads and how it emits Documents.
* <p>
* Config Parameters -
* <ul>
*   <li>sourceUri (String, Required) : Where to read from (file://, s3://, http://, etc.).</li>
*   <li>batchSize (Integer, Optional) : Max items to read before publishing a batch. Defaults to 100.</li>
* </ul>
*/
public class ExampleConnector extends AbstractConnector {

  public static final Spec SPEC = SpecBuilder.connector()
      .requiredString("sourceUri")
      .optionalNumber("batchSize")
      .build();

  private final String sourceUri;
  private final int batchSize;

  public ExampleConnector(Config config) {
    super(config);
    this.sourceUri = config.getString("sourceUri");
    this.batchSize = config.hasPath("batchSize") ? config.getInt("batchSize") : 100;
  }
  
  @Override
  public void execute(Publisher publisher) throws ConnectorException {
    // Read from sourceUri and publish Documents.
    for (int i = 0; i < batchSize; i++) {
      Document d = Document.create(createDocId("item-" + i));
      // Populate fields on d as needed, e.g.: d.setField("source_uri", sourceUri);
      try {
        publisher.publish(d);
      } catch (Exception e) {
        throw new ConnectorException("Failed to publish document " + d.getId(), e);
      }
    }
  }

  @Override
  public void close() throws ConnectorException {
    // Optional: Close network or file handlers.
  }
}

Lifecycle & Behavior Tips

  • preExecute(runId) for preparing external connections.
  • execute(publisher) for reading from your source and calling publisher.publish(doc) for each Document.
  • postExecute(runId) for optional cleanup or follow-up actions after execute completes successfully.
  • close() for releasing resources.

Unit Testing

See the Testing Standards.

Developing Indexers

Indexer Skeleton

Every Indexer must expose a static SPEC that declares its config schema. Use SpecBuilder to define required/optional fields, lists, parents, and types. The base class consumes this to validate user config at load time. See the configuration docs for information on specs.

Every indexer must follow the Javadoc Standards.

package com.kmwllc.lucille.indexer;

import com.kmwllc.lucille.core.Document;
import com.kmwllc.lucille.core.Indexer;
import com.kmwllc.lucille.core.ConfigUtils;
import com.kmwllc.lucille.core.spec.Spec;
import com.kmwllc.lucille.core.spec.SpecBuilder;
import com.kmwllc.lucille.message.IndexerMessenger;
import com.typesafe.config.Config;
import java.util.List;
import java.util.Set;
import org.apache.commons.lang3.tuple.Pair;

/**
 * One-line summary of what this Indexer does and where it sends documents. Additional details may go here as needed.
 * <p>
 * Config Parameters -
 * <ul>
 *   <li>url (String, Required) : Destination endpoint (e.g., base URL).</li>
 *   <li>index (String, Optional) : Default index/collection name. Defaults to "index1".</li>
 *   <li>batchSize (Integer, Optional) : Max docs per request. Defaults to 100.</li>
 * </ul>
 */
public class ExampleIndexer extends Indexer {
  
  public static final Spec SPEC = SpecBuilder.indexer()
      .requiredString("url")
      .optionalString("index")
      .optionalNumber("batchSize")
      .build();
  
  private final String url;
  private final String defaultIndex;
  private final int batchSize;
  
  public ExampleIndexer(Config config, IndexerMessenger messenger, String metricsPrefix, String localRunId) {
    super(config, messenger, metricsPrefix, localRunId);
    this.url = config.getString("url");
    this.defaultIndex = ConfigUtils.getOrDefault(config, "index", "index1");
    this.batchSize = ConfigUtils.getOrDefault(config, "batchSize", 100);
  }

  @Override
  public boolean validateConnection() {
    // Health check to the destination
    return true;
  }

  @Override
  protected Set<Pair<Document, String>> sendToIndex(List<Document> documents) throws Exception {
    // Send the batch using your destination client
    // Return any failed docs as pairs of (Document, reason)
    return Set.of();
  }

  @Override
  public void closeConnection() {
    // Close client resources
  }
}

Lifecycle Methods

  • validateConnection() for a quick destination availability check before starting the main loop.
  • sendToIndex(List<Document> docs) to perform a write and return any per-document failures (see the sketch after this list).
  • closeConnection() for releasing resources on shutdown.
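
As an illustration, sendToIndex might be implemented roughly as follows. The client field and its index call are hypothetical placeholders for your destination's SDK; a real indexer would usually send the whole batch in a single bulk request:

  @Override
  protected Set<Pair<Document, String>> sendToIndex(List<Document> documents) throws Exception {
    // Requires a java.util.HashSet import in addition to those shown above.
    Set<Pair<Document, String>> failed = new HashSet<>();
    for (Document doc : documents) {
      try {
        // Hypothetical client call - replace with your destination's index/bulk API.
        client.index(defaultIndex, doc.getId(), doc);
      } catch (Exception e) {
        // Report per-document failures instead of failing the whole batch.
        failed.add(Pair.of(doc, e.getMessage()));
      }
    }
    return failed;
  }

Returning an empty set signals that every document in the batch was indexed successfully.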

Unit Testing

See the Testing Standards.

Testing Standards

Test Layout & Naming

  • One test class per component (e.g., MyStageTest).
  • Group related assertions into focused test methods with descriptive names.
  • Place configs under a matching resources folder.

Locations

  • Tests:
lucille/lucille-core/src/test/java/com/kmwllc/lucille/<stage || indexer || connector>/
  • Per-test resources:
lucille/lucille-core/src/test/resources/<StageName || IndexerName || ConnectorName>Test/

General Testing Guidelines

  • Maximize coverage: Aim to cover as many branches, error paths, and edge cases as practical.
  • Fast and offline: No network or external services. Use mocks/spies only.
  • Exercise every parameter: Ensure each parameter is covered by at least one test path.
  • Test failures: Ensure bad configs, exceptions, empty inputs, etc. are tested.
  • Assert behavior: Prefer testing state/interactions over log output.
  • Time: Avoid sleeps to prevent longer test runs.
  • Configuration clarity: Aim to make test configuration explicit and readable. Use inline config factories, descriptive config names, and inline scripts where applicable (see the sketch after this list).
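
For example, a focused test for the ExampleConnector skeleton above might look roughly like this. It assumes JUnit 4 and Mockito are on the test classpath; the required name key and other setup details are assumptions, so mirror existing tests in the repository for the exact conventions:

package com.kmwllc.lucille.connector;

import static org.mockito.Mockito.*;

import com.kmwllc.lucille.core.Document;
import com.kmwllc.lucille.core.Publisher;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.util.Map;
import org.junit.Test;

public class ExampleConnectorTest {

  @Test
  public void testPublishesConfiguredNumberOfDocs() throws Exception {
    // Inline config factory keeps the test configuration explicit and readable.
    Config config = ConfigFactory.parseMap(Map.of(
        "name", "connector1",
        "sourceUri", "file:///tmp/example",
        "batchSize", 3));

    Publisher publisher = mock(Publisher.class);

    ExampleConnector connector = new ExampleConnector(config);
    connector.execute(publisher);

    // Assert behavior: exactly three documents were published.
    verify(publisher, times(3)).publish(any(Document.class));
  }
}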

JaCoCo Coverage Report

  • Run: mvn clean install.
  • Open report: lucille-core/target/jacoco-ut/index.html.
  • Interpretation: Summarizes test coverage across packages and classes, highlighting covered and missed lines and branches so you can see at a glance what executed during the test run.

Javadoc Standards

Lucille includes a small internal parser used during documentation builds to extract class-level Javadoc from components (Connectors, Stages, Indexers) and render their config fields in the UI. It runs as part of the docs generation tooling, not at runtime, and expects the exact formatting described below so the parameters can be displayed correctly. For reference, see the parser implementation.

Rules:

  • Put a clear description before the <p> tag (it can be multi-sentence).
  • After <p>, include the literal heading Config Parameters - and a <ul> list.
  • Each item must follow the form: name (Type, Required | Optional) : Description.
    • Use exact casing.
    • Escape generics (e.g., write List&lt;String&gt; for List<String>).
  • Don’t add extra blank lines, and keep punctuation consistent.

Template:

/**
 * Description of what this stage/connector/indexer does. This text can span
 * multiple sentences and be as long as you want as long as it appears before <p>.
 * <p>
 * Config Parameters -
 * <ul>
 *   <li>paramA (String, Required) : Example description.</li>
 *   <li>paramB (Integer, Optional) : Example description.</li>
 *   <li>flags (List&lt;String&gt;, Optional) : Example description.</li>
 *   <li>options (Map&lt;String, Object&gt;, Optional) : Example description.</li>
 * </ul>
 */

5 - Running Lucille in Production

Considering best practices with Lucille for deployments, logging, monitoring and more.

5.1 - Logging

Interpreting and using Lucille logs.

Lucille Logs

There are many ways you can use the logs output by Lucille. Lucille has two main loggers for tracking your Lucille run: the Root logger and the DocLogger.

The Root Logger

The Root logger outputs log statements from a variety of sources, allowing you to track your Lucille run. For example, the Root logger is where you’ll get intermittent updates about your pipeline’s performance, warnings from Stages or Indexers in certain situations, etc.

The Doc Logger

The DocLogger is very verbose - it tracks the lifecycle of each Document in your Lucille pipeline. For example, a log statement is made when a Document is created, before it is published, before and after a Stage operates on it, and so on. As you can imagine, this results in many log statements - it is recommended that these logs be written to a file rather than just printed to the console. Logs from the DocLogger will primarily be INFO-level logs; very rarely, an ERROR-level log will be made for a Document.

Log Files

Lucille can store logs in a file as plain text or as JSON objects. When storing logs as JSON, each line will be a JSON object representing a log statement in accordance with the EcsLayout. By modifying the log4j2.xml, you can control which Loggers are enabled/disabled, where their logs get stored, and what level of logs you want to process.
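
As an illustration only, a log4j2.xml along these lines would route logs to a JSON file using the EcsLayout. The logger and appender names here are assumptions - refer to the log4j2.xml that ships with Lucille for the actual logger names and appender configuration:

<Configuration>
  <Appenders>
    <!-- JSON log file; each line is one log event in ECS format. -->
    <File name="JsonFile" fileName="log/com.kmwllc.lucille-json.log">
      <EcsLayout/>
    </File>
  </Appenders>
  <Loggers>
    <!-- Hypothetical name for the Document lifecycle logger. -->
    <Logger name="com.kmwllc.lucille.core.DocLogger" level="INFO" additivity="false">
      <AppenderRef ref="JsonFile"/>
    </Logger>
    <Root level="INFO">
      <AppenderRef ref="JsonFile"/>
    </Root>
  </Loggers>
</Configuration>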

Logstash & OpenSearch

If you store your logs as JSON, you can easily run Logstash on the file(s), allowing you to index them into a Search Engine of your choice for enhanced discovery and analysis. This can be particularly informative when working with the DocLogger. For example, you might:

  • Trace a specific Document’s lifecycle by querying on the Document’s ID.
  • Track the performance of your Lucille pipeline and identify potential bottlenecks using the log timestamps.
  • Create Dashboards that let you monitor repeated Lucille runs for warnings and errors.

Here is an example Logstash pipeline.conf for ingesting your Lucille logs into a local OpenSearch instance:

input {
  file {
    path => "/lucille/lucille-examples/lucille-simple-csv-solr-example/log/com.kmwllc.lucille-json*"
    mode => "read"
    codec => "json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    exit_after_read => "true"
  }
}
output {
  stdout {
    codec => rubydebug
  }
  opensearch {
    hosts => "http://localhost:9200"
    index => "logs"
    ssl_certificate_verification => false
    ssl => false
  }
}

Note that this pipeline will delete the log files after they are ingested, and that SSL is disabled.

Here are some queries you might run (using curl):

curl -XGET "http://localhost:9200/logs/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {
      "id.keyword": "songs.csv-1"
    }
  }
}'

This query will only return log statements where the id of the Document being processed is “songs.csv-1”, a Document ID from the Lucille Simple CSV example. This allows you to easily track the lifecycle of the Document as it was processed and published.

curl -XGET "http://localhost:9200/logs/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "match": {
      "message": "FileHandler"        
    }
  }
}'

This query will return log statements with “FileHandler” in the message. This allows you to track specifically when Documents were created from a JSON, CSV, or XML FileHandler.
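
You can also filter on the log level itself. For example, the following pulls WARN-level statements (this relies on the log.level field produced by the EcsLayout, the same field used in the Dashboards examples below):

curl -XGET "http://localhost:9200/logs/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "match": {
      "log.level": "WARN"
    }
  }
}'

The same query with “ERROR” surfaces any errors from the run.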

OpenSearch Dashboards

By filling an OpenSearch Index with your JSON log statements, you can also build OpenSearch Dashboards to monitor your Lucille runs and analyze their performance.

Setting up the Index Pattern

Once you have OpenSearch Dashboards open, click on the menu in the top left. Select Dashboards Management, and then, on this new page, select Index Patterns. Create an Index Pattern for your logs index. As you click through, be sure to choose @timestamp in the dropdown for “Time Field”.

(Timestamps are very important in OpenSearch Dashboards. Many features will use the timestamp of a Document in some form.)

Discovery

OpenSearch Dashboards has two major “features” - Dashboards and Discover. We’ll start with Discover.

At the top right of the screen, you’ll see a time range. Only logs within this time range will be shown. It defaults to only include logs within the last 15 minutes, so you’ll likely need to change it. Set absolute time ranges that cover your Lucille run. (The UI for setting these times can be a bit finicky, but keep trying.)

Like before, we can trace the entire lifecycle of a single Document. In the search bar, type in: id:"songs.csv-1". Now, you should only see the log statements relevant to this Document. Each entry will be a large blob of text representing the entire Document. You can select certain fields on the left side of your screen to get a more focused view of each statement.

You can also sort the statements in a certain order. If you hover over Time at the top of the table, you can click the arrow to sort the logs by their timestamps.

Dashboards

Now, let’s see how Dashboards could help you monitor a certain Lucille run. Imagine you have scheduled a Lucille run to execute every day. We can create a Dashboard that will help us quickly see how many warnings/errors took place.

Click add in the top right to create a new panel. Click + Create new, then choose visualization, and then metric. Choose your logs index as the source.

In this new window, for metric, click buckets. Choose a Filters aggregation, and in the filter, include log.level:"WARN" (using Dashboards Query Language). This will display the number of log statements with a level of “WARN”.

A panel displaying the number of warnings.

You can then repeat this process, but for logs with a level of ERROR.

Now, let’s make a chart that’ll display the number of warnings per day. Create a new visualization - this time, a vertical bar. Go to buckets, and add X-axis. Configure it like this:

Configuration for the X-axis bucket.

Then, add another bucket - a split series. Configure it just like the panel - a Filters aggregation, with log.level:"WARN". Now you’ll have a chart tracking the number of warnings per day. And, again, you can do the same for ERROR-level logs.

Your dashboard might look something like this (but populated with a bit of actual data):

A dashboard with four panels, the number of warnings, warnings per day, number of errors, and errors per day.

5.2 - Troubleshooting

A guide to some common issues and their resolution.

If something isn’t working, read the logs first. Lucille logs are detailed and will usually identify where the problem is coming from.

Build & Java Environment Issues

Java Command Not Found / Wrong Version

Symptoms:

  • java: command not found.
  • Lucille fails with version errors.

Fix:

  • Ensure Java 17+ is installed and on PATH.
  • Ensure JAVA_HOME points to a JDK (not a JRE).
java -version
echo $JAVA_HOME

Example output:

java -version

java version "22.0.1" 2024-04-16
Java(TM) SE Runtime Environment (build 22.0.1+8-16)
Java HotSpot(TM) 64-Bit Server VM (build 22.0.1+8-16, mixed mode, sharing)

echo $JAVA_HOME

/Library/Java/JavaVirtualMachines/jdk-22.jdk/Contents/Home

If unset, refer to the installation guide.

Configuration Issues

Missing or Misspelled Keys

Symptoms:

  • Configuration does not contain key “name”.
  • Stage/Indexer throws for missing required property.

Fix:

  • Compare with component SPEC docs and ensure required fields exist and are correctly cased.

Invalid Types or List/Map Shapes

Symptoms:

  • Expected NUMBER got STRING.

Fix:

  • Match the SPEC type exactly and convert scalars to lists/maps when required (see the example below).
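
For example, if a SPEC declares a numeric parameter (batchSize is used here purely as an illustration), quoting the value turns it into a string and triggers this error:

# Incorrect: the quoted value is parsed as a STRING
batchSize: "100"

# Correct: the unquoted value is parsed as a NUMBER
batchSize: 100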

Kafka / Messaging

Cannot Connect to Kafka

Symptoms:

  • Timed out waiting for connection from Kafka.
  • Connection to Kafka could not be established.

Fix:

  • Verify kafka.bootstrapServers in your config (see the example below).
  • Check that the Kafka broker is reachable from the machine running Lucille.
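
A minimal sketch of the relevant setting (the broker address is an example value):

# Equivalent to kafka.bootstrapServers
kafka {
  bootstrapServers: "localhost:9092"
}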

Solr / Elasticsearch / OpenSearch Connectivity

Cannot Connect to Solr / Elasticsearch / OpenSearch

Symptoms:

  • Failed to talk to the search cluster.
  • Connection refused or host not found.

Fix:

  • Ensure that the URL/port in your config is correct.
  • Ensure that the service is reachable from the machine running Lucille (see the example below).
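
One quick reachability check is to hit the service’s health endpoint directly from that machine, for example (the hosts and ports shown are defaults and may differ in your environment):

# OpenSearch / Elasticsearch
curl "http://localhost:9200/_cluster/health"

# Solr
curl "http://localhost:8983/solr/admin/info/system"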

Certificate / Authorization Problems

Symptoms:

  • Handshake failure, certificate not trusted, or unauthorized.

Fix:

  • Test the connection with auth/TLS disabled to isolate the issue (for debugging only).
  • Verify that credentials are correct.

Schema / Mapping mismatches

Symptoms:

  • Bulk index partially fails and some documents are rejected.
  • Mapping or parsing exceptions.

Fix:

  • Read the exact field and reason in the error details and fix that field in your pipeline or mapping.

Timeouts

Lucille has a default connector timeout of 24 hours, specified by runner.connectorTimeout. You can override this timeout in the configuration file; this may be necessary for very large or long-running jobs that take more than 24 hours to complete.
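
For example, a config override might look like the following. The value shown assumes the timeout is given in milliseconds - confirm the expected units in the configuration docs before relying on it:

runner {
  # 48 hours, assuming milliseconds (an assumption - verify the expected units).
  connectorTimeout: 172800000
}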

6 - Releases

Release Notes

Release Notes for all previous Lucille releases can be viewed on GitHub.