Getting Started

Understanding the basics to quickly get started.

Installation

See the installation guide to install prerequisites, clone the repository, and build Lucille.

Try it out

Lucille includes a few examples in the lucille-examples module to help you get started.

To see how to ingest the contents of a local CSV file into an instance of Apache Solr, refer to the simple-csv-solr-example.

To run this example, start an instance of Apache Solr on port 8983 and create a collection called quickstart. For more information about how to use Solr, see the Apache Solr Reference Guide.
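If you have a local Solr distribution downloaded, one way to do this (a sketch, assuming a recent Solr release run from its install directory; adjust for your install method) is with Solr's control script:

```shell
# Start Solr on the default port (8983)
bin/solr start -p 8983

# Create the collection the example expects
bin/solr create -c quickstart
```
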

Go to lucille-examples/lucille-simple-csv-solr-example in your working copy of Lucille and run:

mvn clean install

./scripts/run_ingest.sh

This script executes Lucille with a configuration file named simple-csv-solr-example.conf that tells Lucille to read a CSV of top songs and send each row as a document to Solr.

Run a commit with openSearcher=true on your quickstart collection to make the documents visible. Go to your Solr admin dashboard, execute a *:* query, and you should see the songs from the source file as Solr documents.
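If you prefer the command line to the admin UI, the commit can also be issued over HTTP against Solr's update handler (assuming Solr is running at its default localhost address):

```shell
# Commit with openSearcher=true so the new documents become searchable
curl 'http://localhost:8983/solr/quickstart/update?commit=true&openSearcher=true'
```
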

Quick Start Guide - Local Mode

Scope: The steps below run Lucille from a source build (built locally with Maven).

What is Local Mode?

Local mode runs all Lucille components (connector, pipeline, and indexer) inside a single JVM process that you start locally. Your configuration may still interact with external systems (e.g., S3, Solr, OpenSearch/Elasticsearch), but the Lucille runtime itself executes entirely within that single JVM.

Prepare a Configuration File

You’ll run Lucille by pointing it at a config file that declares your connectors, pipelines, and indexers. See the configuration docs for the full schema and supported components.
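As a minimal sketch of the expected shape, the config below uses the RSSConnector and csv indexer that appear in the RSS cookbook later on this page; the feed URL and output path are placeholders:

```hocon
connectors: [
  {
    name: "connector1"
    pipeline: "pipeline1"
    class: "com.kmwllc.lucille.connector.RSSConnector"
    rssURL: "https://example.com/feed.xml"   # placeholder feed URL
  }
]

pipelines: [
  {
    name: "pipeline1"
    stages: []   # add stages here to transform documents
  }
]

indexer: {
  type: "csv"
}

csv: {
  path: "./output.csv"
  columns: ["id", "link", "title"]
}
```
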

Run Lucille Locally

From the repository root, run the Runner with your config file:

java \
  -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
  -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
  com.kmwllc.lucille.core.Runner

What this Does

  • -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> tells Lucille where to find your configuration.
  • -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' loads Lucille and its dependencies.
  • com.kmwllc.lucille.core.Runner boots the Lucille engine in local mode and runs the configured pipeline to completion.

Trouble Running Lucille?

See the troubleshooting guide for common pitfalls.

Quick Start Guide - Distributed Mode

What is Distributed Mode?

Distributed mode allows you to scale Lucille to take advantage of available hardware by running each Lucille component in its own JVM and using Kafka for document transport and event tracking. You start:

  • A Runner (Publisher + Connectors) to publish documents onto Kafka.
  • One or more Workers to process the documents through a pipeline.
  • An Indexer to write the processed documents to your destination (Solr, OpenSearch, Elasticsearch, CSV, etc.).

This guide assumes Kafka and your destination system are already running and reachable; it focuses on running Lucille itself. For details on configuration structure and component options, see the corresponding docs.

Prepare a Configuration File

You’ll run Lucille by pointing it at a config file that declares your pipeline. See the configuration docs for the full schema and supported components.

Use a single config that defines your connector(s), your pipeline(s), your kafka configuration, and your indexer along with its backend config (e.g., solr {}, opensearch {}).
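A structural sketch of such a config is below. The connector and opensearch blocks mirror the RSS cookbook later on this page, and the pipeline name matches the Worker/Indexer commands in this guide; the key names inside the kafka block are illustrative placeholders, not Lucille's documented schema — consult the configuration docs for the real keys:

```hocon
connectors: [
  {
    name: "connector1"
    pipeline: "simple_pipeline"
    class: "com.kmwllc.lucille.connector.RSSConnector"
    rssURL: "https://example.com/feed.xml"   # placeholder feed URL
  }
]

pipelines: [
  {
    name: "simple_pipeline"
    stages: []
  }
]

# Illustrative only: see the configuration docs for Lucille's actual kafka keys.
kafka: {
  bootstrapServers: "localhost:9092"
}

indexer: {
  type: "opensearch"
}

opensearch: {
  url: "https://localhost:9200"   # placeholder
  index: "rss-index"
}
```
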

Start Components (Separate JVMs)

A) Start the Runner (publishes to Kafka)

The runner publishes documents to the Kafka source topic, listens for pipeline run events, logs run statistics, and waits for the run to complete.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Runner \
 -useKafka

B) Start one or more Workers

Each worker consumes documents from the Kafka source topic, processes each document through the configured pipeline, and writes the processed documents to the Kafka destination topic.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Worker \
 simple_pipeline

C) Start the Indexer

The indexer consumes documents from the Kafka destination topic and sends batches of processed documents to the configured search backend.

java \
 -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> \
 -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' \
 com.kmwllc.lucille.core.Indexer \
 simple_pipeline

What this Does

  • -Dconfig.file=<PATH/TO/YOUR/CONFIG.conf> tells Lucille where to find your configuration.
  • -cp 'lucille-core/target/lucille.jar:lucille-core/target/lib/*' loads Lucille and its dependencies.
  • com.kmwllc.lucille.core.Runner -useKafka starts the run and interacts with Kafka as described above.
  • com.kmwllc.lucille.core.Worker <pipelineName> processes documents through the configured pipeline as described above.
  • com.kmwllc.lucille.core.Indexer <pipelineName> writes processed documents to the configured backend as described above.

Trouble Running Lucille?

See the troubleshooting guide for common pitfalls.

Verifying Your Lucille Run

  • Logs: You should see Lucille start up, load your configuration, report component initialization, record counts, and completion status.

    During the run, you will see throughput and latency metrics like:

    25/10/31 13:40:21 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO WorkerPool: 27017 docs processed. One minute rate: 1787.10 docs/sec. Mean pipeline latency: 10.63 ms/doc.
    25/10/31 13:40:22 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO PublisherImpl: 37029 docs published. One minute rate: 3225.69 docs/sec. Mean connector latency: 0.00 ms/doc. Waiting on 21014 docs.
    25/10/31 13:40:22 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Indexer: 17016 docs indexed. One minute rate: 455.07 docs/sec. Mean backend latency: 6.90 ms/doc.
    

    At completion, Lucille prints a stage-by-stage performance summary and a final run result:

    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Stage: Stage test_source metrics. Docs processed: 200000. Mean latency: 0.0003 ms/doc. Children: 0. Errors: 0.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Stage: Stage test_summary metrics. Docs processed: 200000. Mean latency: 0.3532 ms/doc. Children: 0. Errors: 0.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Runner: 
    RUN SUMMARY: Success. 1/1 connectors complete. All published docs succeeded.
    connector1: complete. 200000 docs succeeded. 0 docs failed. 0 docs dropped. Time: 416.47 secs.
    25/10/31 13:46:47 6790d2e9-1079-4f15-b75a-acab4ae8e4c2  INFO Runner: Run took 417.46 secs.
    
  • Output: Query your target service (e.g., Elasticsearch) to verify that the documents were indexed.

1 - Installation

A guide to installing Lucille locally.

Prerequisites

To build and run Lucille from source, you need:

  • Java 17+ JDK (not just a JRE)
  • Maven (recent version)

Java Setup (JDK 17+ Required)

Important: Before running any Lucille commands, make sure JAVA_HOME points to a JDK 17+ (not just a JRE) and that $JAVA_HOME/bin is on your PATH (or %JAVA_HOME%\bin on Windows). Maven and the java launcher rely on this.

Verify Java

java -version

You should see version 17 or newer. If Java is missing or older than 17, install a JDK 17+ using one of the options below.

Install Options

Package manager

  • macOS (Homebrew)
    brew install openjdk@17
    
  • Windows (Chocolatey)
    choco install microsoft-openjdk17
    

Vendor installer

  • Download a JDK 17+ installer from a vendor such as Oracle JDK.
  • Run the installer, then set JAVA_HOME as shown below.

Set JAVA_HOME and PATH

macOS

export JAVA_HOME="$(/usr/libexec/java_home -v 17)"
export PATH="$JAVA_HOME/bin:$PATH"

Windows

  • Open System Properties, Environment Variables.
  • Create/Edit JAVA_HOME and point it to your JDK folder.
  • Edit Path and add %JAVA_HOME%\bin above other Java entries.

Maven Setup

Verify Maven

mvn -v

You should see a recent Maven version and your Java home. If mvn is not found, install Maven using one of the options below.

Install Options

Package manager

  • macOS (Homebrew)
    brew install maven
    
  • Windows (Chocolatey)
    choco install maven
    

Binary installer

  • Download the binary zip/tar for Apache Maven from the official website.
  • Add Maven’s bin/ to your PATH.

macOS

export PATH="<maven-dir>/bin:$PATH"

Windows

  • Open System Properties, Environment Variables.
  • Edit Path and add <maven-dir>\bin.

Clone the Repository

git clone https://github.com/kmwtechnology/lucille.git

Build Lucille

cd lucille
mvn clean install

This compiles all modules and produces build artifacts under each module’s target/ folder.
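The local and distributed quick starts above both reference lucille-core/target/lucille.jar and lucille-core/target/lib/ on the classpath, so after building you can confirm those artifacts exist:

```shell
# Verify the core jar and its bundled dependencies were produced
ls lucille-core/target/lucille.jar
ls lucille-core/target/lib/
```
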

2 - Cookbooks

Developer cookbooks for using Lucille to accomplish your goals.

2.1 - RSS Cookbook

A guide to using the RSS Connector in Lucille.

RSS to CSV

Let’s say we wanted to read from an RSS feed into a CSV using Lucille. We can set this up in the following manner:

  1. Create a .conf file to configure Lucille
  2. Specify the RSS connector in the connectors section of the config:
connectors: [
  {
    name: "RSSConnector"
    pipeline: "rssPipeline"
    class: "com.kmwllc.lucille.connector.RSSConnector"
    rssURL: "https://www.cnbc.com/id/15837362/device/rss/rss.html"
  }
]

There are a few additional configuration options that we won’t use here, but are useful:

    useGuidForDocID: true       # defaults to true; set false to use UUID as ID instead
    pubDateCutoff: "24h"        # only publish items from the last 24 hours
    runDuration: "1h"           # run incrementally for 1 hour total
    refreshIncrement: "5m"      # re-fetch the feed every 5 minutes

Your pipeline name can be whatever you want. For our URL, we chose CNBC’s RSS feed.

  3. Now we define what stages we would like to use to process our documents from the feed. To give context as to what these stages are doing:

News items in an RSS feed often have some article metadata, along with a field containing a link to the actual article content in HTML.

  • The fetchURI stage allows us to grab the actual content of our associated news article.
  • The ApplyJSoup stage parses that content into fields that will exist in addition to our article metadata from the RSS feed. These fields include the body, bullet points, and the header.
pipelines: [
  {
    name: "rssPipeline"
    stages: [
      {
        name: "fetchURI",
        class: "com.kmwllc.lucille.stage.FetchUri"
        source: "link"
        dest: "content"
      }
      {
        name: "ApplyJSoup"
        class: "com.kmwllc.lucille.stage.ApplyJSoup"
        byteArrayField: "content"
        destinationFields: {
          paragraphTexts: {
            type: "text",
            selector: ".ArticleBody-articleBody p"
          }
          bulletPoints: {
            type: "text",
            selector: ".RenderKeyPoints-list li"
          }
          headline: {
            type: "text"
            selector: "h1"
          }
        }
      }
    ]
  }
]
  4. We can index these documents into whatever we’d like. Here, we might decide to simply write them to a CSV:
indexer: {
  type: "csv"
}

csv: {
  path: "./rss_results.csv"
  columns: ["id", "link", "title", "description", "paragraphTexts",
    "bulletPoints", "headline"]
}

Here is the full config file:

connectors: [
  {
    name: "RSSConnector"
    pipeline: "rssPipeline"
    class: "com.kmwllc.lucille.connector.RSSConnector"
    rssURL: "https://www.cnbc.com/id/15837362/device/rss/rss.html"
  }
]

pipelines: [
  {
    name: "rssPipeline"
    stages: [
      {
        name: "fetchURI",
        class: "com.kmwllc.lucille.stage.FetchUri"
        source: "link"
        dest: "content"
      }
      {
        name: "ApplyJSoup"
        class: "com.kmwllc.lucille.stage.ApplyJSoup"
        byteArrayField: "content"
        destinationFields: {
          paragraphTexts: {
            type: "text",
            selector: ".ArticleBody-articleBody p"
          }
          bulletPoints: {
            type: "text",
            selector: ".RenderKeyPoints-list li"
          }
          headline: {
            type: "text"
            selector: "h1"
          }
        }
      }
    ]
  }
]

indexer: {
  type: "csv"
}

csv: {
  path: "./rss_results.csv"
  columns: ["id", "link", "title", "description", "paragraphTexts",
    "bulletPoints", "headline"]
}

When the run completes, the CSV file is saved to disk at the configured path.
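A quick way to spot-check the output, assuming the path configured above:

```shell
# Show the header row and the first few documents
head -n 5 ./rss_results.csv
```
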

RSS to OpenSearch

We might also choose to index into another destination, like an OpenSearch index. Here’s an example. Replace the config with this:

connectors: [
  {
    name: "RSSConnector"
    pipeline: "rssPipeline"
    class: "com.kmwllc.lucille.connector.RSSConnector"
    rssURL: "https://www.cnbc.com/id/15837362/device/rss/rss.html"
    refreshIncrement: "60s"
    runDuration: "1h"
  }
]

pipelines: [
  {
    name: "rssPipeline"
    stages: []
  }
]

indexer: {
  type: "opensearch"
}

opensearch: {
  url: <Your OpenSearch URL>
  index: "rss-index"
  acceptInvalidCert: true
}

You’ll notice we’re using incremental mode now, with a refreshIncrement of 60s and a runDuration of 1h. This means every item in the feed is indexed on the initial run, and then Lucille re-fetches the feed every 60 seconds for 1 hour total, publishing any new items that appear.

Run Lucille again. Here’s what 3 of our documents look like after being indexed into OpenSearch:

GET /rss-index/_search
{
  "size": 3
}
{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 30,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "rss-index",
        "_id": "108275704",
        "_score": 1,
        "_source": {
          "id": "108275704",
          "guid": "108275704",
          "isPermaLink": false,
          "link": "https://www.cnbc.com/2026/03/09/watch-live-trump-press-conference-iran-war-oil-hormuz-doral.html",
          "title": "Watch live: Trump holds press conference as Iran war fallout roils oil market",
          "pubDate": "2026-03-09T21:13:25Z",
          "run_id": "cb234bd2-fdf7-4b88-be13-20de36cd059e"
        }
      },
      {
        "_index": "rss-index",
        "_id": "108275619",
        "_score": 1,
        "_source": {
          "id": "108275619",
          "description": "The OpenAI deal fallout exposes the fundamental danger of being the most leveraged player.",
          "guid": "108275619",
          "isPermaLink": false,
          "link": "https://www.cnbc.com/2026/03/09/oracle-is-building-yesterdays-data-centers-with-tomorrows-debt.html",
          "title": "Oracle is building yesterday's data centers with tomorrow's debt",
          "pubDate": "2026-03-09T20:52:19Z",
          "run_id": "cb234bd2-fdf7-4b88-be13-20de36cd059e"
        }
      },
      {
        "_index": "rss-index",
        "_id": "108275649",
        "_score": 1,
        "_source": {
          "id": "108275649",
          "description": "U.S. stock market indexes rose on the heels of reported comments by President Donald Trump.",
          "guid": "108275649",
          "isPermaLink": false,
          "link": "https://www.cnbc.com/2026/03/09/trump-iran-war-end.html",
          "title": "Trump says Iran 'war is very complete,' talks to Putin, reports say",
          "pubDate": "2026-03-09T21:32:43Z",
          "run_id": "cb234bd2-fdf7-4b88-be13-20de36cd059e"
        }
      }
    ]
  }
}

Using other indexers with the RSS connector follows much the same pattern.