Documentation


1 - About Lucille

Understanding what Lucille is, why you should use it, and how it works.

What is Lucille?

Lucille is a production-grade Search ETL solution designed to efficiently get data into Lucene-based search engines such as Elasticsearch, OpenSearch, and Solr, as well as vector databases such as Pinecone and Weaviate. Lucille enables complex processing of documents before they are indexed by the search engine, freeing up search-engine resources to serve queries with greater speed.

Lucille is Java-based and open-source. Lucille supports batch, incremental, and streaming data ingestion architectures.

Why use Lucille?

Search ETL is a category of ETL problem where data must be extracted from a source system, transformed, and loaded into a search engine.

A Search ETL solution must speak the language of search: it must represent data in the form of search-engine-ready Documents, it must know how to enrich Documents to support common search use cases, and it must follow best practices for interacting with search engines, including support for batching, routing, and versioning.

To be production-grade, a search ETL solution must be scalable, reliable, and easy to use. It should support parallel Document processing, it should be observable, it should be easy to configure, it should have extensive test coverage, and it should have been hardened through multiple challenging real-world deployments.

Lucille handles all of these things so you don’t have to. Lucille helps you get your data into Lucene-based search engines like Apache Solr, Elasticsearch, or OpenSearch as well as vector-based search engines like Pinecone and Weaviate, and it helps you keep that search engine content up-to-date as your backend data changes. Lucille does this in a way that scales as your data volume grows, and in a way that’s easy to evolve as your data transformation requirements become more complex. Lucille implements search best practices so you can stay focused on your data itself and what you want to do with it.

How Does Lucille Work?

The basic architectural ideas of Lucille are as follows:

  1. A Connector retrieves data from a source system.
  2. Worker(s) enrich the data.
  3. Indexer(s) index the data into a search engine.
  4. These three core components (Connectors, Workers, and Indexers) run concurrently and communicate with each other using a messaging framework.
    • The core components can run as threads inside a single JVM, allowing all of Lucille to run as one Java process for a simple and easy deployment model.
    • The core components can function as standalone Java processes communicating through an external Apache Kafka message broker, allowing for massive scale.
  5. Documents are enriched en route to the search engine using a Pipeline built from composable processing Stages. The pipeline is configuration-driven.
  6. Document lifecycle events (such as creation, indexing, and erroring out) are tracked so that the framework can determine when all the work in a batch ingest is complete.

Installation

To use Lucille, you will need a Java development environment with Java 17 or later and a recent version of Maven. Start by cloning the repository:

git clone https://github.com/kmwtechnology/lucille.git

At the top level of the project, run:

mvn clean install

Getting Started

Lucille includes a few examples in the lucille-examples module to help you get started.

To see how to ingest the contents of a local CSV file into an instance of Apache Solr, refer to lucille-examples/lucille-simple-csv-solr-example.

To run this example, start an instance of Apache Solr on port 8983 and create a collection called quickstart. For more information about how to use Solr, see the Apache Solr Reference Guide.

Go to lucille-examples/lucille-simple-csv-solr-example in your working copy of Lucille and run:

mvn clean install

./scripts/run_ingest.sh

This script executes Lucille with a configuration file named simple-csv-solr-example.conf that tells Lucille to read a CSV of top songs and send each row as a document to Solr.

Run a commit with openSearcher=true on your quickstart collection to make the documents visible. Then go to your Solr admin dashboard, execute a *:* query, and you should see the songs from the source file as Solr documents.
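For example, assuming Solr is running on its default port, you can trigger the commit with curl:

curl "http://localhost:8983/solr/quickstart/update?commit=true&openSearcher=true"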


More Information

The Lucille project is developed and maintained by KMW Technology (kmwllc.com). For more information regarding Lucille, please contact us.


2 - Quick Start Guides


2.1 - Quick Start Guide - Distributed Mode


2.2 - Quick Start Guide - Hybrid Mode


2.3 - Quick Start Guide - Local Mode


3 - Architecture

Understanding Lucille’s core components & topology.


3.1 - Topology


3.1.1 - Distributed & Hybrid Modes


Fully Distributed

Connectorless Distributed

Connectorless Hybrid

3.1.2 - Local Modes


Local Mode

Local Kafka Mode

3.2 - Components


3.2.1 - Config

How to configure and validate your configuration for a Lucille run.

Lucille Configuration

When you run Lucille, you provide a path to a file containing the configuration for your run. Configuration (Config) files use HOCON, a superset of JSON. This file defines all of the components in your Lucille run.

A complete config file must contain three elements:

Connectors

Connectors read data from a source and emit it as a sequence of individual Documents, which will then be sent to a Pipeline for enrichment.

connectors should be populated with a list of Connector configurations.

See Connectors for more information about configuring Connectors.

Pipelines

A Pipeline is a list of Stages that are applied to incoming Documents to prepare them for indexing. As each Connector executes, the Documents it publishes can be processed by a Pipeline.

pipelines should be populated with a list of Pipeline configurations. Each Pipeline needs two values: name, the name of the Pipeline, and stages, a list of the Stages to use. Multiple connectors may feed to the same Pipeline.

See Stages for more information about configuring Stages.

Indexer

An indexer sends processed Documents to a specific destination. Only one Indexer can be defined; all pipelines will feed to the same Indexer.

A full indexer configuration has two parts: first, the generic indexer configuration, and second, configuration for the specific indexer used in your run. For example, to use the SolrIndexer, you provide indexer and solr config.

See Indexers for more information about configuring your Indexer.
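Putting these three elements together, a minimal config file might look like the following sketch, which reuses the SequenceConnector and Print Stage documented elsewhere on this site. Treat the exact properties as illustrative (the CSV indexer's path property is described in the Config Validation section):

connectors: [
  {
    name: "connector1"
    class: "com.kmwllc.lucille.connector.SequenceConnector"
    pipeline: "pipeline1"
    numDocs: 10
  }
]

pipelines: [
  {
    name: "pipeline1"
    stages: [
      {
        name: "print-1"
        class: "com.kmwllc.lucille.stage.Print"
      }
    ]
  }
]

indexer {
  type: "CSV"
}

csv {
  path: "output.csv"  # illustrative path
}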

Other Run Configuration

In addition to those three elements, you can also configure other parts of a Lucille run.

  • publisher - Define the queueCapacity.
  • log
  • runner
  • zookeeper
  • kafka - Provide a consumerPropertyFile, producerPropertyFile, adminPropertyFile, and other configuration.
  • worker - Control how many threads you want, the maxRetries in Zookeeper, and more.
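For example, a run might tune the components above like this (the worker thread-count key and the file paths are illustrative assumptions; queueCapacity, maxRetries, and the Kafka property-file keys are as documented above):

publisher {
  queueCapacity: 10000
}

worker {
  threads: 4      # property name for the thread count is an assumption
  maxRetries: 2
}

kafka {
  consumerPropertyFile: "conf/kafka-consumer.properties"
  producerPropertyFile: "conf/kafka-producer.properties"
  adminPropertyFile: "conf/kafka-admin.properties"
}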

Validation

Lucille validates the Config you provide for Connectors, Stages, and Indexers. For example, in a Stage, if you provide a property the Stage does not use, an Exception will be thrown. An Exception will also be thrown if you do not provide a property required by the Stage.

If you want to validate your config file without starting an actual run, you can use the command-line validation tool: just add -validate to the end of your command executing Lucille. Any errors in your config will be printed to the console, and no actual run will take place.
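For example, if you launch Lucille with a command along these lines (the main class and classpath here are illustrative of a typical local-mode invocation, not a prescribed command), appending -validate turns it into a validation-only pass:

java -Dconfig.file=conf/my-run.conf -cp "lucille-core/target/lib/*" com.kmwllc.lucille.core.Runner -validate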

3.2.1.1 - Config Validation

How Lucille validates Configs - and what developers need to know.

Config Validation

Lucille components (like Stages, Indexers, and Connectors) each take in a set of specific arguments to configure the component correctly. Some properties are required - like the pathsToStorage for your FileConnector traversal, or the path for your CSVIndexer. Other properties are optional and do not always have to be specified.

For these components, developers must declare a Spec which defines the properties that are required or optional. They must also declare what type each property is (number, boolean, string, etc.). For example, the SequenceConnector requires you to specify the numDocs you want to create, and optionally, the number you want IDs to startWith. So, the Spec looks like this:

public static final Spec SPEC = Spec.connector()
      .requiredNumber("numDocs")
      .optionalNumber("startWith");

Declaring a Spec

Lucille is designed to access Specs reflectively. If you build a Stage/Indexer/Connector/File Handler, you need to declare a public static Spec named SPEC (exactly). Failure to do so will not result in a compile-time error. However, you will not be able to instantiate your component - even in unit tests - as the reflective access (which takes place in the super / abstract class) will always fail.

When you declare the public static Spec SPEC, you’ll want to call the appropriate Spec method which provides appropriate default arguments for your component. For example, if you are building a Stage, you should call Spec.stage(), which allows the config to include name, class, conditions, and conditionPolicy.
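A minimal sketch for a hypothetical Stage follows. The requiredString method is an assumption, named by analogy with the requiredNumber/optionalNumber methods shown above:

public class MyStage extends Stage {

  // Must be public, static, and named SPEC exactly; Lucille accesses it reflectively.
  public static final Spec SPEC = Spec.stage()
      .requiredString("sourceField")  // assumed method, by analogy with requiredNumber
      .optionalNumber("limit");

  // ... constructor and processDocument ...
}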

Lists and Objects

Validating a list / object is a bit tricky. When you declare a required / optional list or object in a Config, you can either provide:

  1. A TypeReference describing what the unwrapped List/Object should deserialize/cast to.
  2. A Spec, for a list, or a ParentSpec, for an object, describing the valid properties. (Use a Spec for a list when you need a List<Config> with specific structure. For example, Stage conditions.)

Parent / Child Validation

Some configs include properties which are objects, containing even more properties. For example, in the FileConnector, you can specify fileOptions - which includes a variety of additional arguments, like getFileContent, handleArchivedFiles, and more. This is defined in a ParentSpec. A ParentSpec has a name (the key the config is held under) and has its own required/optional properties. The fileOptions ParentSpec is:

Spec.parent("fileOptions")
  .optionalBoolean("getFileContent", "handleArchivedFiles", "handleCompressedFiles")
  .optionalString("moveToAfterProcessing", "moveToErrorFolder");

A ParentSpec can be either required or optional. When the parent is present, its properties will be validated against this ParentSpec.

There will be times that you can’t know what the field names would be in advance. For example, a field mapping of some kind. In this case, you should pass in a TypeReference describing what type the unwrapped ConfigObject should deserialize/cast to. For example, if you want a field mapping of Strings to Strings, you’d pass in new TypeReference<Map<String, String>>(){}. In general, you should use a Spec when you know field names, and a TypeReference when you don’t.
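As a rough sketch of the two approaches (the exact Spec method names for declaring lists and objects may differ from what is shown here):

// Known field names: describe the object's structure with a ParentSpec.
Spec.parent("fieldOptions").optionalBoolean("trim", "lowercase");  // illustrative keys

// Unknown field names (e.g. an arbitrary source-to-destination field mapping):
new TypeReference<Map<String, String>>() {}  // pass this where the Spec expects a TypeReference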

Why Validate?

Obviously, you will get an error if you call config.getString("field") when "field" is not specified. However, additional validation on Configs is still useful/necessary for two primary reasons:

  1. Command-line utility

We want to allow the command-line validation utility to provide a comprehensive list of a Config’s errors. As such, Lucille has to validate the config before a Stage/Connector/Indexer begins accessing properties and potentially throwing a ConfigException.

  2. Prevent typos from ruining your pipeline

A mistyped field name could have massive ripple effects throughout your pipeline. As such, each Stage/Connector/Indexer needs to have a specific set of legal Config properties, so Exceptions can be raised for unknown or unrecognized properties.

3.2.2 - Indexer

An Indexer sends processed Documents to a specific destination.

Indexers

An indexer sends processed Documents to a specific destination. Only one Indexer can be defined in a Lucille run. All pipelines will feed to the same Indexer.

Indexer configuration has two parts: the generic indexer configuration, and configuration for the implementation you are using. For example, if you are using Solr, you’d provide solr config, or elastic for Elasticsearch, csv for CSV, etc.

Here’s what using the SolrIndexer might look like:

# Generic indexer config
indexer {
  type: "solr"
  ignoreFields: ["city_temp"]
  batchSize: 100
}
# Specific implementation (Solr) config
solr {
  useCloudClient: true
  url: "http://localhost:8983/solr"
  defaultCollection: "test_index"
}

At a minimum, indexer must contain either type or class. type is shorthand for an indexer provided by lucille-core - it can be "Solr", "OpenSearch", "ElasticSearch", or "CSV". indexer can contain a variety of additional properties as well. Some Indexers do not support certain properties, however. For example, OpenSearchIndexer and ElasticsearchIndexer do not support indexer.indexOverrideField.

The lucille-core module contains a number of commonly used indexers. Additional indexers with a large number of dependencies are provided as optional plugin modules.

Lucille Indexers (Core)

Lucille Indexers (Plugins)

3.2.3 - Stages

A Stage performs a specific transformation on a Document.

Lucille Stages

Stages are the building blocks of a Lucille pipeline. Each Stage performs a specific transformation on a Document.

Lucille Stages should have JavaDocs that describe their purpose and the parameters accepted in their Config. On this site, you'll find more in-depth documentation for some of the more advanced or complex Lucille Stages.

To configure a Stage, you have to provide its class (under class) in its config. You can also specify a name for the Stage, as well as conditions and a conditionPolicy (described below).

You'll also provide any parameters needed by the Stage. For example, the AddRandomBoolean Stage accepts two optional parameters - field_name and percent_true. So, an AddRandomBoolean Config would look something like this:

{
  name: "AddRandomBoolean-First"
  class: "com.kmwllc.lucille.stage.AddRandomBoolean"
  field_name: "rand_bool_1"
  percent_true: 65
}

Conditions

For any Stage, you can specify “conditions” in its Config, controlling when the Stage will process a Document. Each condition has a required parameter, fields, and two optional parameters, operator and values.

  • fields is a list of field names that will determine whether the Stage applies to a Document.

  • values is a list of values that the conditional fields will be searched for. (If not specified, only the existence of fields is checked.)

  • operator is either "must" or "must_not" (defaults to "must").

In the root of the Stage’s Config, you can also specify a conditionPolicy - either "any" or "all", specifying whether any or all of your conditions must be met for the Stage to process a Document. (Defaults to "any".)

Let’s say we are running the Print Stage, but we only want it to execute on a Document where city = Boston or city = New York. Our Config for this Stage would look something like this:

{
  name: "print-1"
  class: "com.kmwllc.lucille.stage.Print"
  conditions: [
    {
      fields: ["city"]
      values: ["Boston", "New York"]
    }
  ]
}
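Conditions compose with operator and conditionPolicy. For instance, a Print Stage that should run only when city is Boston and state is not NY could be configured like this:

{
  name: "print-2"
  class: "com.kmwllc.lucille.stage.Print"
  conditionPolicy: "all"
  conditions: [
    {
      fields: ["city"]
      values: ["Boston"]
    }
    {
      fields: ["state"]
      operator: "must_not"
      values: ["NY"]
    }
  ]
}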

3.2.3.1 - PromptOllama

Connect to Ollama Server and send a Document to an LLM for enrichment.

What if you could just, actually, put an LLM on everything?

Ollama

Ollama allows you to run a variety of Large Language Models (LLMs) with minimal setup. You can also create custom models using Modelfiles and system prompts.

The PromptOllama Stage allows you to connect to a running instance of Ollama Server, which communicates with an LLM through a simple API. The Stage sends part (or all) of a Document to the LLM for generic enrichment. You’ll want to create a custom model (with a Modelfile) or provide a System Prompt in the Stage Config that is tailored to your pipeline.

We strongly recommend having the LLM output only a JSON object, for two main reasons: first, LLMs tend to produce more reliable, structured output when instructed to respond with JSON only; second, Lucille can then parse the JSON response and fully integrate it into your Document.

Example

Let’s say you are working with Documents which represent emails, and you want to monitor them for potential signs of fraud. Lucille doesn’t have a DetectFraud Stage (at time of writing), but you can use PromptOllama to add this information with an LLM.

  • Modelfile: Let’s say you created a custom model, fraud_detector, in your instance of Ollama Server. As part of the modelfile, you instruct the model to check the contents for fraud and output a JSON object containing just a boolean value (under fraud). Your Stage would be configured like so:
{
  name: "Ollama-Fraud"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "fraud_detector"
  fields: ["email_text"]
}
  • System Prompt: You can also just reference a specific LLM directly, and provide a system prompt in the Stage configuration.
{
  name: "Ollama-Fraud"
  class: "com.kmwllc.lucille.stage.PromptOllama"
  hostURL: "http://localhost:11434"
  modelName: "gemma3"
  systemPrompt: "You are to read the text inside \"email_text\" and output a JSON object containing only one field, fraud, a boolean, representing whether the text contains evidence of fraud or not."
  fields: ["email_text"]
}

Regardless of the approach you choose, the LLM will receive a request that looks like this:

{
  "email_text": "Let's be sure to juice the numbers in our next quarterly earnings report."
}

(Since fields: ["email_text"], any other fields on this Document are not part of the request.)

And the response from the LLM should look like this:

{
  "fraud": true
}

Lucille will then add all key-value pairs in this response JSON into your Document. So, the Document will become:

{
  "id": "emails.csv-85",
  "run-id": "f9538992-5900-459a-90ce-2e8e1a85695c",
  "email_text": "Let's be sure to juice the numbers in our next quarterly earnings report.",
  "fraud": true
}

As you can see, PromptOllama is very versatile and can be used to enrich your Documents in many ways.

3.2.3.2 - QueryOpensearch

Execute an OpenSearch Template using information from a Document, and add the response to it.

OpenSearch Templates

You can use templates in OpenSearch to repeatedly run a certain query using different parameters. For example, if we have an index full of parks, and we want to search for a certain park, we might use a template like this:

{
  "source": {
    "query": {
      "match_phrase": {
        "park_name": "{{park_to_search}}"
      }
    }
  }
}

In OpenSearch, you could then call this template (providing it park_to_search) instead of writing out the full query each time you want to search.

Templates can also have default values. For example, if you want park_to_search to default to “Central Park” when a value is not provided, it would be written as: "park_name": "{{park_to_search}}{{^park_to_search}}Central Park{{/park_to_search}}"

QueryOpensearch Stage

The QueryOpensearch Stage executes a search template, using certain fields from a Document as the template's parameters, and adds OpenSearch's response to the Document. In your Config, you'll specify either templateName, the name of a search template you've saved, or searchTemplate, the full template you want to execute.

You’ll also need to specify the names of parameters in your search template. These will need to match the names of fields on your Documents. If your names don’t match, you can use the RenameFields Stage first.

In particular, you have to specify which parameters are required and which are optional. If a required name in requiredParamNames is missing from a Document, an Exception will be thrown, and the template will not be executed. If an optional name in optionalParamNames is missing, it simply won't be part of the template execution, so OpenSearch will use the default value.

If a parameter without a default value is missing, OpenSearch doesn’t throw an Exception - it just returns an empty response with zero hits. So, it is very important that requiredParamNames and optionalParamNames are defined very carefully!
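A sketch of a Config for the park-search template above (the class name follows the com.kmwllc.lucille.stage naming convention used elsewhere in these docs; connection settings for your OpenSearch instance are omitted):

{
  name: "query-parks"
  class: "com.kmwllc.lucille.stage.QueryOpensearch"
  templateName: "park_search"
  requiredParamNames: ["park_to_search"]
  optionalParamNames: []
}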

3.2.4 - File Handlers

File Handlers extract Lucille Documents from individual files, like CSV or JSON files, that contain data which can be transformed into Lucille Documents.

File Handlers accept an InputStream for processing, and return the Documents they extract in an Iterator. The provided InputStream and any other underlying resources are closed when the Iterator returns false for hasNext(). As such, when working directly with these File Handlers, it is important to exhaust the Iterators they return.
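Because resources are only released once the Iterator is exhausted, code that works directly with a File Handler should always drain the Iterator. A rough sketch, where the handler and publisher variables and the processing-method signature are hypothetical placeholders:

Iterator<Document> docs = handler.processFile(inputStream, path); // hypothetical signature
while (docs.hasNext()) {
  Document doc = docs.next();
  publisher.publish(doc); // consume every Document
}
// hasNext() has now returned false, so the InputStream has been closed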

File Handlers (Core)

Custom File Handlers

Developers can implement and use custom File Handlers as needed; extend BaseFileHandler to get started. To use a custom File Handler, you have to reference its class in its Config. This is not needed when using the File Handlers provided by Lucille. You can also override the File Handlers provided by Lucille - just include the class you want to use in the Config.

3.2.5 - Connectors

A component that retrieves data from a source system, packages the data into “documents,” and publishes them.

Lucille Connectors

Lucille Connectors are components that retrieve data from a source system, package the data into "Documents", and publish them to a pipeline.

To configure a Connector, you have to provide its class (under class) in its config. You also need to specify a name for the Connector. Optionally, you can specify the pipeline, a docIdPrefix, and whether the Connector requires a Publisher to collapse.

You'll also provide the parameters needed by the Connector. For example, the SequenceConnector requires one parameter, numDocs, and accepts an optional parameter, startWith. So, a SequenceConnector Config would look something like this:

{
  name: "Sequence-Connector-1"
  class: "com.kmwllc.lucille.connector.SequenceConnector"
  docIdPrefix: "sequence-connector-1-"
  pipeline: "pipeline1"
  numDocs: 500
  startWith: 50
}

The lucille-core module contains a number of commonly used connectors. Additional connectors with a large number of dependencies are provided as optional plugin modules.

Lucille Connectors (Core)

The following connectors are deprecated. Use FileConnector instead, along with a corresponding FileHandler.

Lucille Connectors (Plugins)

3.2.5.1 - RSS Connector

A Connector that publishes Documents representing items found in an RSS feed.

The RSSConnector

The RSSConnector allows you to publish Documents representing the items found in an RSS feed of your choice. Each Document will (optionally) contain fields from the RSS items, like the author, description, title, etc. By default, the Document IDs will be the item’s guid, which should be a unique identifier for the RSS item.

You can configure the RSSConnector to only publish recent RSS items, based on the pubDate found on the items. Also, it can run incrementally, refreshing the RSS feed after a certain amount of time until you manually stop it. The RSSConnector will avoid publishing Documents for the same RSS item more than once.

The Documents published may have any of the following fields, depending on how the RSS feed is structured:

  • author (String)
  • categories (List<String>)
  • comments (List<String>)
  • content (String)
  • description (String)
  • enclosures (List<JsonNode>). Each JsonNode contains:
    • type (String)
    • url (String)
    • May contain length (Long)
  • guid (String)
  • isPermaLink (Boolean)
  • link (String)
  • title (String)
  • pubDate (Instant)
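For example, a minimal RSSConnector Config might look like this sketch (the property name for the feed URL is an assumption - check the Connector's JavaDocs):

{
  name: "rss-connector-1"
  class: "com.kmwllc.lucille.connector.RSSConnector"
  pipeline: "pipeline1"
  rssURL: "https://example.com/feed.xml"  # property name is an assumption
}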

3.2.5.2 - Database Connector

Database Connector

3.2.5.3 - File Connector

A Connector that, given a path to S3, Azure, Google Cloud, or the local file system, traverses the content at the given path and publishes Lucille documents representing its findings.

Source Code

The FileConnector traverses a file system and publishes Lucille Documents representing its findings. In your Config, specify pathsToStorage, representing the path(s) you want to traverse. Each path can be a local file system path or a URI for a supported cloud provider.

Working with Cloud Storage

When you are providing FileConnector with URIs to cloud storage, you also need to apply the appropriate configuration for any cloud providers used. For each provider, you’ll need to provide a form of authentication; you can optionally specify the maximum number of files (maxNumOfPages) that Lucille will load into memory for a given request.

  • Azure: Specify the needed options in azure in your Config. You must provide connectionString, or you must provide accountName and accountKey.
  • Google: Specify the needed options in gcp in your Config. You must provide pathToServiceKey.
  • S3: Specify the needed options in s3 in your Config. You must provide accessKeyId, secretAccessKey, and region.
  • For each of these providers, in their configuration, you can optionally include maxNumOfPages as well.
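For example, S3 configuration might look like the following, using HOCON environment-variable substitution for the credentials (the property names are as documented above):

s3 {
  accessKeyId: ${?AWS_ACCESS_KEY_ID}
  secretAccessKey: ${?AWS_SECRET_ACCESS_KEY}
  region: "us-east-1"
  maxNumOfPages: 100
}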

Applying FileHandlers

Some of the files that your FileConnector encounters will themselves contain data that you want to extract more Documents from. For example, the FileConnector may encounter a .csv file where each row represents a Document to be published. This is where File Handlers come in - they individually process these files and create more Lucille Documents from their data. See File Handlers for more.

In order to use File Handlers, you need to specify the appropriate configuration within your Config - specifically, each File Handler you want to use will be a map within the fileOptions you specify. You can use csv, json, or xml. See the documentation for each File Handler to see what arguments are needed / accepted.

File Options

File options determine how you handle and process files you encounter during a traversal. Some commonly used options include:

  • getFileContent: Whether, during traversal, the FileConnector should add an array of bytes representing the file’s contents to the Lucille document it publishes. This will slow down traversal significantly and is resource intensive. On the cloud, this will download the file contents.
  • handleArchivedFiles/handleCompressedFiles: Whether you want to handle archive or compressed files, respectively, during your traversal. For cloud files, this will download the file’s contents.
  • moveToAfterProcessing: A path to move files to after processing.
  • moveToErrorFolder: A path to move files to if an error occurs.
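A sketch combining these options with a CSV File Handler map (the paths are illustrative):

fileOptions {
  getFileContent: false
  handleArchivedFiles: true
  moveToAfterProcessing: "/data/processed"
  moveToErrorFolder: "/data/errors"
  csv {
    # arguments for the CSV File Handler go here - see its documentation
  }
}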

Filter Options

Filter options determine which files will/won’t be processed & published in your traversal. All filter options are optional. If you specify multiple filter options, files must comply with all of them to be processed & published.

  • includes: A list of patterns for the only file names that you want to include in your traversal.
  • excludes: A list of patterns for file names that you want to exclude from your traversal.
  • lastModifiedCutoff: Filter out files that haven’t been modified recently. For example, specify "1h", and only files modified within the last hour will be processed & published.
  • lastPublishedCutoff: Filter out files that were recently published by Lucille. For example, specify "1h", and only files published by Lucille more than an hour ago (or never published) will be processed & published. Requires you to provide state configuration, otherwise, it will not be enforced!
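A sketch of these options together (treat the filterOptions key and the pattern syntax as assumptions to verify against the FileConnector's JavaDocs):

filterOptions {
  includes: [".*\\.csv$"]       # only traverse CSV files
  excludes: [".*/archive/.*"]   # skip anything under an archive directory
  lastModifiedCutoff: "1h"
  lastPublishedCutoff: "1h"     # only enforced when state is configured (see below)
}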

State

The File Connector can keep track of when files were last known to be published by Lucille. This allows you to use FilterOptions.lastPublishedCutoff and avoid repeatedly publishing the same files in a short period of time.

In order to use state with the File Connector, you’ll need to configure a connection to a JDBC-compatible database. The database can be embedded, or it can be remote.
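A sketch of state configuration pointing at an embedded database (every property name here is an assumption - consult the FileConnector documentation for the actual keys):

state {
  driver: "org.h2.Driver"                             # assumption
  connectionString: "jdbc:h2:./state/file-connector"  # assumption
  jdbcUser: ""                                        # assumption
  jdbcPassword: ""                                    # assumption
}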

It’s important to note that File Connector state is designed to be efficient and lightweight. As such, keep a few points in mind:

  1. Files that were recently moved or renamed will not have lastPublishedCutoff applied.
  2. If you are using state, it is important that you consistently capitalize directory names in your pathsToStorage.
  3. Each database table should be used for only one connector configuration.

3.2.6 - Document

All about the concept of a Document in Lucille.


3.2.7 - Events


3.2.8 - Indexers

A thread that retrieves completed documents and sends them in batches to a search engine.

Lucille Indexers

The core Lucille project contains a number of commonly used Indexers. Additional Indexers are provided as optional plugin modules.

Lucille Indexers (Core)

Lucille Indexers (Plugins)

3.2.9 - Pipeline


3.2.10 - Publisher


3.2.11 - Runner


4 - Contribution Guidelines

How to develop new features and coding standards.

4.1 - Setup & Standards

Coding standards for Lucille and how to set up for local development.

Coding Standards

Local Developer Setup

Prerequisite(s):

  • IntelliJ application installed on machine
  • Java project

Setting up Google Code Formatting Scheme

  • Make sure that IntelliJ is open
  • Go to the following link: styleguide/intellij-java-google-style.xml at gh-pages · google/styleguide
  • Download the .xml file
  • Open the file in an editor of your choice
  • Navigate to the <option …> tag named 'Right Margin' and edit the value to be 132 (it defaults to 100)
  • Save the file
  • In IntelliJ IDEA, navigate to Settings/Preferences → Editor → Code Style → Java
  • Click on the gear icon in the right panel and drill down to Import Scheme, then IntelliJ IDEA Code Style XML
  • In the file explorer that opens, navigate to the .xml file you downloaded
  • After selecting the file, you should see a pop-up allowing you to name the scheme; choose a name and click 'OK'
  • Click 'Apply' in the Settings panel
  • Restart the IDE; you can then use the 'Reformat Code' action to apply the scheme to your code

Excluding Non-Java Files

Assuming that we don't want to auto-format non-Java files via a directory-level 'Reformat Code' action, we need to exclude all other file types from being reformatted:

  • Navigate to Settings | Preferences in Intellij IDEA

  • Navigate to Editor → Code Style

  • Click on the tab on the right window labeled ‘Formatter’

  • In the ‘Do Not Format’ text box, paste the following and click ‘Apply'

    *.{yml,xml,md,json,yaml,jsonl,sql}

  • A restart of Intellij may be required to see changes

This method may prove too complicated, especially as new file types are added to the codebase. Therefore, consider the following simpler method instead:

  • When clicking on ‘Reformat Code’ at the directory level, a window will pop up
  • Under the filter sections in the window, select the ‘File Mask(s)’ option and set the value to ‘*.java’
  • This will INCLUDE all .java files in your reformatting

Eclipse Users

The linked post, Eclipse import conf .xml files, details how Eclipse users can use the same .xml file for code formatting in the Eclipse IDE.

4.2 - Developing New Connectors

The basics of how to develop connectors for Lucille.

Base Classes

Unit Tests

4.3 - Developing New Stages

The basics of how to develop stages for Lucille.

Introduction

This guide covers the basics of how to develop a new stage for Lucille and what some of its required and optional components are. After reading this, you should be able to start development with a good foundation on how testing works, how configuration is handled, and what features the Stage class affords us.

Prerequisite(s)

  • An up-to-date version of Lucille that has been appropriately installed

  • Please refer to Quick Start Guides - Lucille Local Mode for more info on how to set up Lucille

  • Understanding of Java programming language

  • Understanding of Lucille

Developing a Stage:

Note: Angle brackets (<…>) are used to show placeholder information that is meant to be replaced by you when using the code snippets.

  1. Create a new Java class underneath the following directory in Lucille; the name should be in PascalCase
    • lucille/lucille-core/src/main/java/com/kmwllc/lucille/stage/
  2. The new class should extend the Stage abstract class, which can be found at the following directory in Lucille
    • lucille/lucille-core/src/main/java/com/kmwllc/lucille/core/
    • The class declaration should look like the following: public class <StageName> extends Stage {
  3. Create a constructor; it should take in a config variable of type Config
    • The constructor declaration should look similar to this: public <StageName>(Config config) {
  4. Next, call super() to reference the protected super constructor from the Stage class
    • We want to provide this constructor with the aforementioned config, but also with the names of any required or optional parameters / parents that we want to make configurable via the config
    • Note that the properties should be in camelCase
    • Example provided below:
super(config, new StageSpec()
  .withRequiredProperties("<requiredProperty>")
  .withOptionalProperties("<optionalProperty>")
  .withRequiredParents("<requiredParent>"));
  5. Define instance variables that correlate to config properties you wish to request from the user (both required and optional)
    • The following code shows examples of common patterns used to extract config parameters; reference the Config code for more methods

config.getConfig("<nameOfProperty>").root().unwrapped(); // for required properties

config.hasPath("<nameOfProperty>") ? config.getInt("<nameOfProperty>") : <defaultValue>; // optional property with a default

ConfigUtils.getOrDefault(config, "<nameOfProperty>", <defaultValue>); // optional property with a default, via ConfigUtils

  6. Override the abstract method processDocument in your class

    • This method is where we make changes to fields on the document and potentially create child documents
    • This method should return null if we do not intend to generate child documents
      • Reference the Javadoc in the Stage class for more information on how to support this functionality
  7. Add appropriate comments to explain any important code in the class, and add Javadoc before the class declaration

    • Javadoc should explain the behaviour of this Stage and should also list config parameters, their types, whether they are optional, and a short description
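Putting the steps above together, a minimal Stage might look like the following sketch. Treat it as illustrative: the Document method names (has, getString, setField), the processDocument return type, and the import paths are assumptions to check against the actual Lucille source.

package com.kmwllc.lucille.stage;

import com.kmwllc.lucille.core.Document;
import com.kmwllc.lucille.core.Stage;
import com.kmwllc.lucille.core.StageException;
import com.typesafe.config.Config;
import java.util.Iterator;

/**
 * Copies the value of the "source" field into the "destination" field.
 * Config parameters: source (String, required), destination (String, required).
 */
public class CopyFieldExample extends Stage {

  private final String source;
  private final String destination;

  public CopyFieldExample(Config config) {
    super(config, new StageSpec().withRequiredProperties("source", "destination"));
    this.source = config.getString("source");
    this.destination = config.getString("destination");
  }

  @Override
  public Iterator<Document> processDocument(Document doc) throws StageException {
    if (doc.has(source)) {
      doc.setField(destination, doc.getString(source));
    }
    return null; // no child documents generated
  }
}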

Unit Testing:

Lucille uses JUnit as its testing framework; please refer to JUnit best practices when making tests.

  1. Create a new Java class underneath the following directory in Lucille; name should be the same as the Stage’s name, with Test appended to the end

    • lucille/lucille-core/src/test/java/com/kmw/lucille/stage/
  2. Create a new directory underneath the following directory in Lucille; the name should be the same as the testing class's name

    • lucille/lucille-core/src/test/resources/

    • Underneath this directory create a new file called config.conf; this will be an example config that will be used in our test class; you can create more for further testing

  3. The following code snippet can be used to create a new Stage with the provided config name

StageFactory.of(<StageName>.class).get("<StageTestName>/config.conf");

  4. The following code snippet will process a given Document; reference the Document class for more information

s.processDocument(d); // where s is the Stage and d is the Document

The following are standards for testing in Lucille:

  • There should be at least one unit test for a Stage

  • Recommended to aim for 80% code coverage, but not a hard cutoff

  • When testing services, use mocks and spies

  • Tests should test notable exceptions thrown by Stage class code

  • Tests should cover all logical aspects of a Stage’s function

  • Tests should be deterministic

  • Code coverage should extend not only to lucille-core but also to the plugin modules
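Putting these standards into practice, a minimal test for the hypothetical CopyFieldExample Stage sketched above might look like this (Document.create and the import paths are assumptions):

import static org.junit.Assert.assertEquals;

import com.kmwllc.lucille.core.Document;
import com.kmwllc.lucille.core.Stage;
import org.junit.Test;

public class CopyFieldExampleTest {

  @Test
  public void testCopiesField() throws Exception {
    Stage stage = StageFactory.of(CopyFieldExample.class).get("CopyFieldExampleTest/config.conf");
    Document doc = Document.create("doc1"); // factory method name is an assumption
    doc.setField("source", "value");
    stage.processDocument(doc);
    assertEquals("value", doc.getString("destination"));
  }
}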

Extra Stage Resources:

  • The Stage class has both start and stop methods; both are helpful when we want to set up or tear down objects

  • Lucille also has a StageUtils class that has some methods that may prove useful in development

4.4 - Developing New Indexers

The basics of how to develop indexers for Lucille.

Base Classes

Unit Tests

5 - Running Lucille in Production


5.1 - Logging

Interpreting and using Lucille logs.

Lucille Logs

There are many ways to use the logs output by Lucille. Lucille has two main loggers for tracking your run: the Root logger and the DocLogger.

The Root Logger

The Root logger outputs log statements from a variety of sources, allowing you to track your Lucille run. For example, the Root logger is where you’ll get intermittent updates about your pipeline’s performance, warnings from Stages or Indexers in certain situations, etc.

The Doc Logger

The DocLogger is very verbose - it tracks the lifecycle of each Document in your Lucille pipeline. For example, a log statement is made when a Document is created, before it is published, and before and after a Stage operates on it. As you can imagine, this results in many log statements, so it is recommended that these logs be stored in a file rather than printed to the console. Logs from the DocLogger will primarily be INFO-level; very rarely, an ERROR-level log will be made for a Document.

Log Files

Lucille can store logs in a file as plain text or as JSON objects. When storing logs as JSON, each line will be a JSON object representing a log statement in accordance with the EcsLayout. By modifying the log4j2.xml, you can control which Loggers are enabled/disabled, where their logs get stored, and what level of logs you want to process.

Logstash & OpenSearch

If you store your logs as JSON, you can easily run Logstash on the file(s), allowing you to index them into a Search Engine of your choice for enhanced discovery and analysis. This can be particularly informative when working with the DocLogger. For example, you might:

  • Trace a specific Document’s lifecycle by querying by the Document’s ID.
  • Track the performance of your Lucille pipeline using the timestamps of the logs, and identify potential bottlenecks.
  • Create Dashboards, allowing you to monitor your pipeline for potential warnings / errors for a repeated Lucille run.

Here is an example pipeline.conf for ingesting your Lucille logs into a local OpenSearch instance:

input {
  file {
    path => "/lucille/lucille-examples/lucille-simple-csv-solr-example/log/com.kmwllc.lucille-json*"
    mode => "read"
    codec => "json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    exit_after_read => "true"
  }
}
output {
  stdout {
    codec => rubydebug
  }
  opensearch {
    hosts => "http://localhost:9200"
    index => "logs"
    ssl_certificate_verification => false
    ssl => false
  }
}

Note that this pipeline will delete the log files after they are ingested, and that SSL is disabled.

Here are some queries you might run (using curl):

curl -XGET "http://localhost:9200/logs/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "term": {
      "id.keyword": "songs.csv-1"
    }
  }
}'

This query will only return log statements where the id of the Document being processed is “songs.csv-1”, a Document ID from the Lucille Simple CSV example. This allows you to easily track the lifecycle of the Document as it was processed and published.

curl -XGET "http://localhost:9200/logs/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "match": {
      "message": "FileHandler"        
    }
  }
}'

This query will return log statements with “FileHandler” in the message. This allows you to track specifically when Documents were created from a JSON, CSV, or XML FileHandler.

OpenSearch Dashboards

By filling an OpenSearch Index with your JSON log statements, you can also build OpenSearch Dashboards to monitor your Lucille runs and analyze their performance.

Setting up the Index Pattern

Once you have OpenSearch Dashboards open, click on the menu in the top left. Select Dashboards Management, and then, on this new page, select Index Patterns. Create an Index Pattern for your logs index. As you click through, be sure to choose @timestamp in the dropdown for "Time Field".

(Timestamps are very important in OpenSearch Dashboards. Many features will use the timestamp of a Document in some form.)

Discovery

OpenSearch Dashboards has two major “features” - Dashboards and Discover. We’ll start with Discover.

At the top right of the screen, you’ll see a time range. Only logs within this time range will be shown. It defaults to only include logs within the last 15 minutes, so you’ll likely need to change it. Set absolute time ranges that cover your Lucille run. (The UI for setting these times can be a bit finicky, but keep trying.)

Like before, we can trace the entire lifecycle of a single Document. In the search bar, type in: id:"songs.csv-1". Now, you should only see the log statements relevant to this Document. Each entry will be a large blob of text representing the entire Document. You can select certain fields on the left side of your screen to get a more focused view of each statement.

You can also sort the statements in a certain order. If you hover over Time at the top of the table, you can click the arrow to sort the logs by their timestamps.

Dashboards

Now, let’s see how Dashboards could help you monitor a certain Lucille run. Imagine you have scheduled a Lucille run to execute every day. We can create a Dashboard that will help us quickly see how many warnings/errors took place.

Click add in the top right to create a new panel. Click + Create new, then choose visualization, and then metric. Choose your logs index as the source.

In this new window, under the metric settings, click Buckets. Choose a Filters aggregation, and in the filter, enter log.level:"WARN" (using Dashboards Query Language). This will display the number of log statements with a level of "WARN".

A panel displaying the number of warnings.

You can then repeat this process, but for logs with a level of ERROR.

Now, let’s make a chart that’ll display the number of warnings per day. Create a new visualization - this time, a vertical bar. Go to buckets, and add X-axis. Configure it like this:

Configuration for the X-axis bucket.

Then, add another bucket - a split series. Configure it just like the panel - a Filters aggregation, with log.level:"WARN". Now you’ll have a chart tracking the number of warnings per day. And, again, you can do the same for ERROR-level logs.

Your dashboard might look something like this (populated with a bit of actual data):

A dashboard with four panels, the number of warnings, warnings per day, number of errors, and errors per day.

5.2 - Deploying


5.3 - Monitoring


5.4 - Troubleshooting


Timeouts

Lucille has a default connector timeout of 24 hours, specified by runner.connectorTimeout. You can override this timeout in your configuration file.
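For example (whether the value is a duration string or a number of milliseconds is an assumption to verify against the runner documentation):

runner {
  connectorTimeout: 43200000  # assumed milliseconds (here, 12 hours)
}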