Connectors

A component that retrieves data from a source system, packages the data into “documents,” and publishes them.

1: RSS Connector
2: Database Connector
3: File Connector

Lucille Connectors

Lucille Connectors are components that retrieve data from a source system, packages the data into “Documents”, and publishes them to a pipeline.

To configure a Connector, you have to provide its class (under class) in its config. You also need to specify a name for the Connector. Optionally, you can specify the pipeline, a docIdPrefix, and whether the Connector requires a Publisher to collapse.

You’ll also provide the parameters needed by the Connector as well. For example, the SequenceConnector requires one parameter, numDocs, and accepts an optional parameter, startWith. So, a SequenceConnector Config would look something like this:

{
  name: "Sequence-Connector-1"
  class: "com.kmwllc.lucille.connector.SequenceConnector"
  docIdPrefix: "sequence-connector-1-"
  pipeline: "pipeline1"
  numDocs: 500
  startWith: 50
}

The lucille-core module contains a number of commonly used connectors. Additional connectors with a large number of dependencies are provided as optional plugin modules.

Lucille Connectors (Core)

Abstract Connector - Base implementation for a Connector.
Database Connector
File Connector
Sequence Connector - Generates a certain number of empty Documents.
Solr Connector
RSS Connector

The following connectors are deprecated. Use FileConnector instead, along with a corresponding FileHandler.

Lucille Connectors (Plugins)

Parquet Connector

1 - RSS Connector

A Connector that publishes Documents representing items found in an RSS feed.

The RSSConnector

The RSSConnector allows you to publish Documents representing the items found in an RSS feed of your choice. Each Document will (optionally) contain fields from the RSS items, like the author, description, title, etc. By default, the Document IDs will be the item’s guid, which should be a unique identifier for the RSS item.

You can configure the RSSConnector to only publish recent RSS items, based on the pubDate found on the items. Also, it can run incrementally, refreshing the RSS feed after a certain amount of time until you manually stop it. The RSSConnector will avoid publishing Documents for the same RSS item more than once.

The Documents published may have any of the following fields, depending on how the RSS feed is structured:

author (String)
categories (List<String>)
comments (List<String>)
content (String)
description (String)
enclosures (List<JsonNode>). Each JsonNode contains:
- type (String)
- url (String)
- May contain length (Long)
guid (String)
isPermaLink (Boolean)
link (String)
title (String)
pubDate (Instant)

2 - Database Connector

Database Connector

3 - File Connector

A Connector that, given a path to S3, Azure, Google Cloud, or the local file system, traverses the content at the given path and publishes Lucille documents representing its findings.

Source Code

The file connector traverses a file system and publishes Lucille documents representing its findings. In your Configuration, specify pathsToStorage, representing the path(s) you want to traverse. Each path can be a path to the local file system or a URI for a supported cloud provider.

Working with Cloud Storage

When you are providing FileConnector with URIs to cloud storage, you also need to apply the appropriate configuration for any cloud providers used. For each provider, you’ll need to provide a form of authentication; you can optionally specify the maximum number of files (maxNumOfPages) that Lucille will load into memory for a given request.

Azure: Specify the needed options in azure in your Config. You must provide connectionString, or you must provide accountName and accountKey.
Google: Specify the needed options in gcp in your Config. You must provide pathToServiceKey.
S3: Specify the needed options in s3 in your Config. You must provide accessKeyId, secretAccessKey, and region.
For each of these providers, in their configuration, you can optionally include maxNumOfPages as well.

Applying FileHandlers

Some of the files that your FileConnector encounters will, themselves, contain data that you want to extract more documents from! For example, the FileConnector may encounter a .csv file, where each row itself represents a Document to be published. This is where FileHandlers come in - they will individually process these files and create more Lucille documents from their data. See File Handlers for more.

In order to use File Handlers, you need to specify the appropriate configuration within your Config - specifically, each File Handler you want to use will be a map within the fileOptions you specify. You can use csv, json, or xml. See the documentation for each File Handler to see what arguments are needed / accepted.

File Options

File options determine how you handle and process files you encounter during a traversal. Some commonly used options include:

getFileContent: Whether, during traversal, the FileConnector should add an array of bytes representing the file’s contents to the Lucille document it publishes. This will slow down traversal significantly and is resource intensive. On the cloud, this will download the file contents.
handleArchivedFiles/handleCompressedFiles: Whether you want to handle archive or compressed files, respectively, during your traversal. For cloud files, this will download the file’s contents.
moveToAfterProcessing: A path to move files to after processing.
moveToErrorFolder: A path to move files to if an error occurs.

Filter options determine which files will/won’t be processed & published in your traversal. All filter options are optional. If you specify multiple filter options, files must comply with all of them to be processed & published.

includes: A list of patterns for the only file names that you want to include in your traversal.
excludes: A list of patterns for file names that you want to exclude from your traversal.
lastModifiedCutoff: Filter out files that haven’t been modified recently. For example, specify "1h", and only files modified within the last hour will be processed & published.
lastPublishedCutoff: Filter out files that were recently published by Lucille. For example, specify "1h", and only files published by Lucille more than an hour ago (or never published) will be processed & published. Requires you to provide state configuration, otherwise, it will not be enforced!

State

The File Connector can keep track of when files were last known to be published by Lucille. This allows you to use FilterOptions.lastPublishedCutoff and avoid repeatedly publishing the same files in a short period of time.

In order to use state with the File Connector, you’ll need to configure a connection to a JDBC-compatible database. The database can be embedded, or it can be remote.

It’s important to note that File Connector state is designed to be efficient and lightweight. As such, keep a few points in mind:

Files that were recently moved / renamed files will not have the lastPublishedCutoff applied.
In your File Connector configuration, it is important that you consistently capitalize directory names in your pathToStorage, if you are using state.
Each database table should be used for only one connector configuration.