Connectors
A component that retrieves data from a source system, packages the data into “documents,” and publishes them.
Lucille Connectors
Lucille Connectors are components that retrieve data from a source system, package the data into “Documents”, and publish them to a pipeline.
To configure a Connector, you must provide its class (under the class key) in its config. You also need to specify a name for the Connector. Optionally, you can specify the pipeline, a docIdPrefix, and whether the Connector requires a Publisher to collapse.
You’ll also provide the parameters needed by the Connector. For example, the SequenceConnector requires one parameter, numDocs, and accepts an optional parameter, startWith. So, a SequenceConnector Config would look something like this:
{
name: "Sequence-Connector-1"
class: "com.kmwllc.lucille.connector.SequenceConnector"
docIdPrefix: "sequence-connector-1-"
pipeline: "pipeline1"
numDocs: 500
startWith: 50
}
The lucille-core module contains a number of commonly used connectors. Additional connectors with a large number of dependencies are provided as optional plugin modules.
Lucille Connectors (Core)
The following connectors are deprecated. Use FileConnector instead, along with a corresponding FileHandler.
Lucille Connectors (Plugins)
1 - RSS Connector
A Connector that publishes Documents representing items found in an RSS feed.
The RSSConnector allows you to publish Documents representing the items found in an RSS feed of your choice. Each Document will (optionally) contain fields from the RSS items, like the author, description, title, etc. By default, the Document IDs will be the item’s guid, which should be a unique identifier for the RSS item.
You can configure the RSSConnector to only publish recent RSS items, based on the pubDate found on the items. Also, it can run incrementally, refreshing the RSS feed after a certain amount of time until you manually stop it. The RSSConnector will avoid publishing Documents for the same RSS item more than once.
The Documents published may have any of the following fields, depending on how the RSS feed is structured:
- author (String)
- categories (List<String>)
- comments (List<String>)
- content (String)
- description (String)
- enclosures (List<JsonNode>). Each JsonNode contains:
  - type (String)
  - url (String)
  - May contain length (Long)
- guid (String)
- isPermaLink (Boolean)
- link (String)
- title (String)
- pubDate (Instant)
3 - File Connector
A Connector that, given a path to S3, Azure, Google Cloud, or the local file system, traverses the content at the given path and publishes Lucille documents representing its findings.
The file connector traverses a file system and publishes Lucille documents representing its findings. In your Configuration, specify pathsToStorage, representing the path(s) you want to traverse. Each path can be a path to the local file system or a URI for a supported cloud provider.
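As a rough sketch, a minimal FileConnector config might look like the following. The class name is an assumption (it mirrors the package of the SequenceConnector example), pathsToStorage is assumed to accept a list, and the connector name, pipeline name, and directory are hypothetical placeholders.

```hocon
{
  # Hypothetical connector name.
  name: "file-connector-1"
  # Assumed class name, following the package used by SequenceConnector.
  class: "com.kmwllc.lucille.connector.FileConnector"
  # Hypothetical pipeline name.
  pipeline: "pipeline1"
  # Assumed to be a list of paths; this local directory is a placeholder.
  pathsToStorage: ["/data/documents"]
}
```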
Working with Cloud Storage
When you are providing FileConnector with URIs to cloud storage, you also need to apply the appropriate configuration for any cloud providers used. For each provider, you’ll need to provide a form of authentication; you can optionally specify the maximum number of files (maxNumOfPages) that Lucille will load into memory for a given request.
- Azure: Specify the needed options in azure in your Config. You must provide connectionString, or you must provide accountName and accountKey.
- Google: Specify the needed options in gcp in your Config. You must provide pathToServiceKey.
- S3: Specify the needed options in s3 in your Config. You must provide accessKeyId, secretAccessKey, and region.
- For each of these providers, in their configuration, you can optionally include maxNumOfPages as well.
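To illustrate the provider options above, here is a hedged sketch of a FileConnector config that traverses S3. The class name, the s3:// URI form, the bucket name, the region value, and the maxNumOfPages value are all assumptions made for illustration; the credentials are read from environment variables using standard HOCON substitution rather than written into the file.

```hocon
{
  # Hypothetical connector and pipeline names.
  name: "file-connector-s3"
  # Assumed class name, following the package used by SequenceConnector.
  class: "com.kmwllc.lucille.connector.FileConnector"
  pipeline: "pipeline1"
  # Assumed URI form; the bucket and prefix are placeholders.
  pathsToStorage: ["s3://example-bucket/reports/"]
  s3 {
    # HOCON substitutions resolved from environment variables.
    accessKeyId: ${AWS_ACCESS_KEY_ID}
    secretAccessKey: ${AWS_SECRET_ACCESS_KEY}
    # Placeholder region.
    region: "us-east-1"
    # Optional; 100 is an arbitrary illustrative value.
    maxNumOfPages: 100
  }
}
```

An azure or gcp block would follow the same shape, using the keys listed for each provider.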
Applying FileHandlers
Some of the files that your FileConnector encounters will, themselves, contain data that you want to extract more documents from! For example, the FileConnector may encounter a .csv file, where each row itself represents a Document to be published. This is where FileHandlers come in - they will individually process these files and create more Lucille documents from their data. See File Handlers for more.
In order to use File Handlers, you need to specify the appropriate configuration within your Config - specifically, each File Handler you want to use will be a map within the fileOptions you specify. You can use csv, json, or xml.
See the documentation for each File Handler to see what arguments are needed / accepted.
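A minimal sketch of enabling File Handlers might look like this. The class name, connector name, and path are assumptions, and the handler maps are left empty as placeholders: a real handler may require arguments, which are described in that handler's own documentation.

```hocon
{
  # Hypothetical connector and pipeline names.
  name: "file-connector-handlers"
  # Assumed class name, following the package used by SequenceConnector.
  class: "com.kmwllc.lucille.connector.FileConnector"
  pipeline: "pipeline1"
  # Placeholder local directory.
  pathsToStorage: ["/data/exports"]
  fileOptions {
    # Each File Handler is a map inside fileOptions.
    # Empty maps are placeholders; fill in the arguments each handler needs.
    csv { }
    json { }
  }
}
```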
File Options
File options determine how you handle and process files you encounter during a traversal. Some commonly used options include:
- getFileContent: Whether, during traversal, the FileConnector should add an array of bytes representing the file’s contents to the Lucille document it publishes. This will slow down traversal significantly and is resource intensive. On the cloud, this will download the file contents.
- handleArchivedFiles / handleCompressedFiles: Whether you want to handle archive or compressed files, respectively, during your traversal. For cloud files, this will download the file’s contents.
- moveToAfterProcessing: A path to move files to after processing.
- moveToErrorFolder: A path to move files to if an error occurs.
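The options above could be combined roughly as follows. This sketch assumes the options live inside the fileOptions map and that the first three are booleans; the class name, connector name, and all paths are hypothetical.

```hocon
{
  # Hypothetical connector and pipeline names.
  name: "file-connector-options"
  # Assumed class name, following the package used by SequenceConnector.
  class: "com.kmwllc.lucille.connector.FileConnector"
  pipeline: "pipeline1"
  # Placeholder local directory.
  pathsToStorage: ["/data/inbox"]
  # Assumed location for these options.
  fileOptions {
    # Skip raw bytes to keep traversal fast and light on memory.
    getFileContent: false
    # Look inside archives and compressed files.
    handleArchivedFiles: true
    handleCompressedFiles: true
    # Placeholder destination paths.
    moveToAfterProcessing: "/data/processed"
    moveToErrorFolder: "/data/errors"
  }
}
```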
Filter Options
Filter options determine which files will/won’t be processed & published in your traversal. All filter options are optional.
If you specify multiple filter options, files must comply with all of them to be processed & published.
- includes: A list of patterns for the only file names that you want to include in your traversal.
- excludes: A list of patterns for file names that you want to exclude from your traversal.
- lastModifiedCutoff: Filter out files that haven’t been modified recently. For example, specify "1h", and only files modified within the last hour will be processed & published.
- lastPublishedCutoff: Filter out files that were recently published by Lucille. For example, specify "1h", and only files published by Lucille more than an hour ago (or never published) will be processed & published. Requires you to provide state configuration; otherwise, it will not be enforced!
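As an illustration of combining filters, the sketch below would only process CSV files that are not drafts and that were modified within the last hour. It assumes the filters live under a filterOptions key and that patterns are regular expressions; the class name, connector name, path, and patterns are hypothetical.

```hocon
{
  # Hypothetical connector and pipeline names.
  name: "file-connector-filtered"
  # Assumed class name, following the package used by SequenceConnector.
  class: "com.kmwllc.lucille.connector.FileConnector"
  pipeline: "pipeline1"
  # Placeholder local directory.
  pathsToStorage: ["/data/exports"]
  # Assumed key name for the filter options.
  filterOptions {
    # Assumed to be regular expressions; only file names ending in .csv pass.
    includes: [".*\\.csv$"]
    # Drop any file whose name ends in _draft.csv.
    excludes: [".*_draft\\.csv$"]
    # Only files modified within the last hour.
    lastModifiedCutoff: "1h"
  }
}
```

Because a file must satisfy every filter, a recently modified file named sales_draft.csv would still be skipped under this sketch.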
State
The File Connector can keep track of when files were last known to be published by Lucille. This allows you to use FilterOptions.lastPublishedCutoff and avoid repeatedly publishing the same files in a short period of time.
In order to use state with the File Connector, you’ll need to configure a connection to a JDBC-compatible database. The database can be embedded, or it can be remote.
It’s important to note that File Connector state is designed to be efficient and lightweight. As such, keep a few points in mind:
- Files that were recently moved / renamed will not have the lastPublishedCutoff applied.
- In your File Connector configuration, it is important that you consistently capitalize directory names in your pathsToStorage, if you are using state.
- Each database table should be used for only one connector configuration.