ExternalPython

Process each document in an external Python environment connected to Lucille through Py4J.

Why Use It?

ExternalPython delegates per-document processing to an external Python process using Py4J. Lucille serializes the Document into a request, calls a Python function, receives a JSON response, and applies that response back onto the document.
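The round trip described above can be approximated locally before wiring a script into a pipeline. The following is a minimal sketch, not the stage's actual implementation: run_locally is a hypothetical helper, it does not use Py4J, and it assumes that the function receives a dict-like view of the document's fields and must return something JSON-serializable.

```python
import json


def process_document(doc):
    # Example handler with the same shape as a script function:
    # it takes the document's fields and returns the fields to send back.
    return {"title": doc["title"].upper()}


def run_locally(func, doc):
    """Hypothetical helper: mimic the request/response round trip without Py4J."""
    # Round-trip the input through JSON so the function only sees plain JSON types.
    request = json.loads(json.dumps(doc))
    response = func(request)
    # json.dumps raises TypeError here if the response is not JSON-serializable.
    return json.loads(json.dumps(response))


result = run_locally(process_document, {"id": "doc-1", "title": "Hello"})
print(result)  # {'title': 'HELLO'}
```

A check like this catches return values that cannot be encoded as JSON (for example, sets or custom objects) before the script runs inside Lucille. It says nothing about how the stage applies the response to the document.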

When To Use It

Use ExternalPython when you need one or more of the following:

  • Real Python compatibility (including packages with native dependencies).
  • Dependency management via a requirements.txt installed into a managed venv.
  • Process isolation from the JVM.

When To Use EmbeddedPython Instead

Prefer EmbeddedPython over ExternalPython when one or more of the following apply:

  • You want minimal operational overhead (no ports, subprocess lifecycle, venv creation, or pip installs to manage).
  • Your script does not use external Python libraries or native dependencies that require a real Python environment.
  • You only need lightweight field enrichment or transformation.

Restrictions

Your Python script must be located in one of the following directories, resolved relative to the working directory from which Lucille is run:

  • ./python
  • ./src/main/resources
  • ./src/test/resources
  • ./src/test/resources/ExternalPythonTest (for testing)
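A script location can be checked against the list above before starting a run. This is a minimal sketch of such a pre-check, not Lucille's own validation logic: is_allowed_script_path and ALLOWED_DIRS are hypothetical names, and the sketch assumes that files in subdirectories of the listed directories also count as allowed.

```python
from pathlib import Path

# Directories from the list above, relative to the working directory.
# ./src/test/resources/ExternalPythonTest is covered by ./src/test/resources.
ALLOWED_DIRS = ("python", "src/main/resources", "src/test/resources")


def is_allowed_script_path(script_path, cwd=None):
    """Hypothetical pre-check: is script_path under one of the allowed directories?"""
    base = (Path(cwd) if cwd is not None else Path.cwd()).resolve()
    target = Path(script_path)
    if not target.is_absolute():
        target = base / target
    # resolve() normalizes ".." segments, so "python/../x.py" is rejected.
    target = target.resolve()
    for directory in ALLOWED_DIRS:
        root = (base / directory).resolve()
        try:
            target.relative_to(root)
            return True
        except ValueError:
            continue
    return False
```

For example, is_allowed_script_path("python/my_script.py") returns True when called from the project root, while a path such as "scripts/my_script.py" returns False.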

Example

Input Document

{
  "id": "doc-1",
  "title": "Hello",
  "author": "Test",
  "views": 123
}

Python Script

def process_document(doc):
    title = doc["title"]
  
    return {
        "title": title.upper()
    }

Python Returns

{
  "title": "HELLO"
}

Output Document

{
  "id": "doc-1",
  "title": "HELLO"
}

Config Parameters

{
  name: "ExternalPython-Example"
  class: "com.kmwllc.lucille.stage.ExternalPython"

  scriptPath: "/path/to/my_script.py"

  # Optional
  pythonExecutable: "python3"
  requirementsPath: "/path/to/requirements.txt"
  functionName: "process_document"
  port: 25333
}
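Because the stage is configured with a TCP port, a port that is already in use is a plausible source of startup failures. The sketch below is a generic check that a local port can currently be bound. port_is_free is a hypothetical helper, the check is unrelated to Lucille's own startup logic, and it assumes the port is bound on the loopback interface.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Hypothetical helper: return True if host:port can currently be bound."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another process (or an earlier run) already holds the port.
            return False
        return True
```

Calling port_is_free(25333) before launching a pipeline gives a quick indication of whether the configured port is available. The result is only a snapshot, since another process can claim the port afterwards.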

Example (NumPy)

Input Document

{
  "id": "doc-2",
  "values": [1, 2, 3, 4, 5]
}

Python Script

import numpy as np

def process_document(doc):
    arr = np.array(doc["values"], dtype=float)
  
    return {
        "values": doc["values"],
        "mean": float(np.mean(arr)),
        "stddev": float(np.std(arr))
    }

Output Document

{
  "id": "doc-2",
  "values": [1, 2, 3, 4, 5],
  "mean": 3.0,
  "stddev": 1.4142135623730951
}
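The stddev of about 1.41 in the output above is the population standard deviation, which is what np.std computes by default (ddof=0). The sketch below cross-checks the numbers using only the standard library, so it runs without NumPy installed. It is a verification aid, not part of the stage.

```python
import statistics

values = [1, 2, 3, 4, 5]

mean = statistics.fmean(values)             # arithmetic mean
population_std = statistics.pstdev(values)  # divides by n, like np.std(arr)
sample_std = statistics.stdev(values)       # divides by n - 1, like np.std(arr, ddof=1)

print(mean)                      # 3.0
print(round(population_std, 2))  # 1.41
print(round(sample_std, 2))      # 1.58
```

If the sample standard deviation is wanted instead, the script would pass ddof=1 to np.std.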

requirements.txt

numpy

Config Parameters

{
  name: "ExternalPython-Numpy"
  class: "com.kmwllc.lucille.stage.ExternalPython"

  scriptPath: "/path/to/my_numpy_script.py"
  requirementsPath: "/path/to/requirements.txt"
  
  # Optional
  pythonExecutable: "python3"
  functionName: "process_document"
  port: 25333
}