Document ingestion for search and analytics should be easier. JesterJ is an attempt to satisfy the document ingestion wishlist below, with an initial focus on ingesting documents from a local disk into a running instance of Apache Solr, and with the intention of supporting connectors for a wide variety of sources and destinations.

Project Status

First beta released on 9/13/2017! Please give it a try. Feedback, bugs, and contributions are all welcome! The 1.0 final release is expected before 2018.

Code

At this point, the interesting code starts here. The "control" area will eventually hold the web application used to control the system, but nothing there is in use yet.

The Document Ingestion Wish List

(Please pardon GitHub's terrible list formatting, which uses the same style for every level.)

  1. Collect Documents
    1. From one to many sources/locations
    2. From various types of sources
      1. Disk
      2. Website
      3. Database
      4. Many More...
    3. Detecting
      1. New Documents
      2. Changed Documents
      3. Deleted Documents
  2. Process Documents
    1. With an ordered set of steps
    2. Steps arranged in a Directed Acyclic Graph of
      1. Sources
      2. Transformers
      3. Routers
      4. Senders
    3. With visualization of the graph
    4. Capable of Live editing
    5. Capable of Parallel running
    6. Capable of Orderly Transition to a new configuration.
  3. Deliver Documents
    1. In batches of any size
    2. To destinations of various types
      1. Apache Solr
      2. ElasticSearch
      3. Other systems?
  4. With Fault Tolerance
    1. Vs. Document Failures
      1. TRANSIENT_FAILURE - Limited retries
      2. PERMANENT_FAILURE - Fatal Exceptions while processing, Retries exceeded
    2. Vs. Shutdown and Crashes via state tracking
      1. DETECTED - Document has been found and needs to be processed
      2. PROCESSING - Document has started traveling through the system
      3. SENT - Document has left the ingestion system and is being handled by the destination.
      4. LIVE - Document is now available to users in the destination system (Callback)
  5. With High Availability
    1. Avoiding Resource Exhaustion
    2. Minimize chance of program termination
    3. Provide redundancy as desired
    4. Self Healing
    5. Buffering of input to handle unexpected massive influx
  6. With good support for change
    1. Live Edits - for development and emergencies
    2. Parallel running and graceful cut over of versions of the DAG
    3. Configurations exportable in a text format that facilitates
      1. Version Control
      2. Reproducibility for QA
      3. Continuous deployment
    4. Visualization tools to make the DAG and the flow of documents easy to perceive
    5. Configuration in a format supported by common IDEs
    6. Support for re-indexing of previously indexed data.
  7. Easy deployment
    1. Nodes should be started by a single command line
    2. Nodes should automatically be made available in the control console without complicated setup
    3. All Node-Node and Node-Console communication should be secure by default.
    4. Node-Node communication should "just work" and not require user setup.
    5. Portions of the DAG should be assignable to specific nodes to support node types specialized for a particular task
  8. Monitoring and Control
    1. Control application that will surface editing, visualization and alerting features.
    2. Generic support for integration with existing monitoring solutions
    3. Distinguish types of failures
      1. Hard Failures (Failed docs, lost nodes, Disk space exhaustion)
      2. Performance Failures (Backlog growth, Unacceptable indexing latency)
      3. Detection of likely future failures: CPU, Disk, Memory
    4. Support for performance tuning & investigation
      1. Status of all nodes
      2. Flow statistics
      3. Detection of the rate-limiting step for a specific path through the DAG, based on stats.
    5. Orderly start and stop of the system
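
To make items 2 (process documents through a DAG of steps) and 4 (fault tolerance via state tracking) more concrete, here is a minimal sketch in Java. All names here (`Plan`, `Step`, `Document`, `Status`) are hypothetical illustrations of the ideas above, not the actual JesterJ API; the statuses mirror the wishlist's state-tracking list.

```java
import java.util.ArrayList;
import java.util.List;

// Statuses from the fault-tolerance wishlist (item 4). In a real system,
// LIVE would be set by a callback from the destination, and
// TRANSIENT_FAILURE would trigger a limited number of retries.
enum Status { DETECTED, PROCESSING, SENT, LIVE, TRANSIENT_FAILURE, PERMANENT_FAILURE }

// A document being ingested, with its tracked state.
class Document {
    final String id;
    String content;
    Status status = Status.DETECTED;
    Document(String id, String content) { this.id = id; this.content = content; }
}

// One step in the processing graph (a transformer, router, or sender).
interface Step { Document process(Document doc); }

// A plan: here a simple linear chain of steps standing in for the full DAG.
class Plan {
    private final List<Step> steps = new ArrayList<>();
    Plan addStep(Step s) { steps.add(s); return this; }

    Document execute(Document doc) {
        doc.status = Status.PROCESSING;
        try {
            for (Step s : steps) doc = s.process(doc);
            doc.status = Status.SENT; // handed off to the destination system
        } catch (RuntimeException e) {
            doc.status = Status.PERMANENT_FAILURE; // retries exceeded / fatal
        }
        return doc;
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        Plan plan = new Plan()
            .addStep(d -> { d.content = d.content.trim(); return d; })        // transformer
            .addStep(d -> { d.content = d.content.toLowerCase(); return d; }); // transformer
        Document doc = plan.execute(new Document("doc1", "  Hello World  "));
        System.out.println(doc.content + " " + doc.status); // prints "hello world SENT"
    }
}
```

A real plan would be a directed acyclic graph rather than a linear chain, with routers fanning documents out to multiple downstream steps, but the status lifecycle (DETECTED → PROCESSING → SENT → LIVE) is the same idea in either shape.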