Introduction

Building a cluster of Apache NiFi instances provides several advantages over a single instance:

  • Increased throughput. Depending on the amount of data that you want to ingest, you might find that a single NiFi instance, installed on a single server, does not provide sufficient throughput. You can increase the number of threads available to your processors, so that they can process multiple documents in parallel, but you are limited by the resources available on a single server. Distributing work across several servers allows you to increase throughput.
  • Failover. If one of your servers becomes unavailable, the remaining nodes in the cluster continue ingesting data.

The nodes in an Apache NiFi cluster are managed by Apache ZooKeeper. The Apache NiFi documentation recommends that you run ZooKeeper on either three or five nodes. You can add more nodes to your cluster, but those nodes do not need to run ZooKeeper. For information about how to configure the initial (ZooKeeper) nodes, see Set up a NiFi Cluster. Then, if you want to add more nodes to your cluster, see Set up Additional Nodes.
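The recommendation of three or five ZooKeeper nodes is easier to follow with some arithmetic. ZooKeeper remains available only while a majority of its nodes can communicate with each other. The following Python sketch (the function names are illustrative and are not part of NiFi or ZooKeeper) calculates how many ZooKeeper nodes can fail for a given ensemble size:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of ZooKeeper nodes that forms a majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of ZooKeeper nodes that can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare some ensemble sizes (illustrative values only).
for size in (1, 3, 4, 5):
    print(f"{size} node(s): quorum={quorum_size(size)}, "
          f"tolerated failures={tolerated_failures(size)}")
```

An ensemble of four nodes tolerates the same number of failures as an ensemble of three, which is why odd ensemble sizes are usually preferred.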

TIP: You can find more detailed information about clustering in the Apache NiFi documentation. This document describes only the minimum requirements.

Changes to your Data Flow

All of the nodes in a NiFi cluster run the same data flow. Any changes that you make to your data flow on one node are replicated on the other nodes.

TIP: Although an IDOL Connector appears in the data flow on all nodes, it runs on only one node at a time. If that node becomes unavailable, another node takes over. Subsequent processors, such as KeyView Extraction and Filtering, run simultaneously on all of the nodes in your cluster.

Because a connector runs on a single node, the FlowFiles that it produces must be distributed across the nodes in the cluster so that subsequent processors on every node have work to do. This does not happen automatically. OpenText recommends that you configure the output connection from the connector and set the Load Balance Strategy to Round Robin. For information about how to do this, refer to the Apache NiFi documentation.
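To illustrate the effect of a round-robin strategy, the following Python sketch assigns a list of items to nodes in turn. It is a conceptual model only, not NiFi's implementation, and the node names and FlowFile identifiers are hypothetical:

```python
from collections import defaultdict
from itertools import cycle


def distribute_round_robin(flowfiles, nodes):
    """Assign each FlowFile to the next node in turn (conceptual model)."""
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = defaultdict(list)
    next_node = cycle(nodes)  # repeats the node list indefinitely
    for flowfile in flowfiles:
        assignment[next(next_node)].append(flowfile)
    return dict(assignment)


# Hypothetical example: seven FlowFiles produced by a connector on one node.
nodes = ["node-1", "node-2", "node-3"]
flowfiles = [f"flowfile-{i}" for i in range(7)]
for node, assigned in distribute_round_robin(flowfiles, nodes).items():
    print(node, assigned)
```

In this model each node receives roughly the same number of FlowFiles, regardless of which node produced them.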

To run a NiFi cluster, you must use an external database for storing state information. Many NiFi Ingest processors need to store state information. For example, IDOL Connectors store information about what they have retrieved from a data repository. This information needs to be in an external database so that it is accessible to all of the nodes in the cluster. Configure the connection to your database server by creating a database service (see Create a Database Service). When you configure the IDOL Connectors in your data flow, set the property State Database Service to the name of the database service that you created.

The files that your connectors download from your data repositories must also be accessible to all of the nodes in a cluster. When you configure your connectors, set the property adv:IngestSharedPath to a location, such as a shared folder, that is accessible from all of the nodes in the cluster. Alternatively, set the property adv:FlowFileEmbedFiles to TRUE, so that the binary file content is included in the FlowFiles created by the connector. For more information about these properties, see Advanced Connector Properties.
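If you use a shared folder, you might want to confirm that each node can actually write to it before you start the data flow. The following Python sketch is one possible check that you could run on each node. The function name and the example path are hypothetical, and the check is not part of NiFi or NiFi Ingest:

```python
import os
import tempfile


def shared_path_is_usable(path: str) -> bool:
    """Return True if this node can create and remove a file under path."""
    if not os.path.isdir(path):
        return False
    try:
        # Create a temporary file in the folder; it is deleted on exit.
        with tempfile.NamedTemporaryFile(dir=path, prefix="share-check-"):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical location; replace it with the value that you intend
    # to use for adv:IngestSharedPath.
    print(shared_path_is_usable("/mnt/ingest-shared"))
```

The check only shows that the current node can write to the folder. It does not confirm that every node resolves the path to the same shared location.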