AutoCategorizeGenerator
A processor that identifies potential categories in the IDOL documents that it sees.
The processor allows indexing of documents into an IDOL Content component in batches (you can configure the size of the batches). After Content has fully indexed a batch, this processor attempts to cluster the documents in the index. If clustering reaches a stable state, the processor creates categories that the AutoCategorizeLabeller processor can use to categorize documents.
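The batch-and-stabilize behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not the processor's implementation. It assumes that each clustering step yields a set of cluster identifiers, and that a cluster "persists" when the same identifier appears in two consecutive steps. The function names `is_stable` and `run` are hypothetical. The parameters mirror the Stability Factor, Batch Size, and Max Unstable Docs properties.

```python
def is_stable(previous, current, stability_factor=0.8):
    """Return True if enough of the previous step's clusters persist.

    Assumption: clusters are compared by identifier; the real processor
    may compare clusters differently.
    """
    if not previous:
        return False  # first step: nothing to compare against yet
    persisted = len(previous & current)
    return persisted / len(previous) >= stability_factor


def run(cluster_steps, batch_size=1000, stability_factor=0.8,
        max_unstable_docs=10000):
    """Walk through one cluster set per indexed batch.

    Returns a tuple (outcome, docs_seen), where outcome is "stable",
    "abandoned", or "pending" (ran out of batches before either happened).
    """
    previous = set()
    docs_seen = 0
    for current in cluster_steps:
        docs_seen += batch_size  # one full batch indexed before clustering
        if is_stable(previous, current, stability_factor):
            return "stable", docs_seen
        if docs_seen >= max_unstable_docs:
            return "abandoned", docs_seen
        previous = current
    return "pending", docs_seen
```

For example, under these assumptions, two consecutive steps that produce the same ten clusters count as stable after the second batch, while ten batches of entirely different clusters reach the default Max Unstable Docs limit and the process is abandoned.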
Properties
Name | Default Value | Description |
---|---|---|
Category Host | | The host of the IDOL Category component to use to categorize documents. |
Category Port | | The port of the IDOL Category component to use to categorize documents. |
Min Categories | 9 | The minimum number of categories to create (this value is also the minimum number of clusters to identify in the ingested documents). |
Stability Factor | 0.8 | The minimum proportion of clusters (between 0.0 and 1.0) that must persist across consecutive steps of the clustering process for it to be regarded as stable. |
Batch Timeout | 60 | The maximum number of seconds allowed for an index document batch to complete. |
Batch Size | 1000 | The number of documents in an index document batch. |
Max Unstable Docs | 10000 | The number of documents after which the process is abandoned if clustering is not stable. |
Distributed Cache Service | | The identifier of the Distributed Cache Service used to communicate state between the AutoCategorizeGenerator processor and the AutoCategorizeLabeller processor. |
NOTE: The template does not include the DistributedMapCacheServer controller service that is required to make it work. You must set one of these up yourself in your NiFi ingestion process, and configure both processors to use it.
The DistributedMapCacheServer is a standard NiFi service. For more information, see https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-distributed-cache-services-nar/1.5.0/org.apache.nifi.distributed.cache.server.map.DistributedMapCacheServer/index.html.
Relationships
Name | Description |
---|---|
success | Successfully processed FlowFiles are routed to this relationship. |