DocumentEmbeddingsServiceImpl

A service that is used by the DocumentEmbeddings processor to generate embeddings for the text in your documents.

To use this service, you must have an embeddings model. You can create one by using the export_transformers_model.py script, which is included in the IDOL Document Embeddings package. See Generate an Embeddings Model.

Properties

None of the properties has a default value.

Idol License Service
    An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.

Model Path
    The path to the model file to use to generate embeddings. You create a model file by using the export_transformers_model.py script, which is in the IDOL Document Embeddings package.

Tokenizer Path
    The path to the tokenizer file for your embeddings model. You create the tokenizer file together with the model file by using the export_transformers_model.py script.

Model Max Sequence Length
    The maximum chunk size permitted by the model that you use to generate embeddings.

Model Minimum Final Sequence Length
    The minimum length of the final chunk of text used to generate embeddings, when multiple embeddings are required for a piece of text.

Model Sequence Overlap
    The length of overlap required for text used to generate successive embeddings, when multiple embeddings are required for a piece of text.
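The three sequence-length properties interact when a piece of text is longer than the model's maximum sequence length. The following is a minimal sketch of one plausible chunking scheme; the function name and the exact algorithm are assumptions for illustration, not the processor's actual implementation:

```python
def chunk_token_ids(token_ids, max_len, overlap, min_final_len):
    """Split a token sequence into overlapping chunks.

    max_len       -- Model Max Sequence Length
    overlap       -- Model Sequence Overlap
    min_final_len -- Model Minimum Final Sequence Length

    Illustrative only; the processor's real chunking may differ.
    """
    if len(token_ids) <= max_len:
        return [token_ids]  # short text: a single embedding suffices
    step = max_len - overlap  # each chunk starts this far after the last
    chunks = []
    start = 0
    while True:
        end = start + max_len
        if end >= len(token_ids):
            # Final chunk: shift its start back if it would otherwise be
            # shorter than the minimum final sequence length.
            remaining = len(token_ids) - start
            if remaining < min_final_len:
                start = max(0, len(token_ids) - min_final_len)
            chunks.append(token_ids[start:])
            break
        chunks.append(token_ids[start:end])
        start += step
    return chunks
```

For example, ten tokens with a maximum length of 4 and an overlap of 1 produce three chunks, each sharing one token with its neighbor.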

Generate an Embeddings Model

The sentence transformer models are vector models available from Hugging Face. The IDOL Document Embeddings ZIP package includes a Python script that you can use to convert such a model into a format that NiFi can use to generate your embeddings.

You can choose any sentence-transformer model that uses a sentencepiece tokenizer.

The script export_transformers_model.py is included in the tools directory of the IDOL Document Embeddings ZIP package. This directory also includes a requirements.txt file to allow you to install the necessary dependencies for the script.

To create your model

  1. Install the requirements for the export_transformers_model.py script by using pip with the requirements.txt file. For example:

    pip install -r requirements.txt
  2. Run the export_transformers_model.py script with the following arguments:

     model
         The model to download from Hugging Face.

     model-type
         The type of model to create. For embedding generation, set this argument to sentence-transformer.

     You can also optionally set the following arguments:

     output
         The file name to use for the generated model file. The default value is model.pt.

     output-spiece
         The file name to use for the sentencepiece tokenizer file. The default value is spiece.model.

     cache
         The location of the cache for model downloads. The default value is .cache.
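Putting the arguments together, an invocation might look like the following. The flag syntax (--model, --model-type, and so on) and the Hugging Face model name are assumptions for illustration; check the script's --help output for the exact form:

```shell
# Assumed flag names and model name; run
# `python export_transformers_model.py --help` to confirm.
python export_transformers_model.py \
    --model sentence-transformers/sentence-t5-large \
    --model-type sentence-transformer \
    --output sentence-t5-large.pt \
    --output-spiece spiece.model \
    --cache .cache
```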

    When the script finishes, it outputs the name and location of the model and tokenizer files that it creates. You use these values in your configuration. For example:

    ModelPath: sentence-t5-large.pt
    TokenizerPath: spiece.model