Answer Server

Answer Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.

24.3.0

New Features

  • A new answer system has been added for Retrieval Augmented Generation (RAG). RAG combines the generative capabilities of large language models (LLMs) with a retrieval system that ensures the generated answers come from a verifiable source. RAG finds answers in a body of data, and then uses the natural language generation of the LLM to create a human-readable answer.

    You configure an Answer Server RAG system in a similar way to a passage extractor LLM system, with configurations for the LLM modules that you use to generate embeddings for queries to find candidate documents, and for the LLM module that you use to generate answers. Answer Server retrieves the summaries from candidate documents in your IDOL index, and combines them with the question in a prompt template that it sends to the LLM.

    You can use a Python or Lua script to connect to the LLM and return the generated answers to Answer Server.
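
    The following configuration sketch shows one way such a system might look. It follows the layout of the passage extractor LLM example in the 24.1.0 notes below; the Type value (RAG), the section names, and all values are illustrative assumptions, so refer to the Answer Server Help for the exact parameters to use.

    [MyRAGSystem]
    // The system type value shown here is an assumption for illustration
    Type=RAG
    // Data store IDOL, as for a passage extractor LLM system
    IdolHost=localhost
    IdolAciport=6002
    // Embeddings settings used to find candidate documents (configured as in
    // the 24.1.0 vector query example)
    AnswerCandidateEmbeddingsSettings=VectorSettings
    // LLM module that generates the answer from the prompt template
    ModuleID=RAGGenerativeModule

    [RAGGenerativeModule]
    // Connect to the LLM through a Python script, by using the Type, Script,
    // and RequirementsFile parameters described in the next feature
    Type=GenerativePython
    Script=./scripts/rag_generate.py
    RequirementsFile=./scripts/requirements.txt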

    For more information, refer to the Answer Server Help.

  • The following improvements have been made to the LLM configuration that you can use to generate embeddings or perform generative or extractive question answering (see the configuration sketch after this list):

    • You can now perform embedding generation and generative question answering by using a Python script, in the same way as for Lua. To use this option, you set the Type parameter to Python in your embedding generation configuration, or GenerativePython in your LLM Module configuration. You must also set Script to the path to the Python script to use. You can also set RequirementsFile to the path to your requirements.txt file.

    • You can now use a specific revision of a model from Hugging Face, by setting the new ModelRevision parameter in your model configuration.

    • You can now use a private model from Hugging Face, by setting the new AccessToken parameter in your model configuration.

    • You can now use only an offline (cached) version of your model, rather than downloading the latest, by setting the new OfflineMode parameter in your model configuration.

    • You can now use an alternative algorithm to generate answers, by setting the new GreedySearch parameter to False. By default, Answer Server uses the greedy search algorithm, which selects the token with the highest probability as the next token. When you set GreedySearch to False, Answer Server uses a multinomial sampling algorithm, which chooses a random token based on a probability distribution and can give better results for long sentences.

    • You can now set a limit on the amount of data to generate embeddings for, by setting the new DataLimit parameter in your model configuration.
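
    For example, the following sketch combines several of these options. The section names, the model name, and the values are placeholders, and the Type values follow the examples elsewhere in these notes; refer to the Answer Server Help for the module types and values that apply to your setup.

    [EmbeddingsSystem]
    // Embedding model downloaded from Hugging Face
    Type=Transformer
    ModelName=example-org/example-embedding-model
    // Use a specific revision of the model
    ModelRevision=main
    // Access token for a private model (placeholder value)
    AccessToken=hf_exampletoken
    // Use only the locally cached copy of the model
    OfflineMode=True
    // Limit the amount of data to generate embeddings for
    DataLimit=100000

    [GenerativeModule]
    // Generative question answering through a custom Python script
    Type=GenerativePython
    Script=./scripts/generate_answers.py
    RequirementsFile=./scripts/requirements.txt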

    For more information, refer to the Answer Server Help.

Resolved Issues

  • Using generative models with a very short input (fewer than 2 tokens for generative question answering) caused an error.

24.2.0

New Features

  • The method for using LLMs to generate embeddings and to perform generative and extractive question answering has been improved.

    Rather than create your own model files by using a script, Answer Server can now download and cache the models directly from Hugging Face. This change means that Answer Server supports a wider range of models, and no longer requires the Python script or external Python libraries.

    To use LLMs, you must now use the ModelName parameter in your LLM configuration to specify the model to use from Hugging Face. You can also optionally set CacheDirectory to the location that Answer Server must use to store the cached model files.
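
    For example, a minimal sketch (the section name follows the 24.1.0 example below, and the model name and directory are placeholders):

    [EmbeddingsSystem]
    Type=Transformer
    // Model to download and cache from Hugging Face
    ModelName=example-org/example-embedding-model
    // Optional location for the cached model files
    CacheDirectory=./modelcache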

    For information about the models that you can use for different configurations, refer to the Answer Server Help.

    IMPORTANT: As part of this change, the ModelPath and TokenizerPath parameters have been removed, and are no longer supported.

  • You can now control the precision of the embeddings that Answer Server generates to run vector queries, by using the EmbeddingPrecision parameter in the configuration section for your embedding system. This parameter sets the number of decimal places to use in the embedding values.

  • You can now configure embedding models and generative LLMs to use a CUDA-compatible GPU device, by setting the new Device configuration parameter in the model configuration.
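
    For example, the following sketch adds both of the new parameters described above to an embedding model configuration (the model name and values are placeholders, and the Device value syntax is an assumption; refer to the Answer Server Help for the supported values):

    [EmbeddingsSystem]
    Type=Transformer
    ModelName=example-org/example-embedding-model
    // Number of decimal places to use in each generated embedding value
    EmbeddingPrecision=4
    // Run this model on a CUDA-compatible GPU device
    Device=cuda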

Resolved Issues

There were no resolved issues in this release.

24.1.0

New Features

  • You can now configure Passage Extractor systems to use an LLM to extract or generate answers. To configure these systems, you set the Type parameter to PassageExtractorLLM. As for a standard passage extractor, you must configure the location of an IDOL index to use to find answers, and classifier files that describe the different types of questions.

    For an LLM passage extractor system, you must also configure the location of the model and tokenizer files for the LLM to use to generate or extract answers.

    You can also use these models in a Lua script, for example so that you can access an LLM through an HTTP endpoint.

    For example:

    [passageextractorLLM]
    // Data store IDOL
    IdolHost=localhost
    IdolAciport=6002
    Type=PassageExtractorLLM
    // Classifier Files
    ClassifierFile=./passageextractor/classifiertraining/svm_en.dat
    LabelFile=./passageextractor/classifiertraining/labels_en.dat
    // Module to use
    ModuleID=LLMExtractiveQuestionAnswering-Small
    
    [LLMExtractiveQuestionAnswering-Small]
    Type=ExtractiveQuestionAnsweringLLM
    ModelPath=modelfiles/model.pt
    TokenizerPath=modelfiles/tokenizer.spiece.model

    For more information, refer to the Answer Server Help.

  • When you use a passage extractor LLM system, the Ask action returns a highlighted paragraph in the response metadata to show the passage from which the answer was extracted, so that you can verify automatically generated answers.

  • You can now configure a Passage Extractor or Passage Extractor LLM system to run vector queries against the IDOL Content component to identify candidate documents that might contain answers to an input question. You can use this option when you index vectors in your IDOL Content component and want to use vector search to retrieve answers.

    To use this option, you must set the AnswerCandidateEmbeddingsSettings parameter in your system configuration section to the name of a configuration section where you configure the Content vector field and the embeddings configuration that Answer Server uses to generate embeddings to send to Content. For example:

    [PassageExtractorSystem]
    idolhost=localhost
    idolaciport=6002
    type=passageextractor
    ...
    AnswerCandidateEmbeddingsSettings=VectorSettings
    
    [VectorSettings]
    EmbeddingsConfig=EmbeddingsSystem
    VectorField=VECTORA
    
    [EmbeddingsSystem]
    Type=Transformer
    ModelPath=path/to/model.pt
    TokenizerPath=path/to/tokenizer.spiece.model
    ModelMaxSequenceLength=128

    For more information, refer to the Answer Server Help.

Resolved Issues

There were no resolved issues in this release.

23.4.0

There were no new features or resolved issues in this release.

23.3.0

There were no new features or resolved issues in this release.

23.2.0

There were no new features or resolved issues in this release.