C API Advanced Programming Tutorial

This tutorial follows on from the C API Programming Tutorial.

This tutorial helps you to:

  • familiarize yourself with more advanced Filter SDK functionality.

  • adapt the previous sample program to work on streams, rather than files.

NOTE: This tutorial assumes that you have already completed both Getting Started with KeyView Filter and the C API, and the C API Programming Tutorial.

Setup

Download the following resources for this tutorial:

Using a Custom Stream

Until now, you have worked with KeyView operating on files on disk. In some cases you might want to get KeyView to operate on streams instead. For example, you might want to use KeyView in stream mode when:

  • The file you are dealing with is in-memory, because it was output by another operation. You can use a custom input stream to read the file directly from memory, instead of writing it out to a file first.

  • You want to get the filtered text in small chunks instead of all at once. This approach has the following advantages:

    • You can process the output data in parallel with filtering the rest of the text. Parallel processing can minimize the time it takes to filter and process the text.

    • You can choose to stop filtering when the application has all the text it needs, which can save valuable resources. This approach is called partial filtering.

  • You want to extract subfiles into memory, instead of storing them on disk.

  • You do not have the whole file available to begin with. In this case, you can use a custom input stream to retrieve only the required parts of the file as KeyView requests them.

Defining a Custom Input Stream

You can implement a custom stream by filling out a KVInputStream structure with functions that perform the appropriate actions. Each of these functions is equivalent to its ANSI counterpart (fopen, fread, and so on), except that several functions return a BOOL rather than an error code.

To illustrate how to use a custom stream, the following example defines a very simple stream that forwards to the file-based operations.

typedef struct
{
    const char* filename;
    FILE* fp;
} StreamInfo;


BOOL pascal streamOpen(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    { 
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    
    if (info->fp == NULL)
    {
        info->fp = fopen(info->filename, "rb");
    }

    if (info->fp)
    {
        fseek(info->fp, 0, SEEK_SET);
    }

    return info->fp != NULL;
}

UINT pascal streamRead(KVInputStream* stream, BYTE * buffer, UINT size)
{
    if(!stream || !stream->pInputStreamPrivateData)
    { 
        return 0;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    // fread returns size_t; size is a UINT, so the count always fits in a UINT
    return (UINT)fread(buffer, 1, size, info->fp);
}

BOOL pascal streamSeek (KVInputStream* stream, long offset, int whence)
{
    if(!stream || !stream->pInputStreamPrivateData)
    { 
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;

    return fseek(info->fp, offset, whence) == 0;
}

long pascal streamTell(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    { 
        return -1;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;

    return ftell(info->fp);
}

BOOL pascal streamClose(KVInputStream* stream)
{
    if(!stream || !stream->pInputStreamPrivateData)
    { 
        return FALSE;
    }
    StreamInfo* info = (StreamInfo*)stream->pInputStreamPrivateData;
    int retval = fclose(info->fp);
    info->fp = NULL;
    
    return retval == 0;
}

StreamInfo info = {pathToInputFile, NULL};
KVInputStream stream;
stream.pInputStreamPrivateData = &info;
stream.lcbFilesize = 0;
stream.fpOpen = streamOpen;
stream.fpRead = streamRead;
stream.fpSeek = streamSeek;
stream.fpTell = streamTell;
stream.fpClose = streamClose;

If you know the size of the document when you create the stream, set the lcbFilesize member to that size. Doing so can reduce the number of seeks required, because KeyView does not need to seek to the end of the file to determine the size.

When you do not know the size, you must set this member to zero. Not setting this member results in undefined behavior.
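
The in-memory use case described earlier could be served by a stream like the following minimal sketch. MemoryStreamInfo and the memory* helper names are hypothetical and are not part of the KeyView API; they use plain C types so that the block compiles on its own. The assumption is that thin wrappers with the KVInputStream callback signatures shown above would forward to these helpers, passing pInputStreamPrivateData as the MemoryStreamInfo pointer. Because the buffer size is known up front, lcbFilesize could be set to it.

```c
#include <limits.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>  /* memcpy */

/* Hypothetical private data for a stream backed by a memory buffer. */
typedef struct
{
    const unsigned char* data; /* start of the in-memory document */
    long size;                 /* total number of bytes in the buffer */
    long pos;                  /* current read position */
} MemoryStreamInfo;

/* Rewind to the start, as streamOpen does with fseek(fp, 0, SEEK_SET).
   Returns 1 on success and 0 on failure (BOOL-style). */
int memoryOpen(MemoryStreamInfo* info)
{
    if (!info || !info->data || info->size < 0) return 0;
    info->pos = 0;
    return 1;
}

/* Copy up to size bytes and return the number actually copied, like fread. */
unsigned int memoryRead(MemoryStreamInfo* info, unsigned char* buffer, unsigned int size)
{
    long remaining;
    if (!info || !info->data || !buffer) return 0;
    remaining = info->size - info->pos;
    if (remaining <= 0) return 0; /* at or past the end of the buffer */
    if (size > (unsigned long)remaining) size = (unsigned int)remaining;
    memcpy(buffer, info->data + info->pos, size);
    info->pos += (long)size;
    return size;
}

/* Move the read position. Returns 1 on success and 0 on failure. */
int memorySeek(MemoryStreamInfo* info, long offset, int whence)
{
    long base;
    if (!info) return 0;
    switch (whence)
    {
        case SEEK_SET: base = 0; break;
        case SEEK_CUR: base = info->pos; break;
        case SEEK_END: base = info->size; break;
        default: return 0;
    }
    if (offset < 0 && offset < -base) return 0;           /* before the start */
    if (offset > 0 && offset > LONG_MAX - base) return 0; /* would overflow */
    info->pos = base + offset;
    return 1;
}

/* Report the current read position, or -1 on error, like ftell. */
long memoryTell(const MemoryStreamInfo* info)
{
    return info ? info->pos : -1;
}
```

Like fseek, this sketch allows seeking past the end of the buffer; a later read then returns zero bytes. A close function for this kind of stream would have nothing to release unless the stream owns the buffer.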

Opening a Document From a Stream

After you define a stream, you can create a KVDocument from the stream by calling fpOpenDocumentFromStream(). This KVDocument functions in the same way as a document created using fpOpenDocumentFromFile().

You must not open a second document from a stream until you have closed the first document.

KVDocument document = NULL;

error = filter.fpOpenDocumentFromStream(session, &stream, &document);
if (error != KVError_Success)
{
    return error;
}

// Pass document to KeyView functions

filter.fpCloseDocument(document);

Extracting Subfiles Using Streams

KeyView lets you access subfiles as streams, rather than needing to extract them to disk.

To access a subfile as a stream, use the fpOpenSubFile() function, rather than using fpExtractSubFile(). The KVExtractSubFileArg structure is the same as for files (see Extracting Subfiles).

KVInputStream* substream = NULL;
error = extract.fpOpenSubFile(fileContext, &extractArg, &substream);

// Use subfile stream
extract.fpCloseSubFile(substream);

You can use the fpGetExtractInfo() function to retrieve the KVSubFileExtractInfo structure associated with the subfile, and the fpGetExtractStatus() function to return more information about any errors encountered when using the subfile stream.

When you pass this stream back into the KeyView Filter interface, you must use a different session from the one that you used to call fpOpenSubFile(). Because initializing a new session can incur a performance cost, OpenText recommends that you create this second session once, and then reuse it for each subfile.

Filtering Text Using Streams

For some use cases, you might not need all the text from the file, or you might want to analyze the text in small pieces. The fpFilter() function outputs the text in chunks, by filling out a KVFilterOutput structure. You must also free this structure by using the fpFreeFilterOutput() function.

The end of the stream is indicated by an empty KVFilterOutput structure. You do not need to free the empty structure.

By requesting text in chunks, a multi-threaded application can often filter and process all the text from a file in a shorter time. One thread passes each chunk to downstream processing on another thread, and then continues to get the next chunk from the stream.

Partial Filtering

You might want to stop processing before you have filtered the entire file, for example because you have already found a search term, or because you have hit a resource threshold. You can safely stop processing, as long as you still call fpFreeFilterOutput() and fpCloseDocument().

You can optionally keep track of how many bytes have been output, by accumulating the cbText field of KVFilterOutput.

uint64_t totalSize = 0;

while(1)
{
    KVFilterOutput output = {0};
 
    error = filter->fpFilter(document, &output);
     
    if(error != KVError_Success)
    {
        return error;
    }
 
    if(output.cbText == 0)
    {
        break;
    }

    totalSize += output.cbText;

    //Use filter output

    filter->fpFreeFilterOutput(session, &output);
}            
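
The loop above runs to the end of the text. The early-stop case could look like the following sketch, which stops after a byte budget is reached. To keep the block self-contained, it uses hypothetical stand-ins rather than the KeyView types: Chunk stands in for KVFilterOutput, and the next and release callbacks stand in for fpFilter() and fpFreeFilterOutput(). ArraySource is a demonstration source only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one chunk of filtered text. */
typedef struct
{
    const char* text;
    size_t cbText;
} Chunk;

/* Hypothetical callbacks: next fills *out and returns 0 on success;
   an empty chunk (cbText == 0) marks the end of the text. */
typedef int (*NextChunkFn)(void* source, Chunk* out);
typedef void (*FreeChunkFn)(void* source, Chunk* chunk);

/* Read chunks until the text ends or at least maxBytes have been seen.
   Returns 0 on success and stores the number of bytes seen in *bytesSeen. */
int readWithBudget(void* source, NextChunkFn next, FreeChunkFn release,
                   size_t maxBytes, size_t* bytesSeen)
{
    *bytesSeen = 0;
    for (;;)
    {
        Chunk chunk = {0};
        int error = next(source, &chunk);
        if (error != 0) return error;
        if (chunk.cbText == 0) break;      /* end of text: nothing to free */

        *bytesSeen += chunk.cbText;
        /* Use chunk.text here. */

        release(source, &chunk);           /* always free before stopping */
        if (*bytesSeen >= maxBytes) break; /* budget reached: stop early */
    }
    return 0;
}

/* Demonstration source that serves chunks from a fixed array
   and counts how many chunks were freed. */
typedef struct
{
    const char** chunks;
    size_t count;
    size_t next;
    size_t freed;
} ArraySource;

int arrayNext(void* source, Chunk* out)
{
    ArraySource* s = (ArraySource*)source;
    if (s->next < s->count)
    {
        out->text = s->chunks[s->next++];
        out->cbText = strlen(out->text);
    }
    else
    {
        out->text = NULL;
        out->cbText = 0;
    }
    return 0;
}

void arrayFree(void* source, Chunk* chunk)
{
    ((ArraySource*)source)->freed++;
    chunk->text = NULL;
    chunk->cbText = 0;
}
```

The point of the sketch is the ordering: each non-empty chunk is released before the loop decides whether to stop, so an early exit never leaves a chunk unfreed. In a real application the document would still need to be closed afterwards.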

Conclusion

After you have completed the C API Programming Tutorial, and this more advanced tutorial, you should have a good understanding of the KeyView Filter SDK C API, allowing you to automatically detect the file format and extract metadata, text, and subfiles.