File System Connector
The File System Connector retrieves files from a file system.
When you configure a File System Connector, you specify the folders that contain your files. You can also configure filters so that the connector will retrieve or ignore specific files.
The File System Connector differs from most other connectors because it doesn't download or copy any files. Instead, it crawls the file system and builds a list of files. This process is very quick (about 1000 documents per second).
To get the list of files, the connector uses standard system API calls. For example, when processing a normal directory or even an SMB network share, a File System Connector that is installed on Windows will make calls to the Windows API. When the connector encounters a directory, it processes its contents (and continues recursively if needed) before returning to the parent level. This means that some of the lowest-level directories might be processed first. If you are expecting top-level files to be indexed, please wait until the connector has made more progress. Within a directory, the files are processed in the order that they are returned, which can vary between platforms.
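The traversal order described above can be illustrated with a minimal Python sketch. This is not the connector's implementation; the function name crawl is hypothetical, and the sketch assumes that a subdirectory is entered as soon as it is encountered. It shows why files deep in the tree can appear before files at the top level.

```python
import os


def crawl(root):
    """Yield file paths under root, depth-first (illustrative only).

    Entries are visited in whatever order the operating system returns them.
    The contents of a subdirectory are fully processed before the walk
    returns to the parent level.
    """
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Descend immediately, before the parent's remaining entries.
                yield from crawl(entry.path)
            elif entry.is_file(follow_symlinks=False):
                yield entry.path
```

Because the order of entries within a directory depends on the platform, the position of top-level files in the output is not predictable, but all files beneath one subdirectory are always produced together.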
The connector stores information in the task datastore about the files it has successfully processed. If you are using the standalone connector, this information is recorded when the connector has sent a document to CFS. If you are using the NiFi connector, this information is recorded when the document has been released to the output relationship. Because the connector records files only after they are processed successfully, any that fail are automatically retried when the connector runs the next synchronize cycle.
In subsequent synchronize cycles, the connector performs the same directory traversal. Any files that it hasn't seen before are ingested. When the connector encounters a file that it has seen before, the file is processed again only if it has been modified. (In other words, after the first synchronize cycle the connector processes changes incrementally, which improves ingestion performance.) If you want to process all files again, then you can purge the datastore so that the connector has to perform a full synchronize. In this case the connector has to rebuild its information about what exists in the file system.
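The following Python sketch illustrates the incremental behavior. It is a simplification under stated assumptions: the datastore is modeled as a dictionary, a file counts as modified when its last modified time changes (the actual change detection used by the connector is not described here), and the names synchronize and ingest are hypothetical.

```python
import os


def synchronize(paths, datastore, ingest):
    """Ingest new or modified files and return the paths that were processed.

    datastore maps a path to the modified time (in nanoseconds) of the
    version that was last ingested successfully. ingest(path) returns True
    on success.
    """
    processed = []
    for path in paths:
        mtime = os.stat(path).st_mtime_ns
        if datastore.get(path) == mtime:
            continue  # seen before and unchanged, so skip it
        if ingest(path):
            # Record the file only after success, so failures are retried
            # in the next cycle.
            datastore[path] = mtime
            processed.append(path)
    return processed
```

Clearing the dictionary corresponds to purging the datastore: the next call processes every file again.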
Storing this information also allows the connector to identify files that were deleted between synchronize cycles and are no longer present, so that the corresponding documents can be deleted from the IDOL index. The connector issues a delete command for each of the deleted files, and these commands are added to the connector's ingest queue. If a certain percentage of files are found to be deleted (see the configuration parameter MaxDeletionPercentage), then the connector does not issue any delete commands. This prevents documents being removed unexpectedly, for example when a network share becomes unavailable.
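The deletion safeguard can be sketched in Python as follows. The function plan_deletions is hypothetical, and the sketch assumes that deletes are suppressed when the percentage of missing files reaches or exceeds the limit; check the MaxDeletionPercentage reference for the exact comparison that the connector uses.

```python
def plan_deletions(known_paths, seen_paths, max_deletion_percentage):
    """Return the paths to delete, or an empty list if too many are missing.

    known_paths: files recorded in the datastore by earlier cycles.
    seen_paths: files found during the current traversal.
    """
    known = set(known_paths)
    missing = sorted(known - set(seen_paths))
    if not known:
        return []
    deleted_percentage = 100 * len(missing) / len(known)
    # Assumption: reaching the limit is enough to suppress all deletes.
    if deleted_percentage >= max_deletion_percentage:
        return []
    return missing
```

For example, if a network share is unavailable, every known file appears to be missing, the percentage is 100, and no delete commands are planned.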
Security Information on Windows
The connector extracts security information (Access Control Lists, or ACLs, which specify the users and groups who are permitted to read a file).
The connector expands local groups (groups that are created on the machine hosting the file, rather than domain groups) so that they are respected by IDOL.
To extract security information on Windows, the connector requires additional permissions. The connector uses the Windows API to get the ACLs for files (including the parent directory permissions).
These API calls return SIDs, which must be converted to user or group names. The connector uses the Windows API to perform the conversion, and Windows makes a request to the domain controller that is configured on the machine where the connector is installed. In some cases you might need to use an alternate domain controller (for example if you are indexing a network share that is in a different domain to the connector machine). You can do this by setting the AlternateDomain parameter. Microsoft provide a tool called psgetsid which performs the same lookups as the connector, so you can use this to help troubleshoot any problems. The connector also has a parameter named DebugSecurity that you can set to TRUE, if you want the connector to log more detailed information about ACL generation.
Using the Connector with CFS
By default, your CFS processes the original files that were crawled by the connector. This means that CFS needs the same permissions as the connector.
TIP: If CFS cannot read files from their original location, you should set either IngestDataPort or IngestSharedPath in the [Ingestion] section of the File System Connector configuration file. This might be necessary when you install the connector and CFS on separate machines.
- When you set IngestDataPort, the connector sends the files to the CFS data port. The data port is configured by the DataPort parameter in the [Server] section of the CFS configuration file.
- When you set IngestSharedPath, the connector copies the files to a shared folder. You must then ensure that CFS can access this location.
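As an illustration of the first option, the two configuration files might contain fragments like the following. The port number 7001 is a placeholder chosen for this example, and the sketch assumes that IngestDataPort must match the CFS DataPort value; any other settings that your [Ingestion] and [Server] sections already contain are omitted.

In the File System Connector configuration file:

```ini
[Ingestion]
IngestDataPort=7001
```

In the CFS configuration file:

```ini
[Server]
DataPort=7001
```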
The connector does not alter any of the times set on a file, including the last accessed time, because it does not access the files. However, CFS and KeyView do access files. To minimize disruption to other software (for example archive or backup software), CFS reads the times set on a file and then restores the times back to their original values after it has finished.
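The general technique of saving and restoring file times can be sketched in Python. This shows the idea only and is not how CFS is implemented; the function name read_preserving_times is hypothetical, and the sketch restores only the access and modification times.

```python
import os


def read_preserving_times(path):
    """Read a file, then put back its original access and modified times."""
    before = os.stat(path)  # capture the times before the file is opened
    with open(path, "rb") as handle:
        data = handle.read()
    # Restore the times, because reading can update the last accessed time.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return data
```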
CFS usually requires much longer to process files than the connector (KeyView filtering takes much longer than reading the file name, for example). As a result, the connector can be blocked by CFS. The connector sends batches of documents (see the parameter IngestBatchSize) to CFS for processing. When the CFS queue is full, it blocks the connector from sending more items. Documents then build up in the connector's ingest queue. Eventually that queue might also become full, in which case the connector stops processing until CFS allows it to resume sending documents.
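The blocking behavior can be modeled with a bounded queue. The Python sketch below is a simplified, single-threaded illustration: the CFS queue is represented by a queue.Queue with a maximum size, and the function send_batches is hypothetical. It does not reflect the actual protocol between the connector and CFS.

```python
import queue


def send_batches(documents, batch_size, cfs_queue):
    """Send documents to a bounded queue in batches.

    Returns the documents that could not be sent because the queue was
    full. In the real system these would wait in the connector's own
    ingest queue.
    """
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        try:
            cfs_queue.put_nowait(batch)
        except queue.Full:
            # The receiver is still busy, so sending stops here.
            return documents[start:]
    return []
```

With a queue that holds two batches and a batch size of three, sending ten documents leaves four unsent until the receiver takes a batch from the queue.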
When CFS accesses a file, this can cause your anti-virus program to perform a scan, so for performance reasons OpenText recommends disabling any anti-virus applications.
Troubleshoot File System Connector
If the File System Connector does not return the data that you expect, check that:
- the connector is configured to retrieve the correct files. Check that the files are not being excluded by the configuration parameters that you have set, for example by matching (or not matching) a regular expression.
- the connector has permission to read the files.
  Be aware that the permissions granted to the connector can depend on the method you use to start it. If you run the connector from the command line or by double-clicking the executable file, then the connector will run as (and with the permissions of) the current logged-in user. If you run the connector as a service then, by default, the connector runs as a system service. You might need to run the connector using a different user account that has access to the files. See Configure User Permissions. Because the connector uses standard system calls, a good way to identify a permissions issue is to log in as the connector user and check that your files can be seen in the output of the dir (Windows) or ls (Linux) commands.
  If you are running the File System Connector in NiFi, the connector will run as the user who is running the NiFi instance.
  TIP: If you are running the connector on Windows, you can set the parameters ReadUsername and ReadPassword to read files with a different user account to the user who is running the connector.
- the files are not read-only. The File System Connector can retrieve files that are read-only. However, KeyView cannot process read-only .nsf and .pst files. To resolve this issue, you can:
  - remove the read-only attribute from these files (the file contents are not modified).
  - configure CFS to copy the files to a temporary directory (the temporary copies are not read-only). To do this, set the WorkingDirectory parameter in the CFS configuration file.
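If you prefer a scripted version of the dir or ls permissions check, a short Python sketch such as the following can be run while logged in as the connector user. The function check_readable is hypothetical and is not part of the connector. It lists one folder and attempts to open each file in it for reading.

```python
import os


def check_readable(folder):
    """Report whether the current user can list a folder and read its files."""
    try:
        names = sorted(os.listdir(folder))
    except OSError as error:
        return {"listable": False, "error": str(error), "unreadable": []}
    unreadable = []
    for name in names:
        path = os.path.join(folder, name)
        if os.path.isdir(path):
            continue  # only files are checked, subfolders are skipped
        try:
            with open(path, "rb"):
                pass  # opening for reading is enough for this check
        except OSError:
            unreadable.append(name)
    return {"listable": True, "error": None, "unreadable": unreadable}
```

A result where listable is False, or where unreadable contains file names, suggests that the account does not have the access the connector needs.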