The sample sessions are broken out into three parts: one to add the demo sampling tables to your XDB database TUTORIAL location, one to perform data inventory, and one to perform the actual sampling.
Note: Sessions in this tutorial must be done in order.
In this session, you prepare your sampling data by adding tables to your XDB database TUTORIAL location, which is necessary to complete the sampling exercises.
cd InstallationDirectory\mfsql\bin
where InstallationDirectory specifies the directory where Data Express 4.0 is installed, for example, c:\Program Files\Micro Focus\Data Express 4.0.
xwiz40n /b sampling.sql
In this session, you prepare your Data Express environment for the sampling data stores and add metadata for the sampling data stores into the Knowledge Base:
Data Builder lets you set your workspace as part of the Data Inventory process. You must specify your workspace to organize and sort data on the basis of a logical model:
code in the Company Code box.
During the definition of a company, the backup company associated with it can also be defined.
A Backup Company is useful when all the information concerning the files that undergo the Life Cycle procedure needs to be preserved.
Click OK to save your specification and to close the window.
Continue to the next section.
Perform this procedure to map and load the definitions from your data stores into the XDB Knowledge Base. The information from your data stores will be mapped to your workspace.
Load data store information into the Knowledge Base:
An Insert tables window appears, followed by an Information window. After the metadata has fully loaded, an Information window appears that displays data store information including the number of data stores.
The data for this session is actually located in an XDB database that resides on the Windows machine where you are running your Data Express client software. Typically when this is the case, you would simply execute sampling by clicking Start in the Distributed Sampler window; there is no need to export sampling configuration, execute sampling, and import results.
In this session, however, sampling is executed as it would be in a UNIX environment. Therefore, the sampling configuration must be exported, sampling executed, and the results imported.
In this session, you:
Perform this procedure to verify your sampling configuration settings and to export information about the data store to output files.
Enabling sampling changes the Classification number value for the data store to 1.
Classification numbers are used to restrict the data stores that get sampled. By default, all data stores are assigned a Classification number of 0, which means that sampling is disabled. When sampling the data stores within your work environment, all enabled data stores with Classification number values of 1 will be sampled.
Classification numbers can be changed in the Properties window to provide further granularity in restricting which data stores get sampled, based on user-defined criteria. The data store remains enabled as long as the Classification number is not 0. However, enabling sampling does not by itself make the data store a candidate for sampling; that is controlled by the Classification number in the Distributed Sampler.
If the Classification number set for the data store is less than or equal to the Classification number specified in the Distributed Sampler, the data store will be sampled. Likewise, if the data store Classification number is greater than the Classification number in the Distributed Sampler, the data store will not be sampled.
Tip: You can assign classification numbers to your data stores based on how often you want to sample data. For example, 1 could represent daily sampling, 2 could represent weekly sampling, and 3 could represent monthly sampling.
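The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not Data Express code: the function name is_sampled and the data store names are hypothetical, and Classification numbers are assumed to be non-negative integers.

```python
def is_sampled(store_class: int, sampler_class: int) -> bool:
    """Return True if a data store would be a sampling candidate.

    Hypothetical helper that mirrors the rule in the text:
    a Classification number of 0 disables sampling; otherwise the
    data store is sampled when its Classification number is less
    than or equal to the one set in the Distributed Sampler.
    """
    if store_class == 0:
        return False
    return store_class <= sampler_class


# Hypothetical data stores classified as in the Tip above
# (1 = daily, 2 = weekly, 3 = monthly, 0 = sampling disabled).
stores = {
    "DAILY_STORE": 1,
    "WEEKLY_STORE": 2,
    "MONTHLY_STORE": 3,
    "DISABLED_STORE": 0,
}

for sampler_class in (1, 2, 3):
    selected = [name for name, cls in stores.items()
                if is_sampled(cls, sampler_class)]
    print(sampler_class, selected)
```

With the Distributed Sampler set to 1, only the data store classified as 1 is selected; raising the value to 3 selects every enabled data store.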
For this exercise, all data stores should be sampling candidates and all sampling options should be utilized.
Notes:
Important: Under normal circumstances, you would need to copy the files sampling.dat and method.rc to the config subdirectory on the machine where your source data store is located. However, in this exercise the files are already in the correct location.
The sampling.log file that was generated during sampling execution must be loaded into the Knowledge Base.
Once you have exported sampling configuration information, executed sampling, and loaded the results into the Knowledge Base, you can view the results.
Compressed sampling produces a fingerprint, which is a unique graphical representation of the distribution of data for the sampled data element. The fingerprint is created when the results for compressed sampling are loaded.
The Data Store Data Elements list in the left pane shows the data elements for the data store you selected.
You can also select the image for the data element in the Data Element Samples grid. To ensure that you have selected the correct one, the related data element name is highlighted in the Data Store Data Elements list.
Sampling results for the numeric data element PRGADDR are displayed.
Known Restriction: Currently, Data Express shows only the first 1000 distinct values and ranges, sorted either numerically or alphanumerically based on the data element type. When there are more than 1000 distinct data element values, the Data Element Value ends at the 1000th unique value instead of the greatest data element value.
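The effect of this restriction can be illustrated with a small Python sketch. It does not reproduce how Data Express stores or displays results; the names DISPLAY_LIMIT and displayed_values are hypothetical, and the sample data is invented for the illustration.

```python
DISPLAY_LIMIT = 1000  # limit stated in the Known Restriction


def displayed_values(values, limit=DISPLAY_LIMIT):
    """Return the distinct values that would be shown, in sorted order.

    Integers sort numerically and strings sort alphanumerically,
    which mirrors sorting by data element type.
    """
    distinct = sorted(set(values))
    return distinct[:limit]


# Invented numeric data element with 1500 distinct values.
values = list(range(1, 1501))
shown = displayed_values(values)

# The last value shown is the 1000th distinct value,
# not the greatest value in the data.
print(len(shown), shown[-1], max(values))  # prints: 1000 1000 1500
```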
Sampling results for alphanumeric data element ADDRESS are displayed.
Review the data element fingerprint in the Zoomed Data Element Sample. The fingerprint is the result of performing compressed sampling.
Section Type | Example | Description |
Character Distribution | (image) | A range of characters in alphabetical order is provided to show how many data elements begin with that character. |
Number Distribution | (image) | A range of numbers is provided to show how many data elements begin with the indicated number. In the provided example, no data elements begin with numbers. |
Type Summary | (image) | An alphanumeric data element can actually be a numeric-only data element or alphanumeric data element. The type summary provides a count of how many data elements fall into either category. In the provided example, all data elements are alphanumeric in type. |
Field Length | (image) | The numbers on the vertical y-axes represent the ranges of values specified. The numbers on the horizontal x-axes represent the position of the actual number in that range. For instance, 90 in the provided example means that there are 90 values with a length equal to 11. |
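To make the four fingerprint sections more concrete, the following Python sketch computes comparable summaries for a handful of invented values. It is an assumption-laden illustration of what each section counts, not the way Data Express builds a fingerprint; the function name summarize and the sample values are hypothetical.

```python
from collections import Counter


def summarize(values):
    """Compute four summaries comparable to the fingerprint sections."""
    # Character Distribution: values grouped by leading letter.
    first_chars = Counter(v[0].upper() for v in values if v and v[0].isalpha())
    # Number Distribution: values grouped by leading digit.
    first_digits = Counter(v[0] for v in values if v and v[0].isdigit())
    # Type Summary: numeric-only values versus alphanumeric values.
    type_summary = Counter(
        "numeric" if v.isdigit() else "alphanumeric" for v in values
    )
    # Field Length: number of values for each length.
    lengths = Counter(len(v) for v in values)
    return first_chars, first_digits, type_summary, lengths


# Invented sample values for an alphanumeric data element.
sample = ["12 Main St", "Oak Avenue", "Elm Road", "42", "Old Mill Ln"]
first_chars, first_digits, type_summary, lengths = summarize(sample)

print(dict(first_chars))
print(dict(first_digits))
print(dict(type_summary))
print(dict(lengths))
```

In this sample, the field length summary reports one value of length 11, in the same way that the table's example reports 90 values of length 11.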
You can create a new class to be used for sampling purposes, or you can use a predefined class. In this exercise you will create a sampling class named SAMPNUM.
The class SAMPNUM is now listed in the List of Classes.
In Data Express, you can correlate data elements if their value distributions are similar. To do this, you must first associate a fingerprint, which represents the desired distribution, to a class.
This fingerprint becomes the prototype that is used to determine the class assignment for other data elements. If the prototype and the fingerprint for a data element are deemed similar based on a calculated confidence value, the data element is also associated with the prototype class.
When you highlight the SAMPNUM class description in the All Classes pane and then drag the fingerprint for the PRGNAME data element to the Class Samples cell, the SAMPNUM class is assigned to the fingerprint. This fingerprint becomes the prototype. This action also associates the SAMPNUM class to the PRGNAME data element.
In this exercise you are associating the class SAMPNUM (which is associated to your sampling prototype) to other data elements with sampling data distributions similar to the prototype fingerprint.
The level of similarity between distributions depends on the thresholds you set when importing the class information. If the prototype and the fingerprint for a data element are deemed similar based on a calculated confidence value, the data element is also associated with the prototype class.
When the fingerprint for a data element is compared to the prototype fingerprint, internal Data Express confidence values are calculated to provide measurements that illustrate the similarities between the data distributions. The internal confidence values are used as input for the threshold formulas.
The value for Threshold 1 represents a percentage where the similar areas in the two fingerprints are weighted more heavily than the dissimilar areas.
The value for Threshold 2 represents a percentage where all areas (both similar and dissimilar) in the two fingerprints are given equal weight.
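The internal confidence values and threshold formulas are not described in this tutorial. The Python sketch below is therefore purely hypothetical: it shows one possible way that two scores, one that favors similar areas and one that weights all areas equally, could be computed from two fingerprints represented as equal-length histograms and then compared to two percentage thresholds. Every function name, the scoring formulas, and the assumption that both thresholds must be met are inventions for illustration only.

```python
def bin_agreement(prototype, candidate):
    """Per-bin agreement in [0, 1] between two equal-length histograms."""
    agreement = []
    for a, b in zip(prototype, candidate):
        highest = max(a, b)
        # Two empty bins are treated as fully similar.
        agreement.append(1.0 if highest == 0 else min(a, b) / highest)
    return agreement


def equal_weight_score(prototype, candidate):
    """Hypothetical score: every bin counts equally (compare Threshold 2)."""
    agreement = bin_agreement(prototype, candidate)
    return 100.0 * sum(agreement) / len(agreement)


def similarity_weighted_score(prototype, candidate):
    """Hypothetical score: similar bins weigh more (compare Threshold 1)."""
    agreement = bin_agreement(prototype, candidate)
    total = sum(agreement)
    if total == 0:
        return 0.0
    # Each bin is weighted by its own agreement value.
    return 100.0 * sum(a * a for a in agreement) / total


def matches_prototype(prototype, candidate, threshold1, threshold2):
    """Assumption: a data element matches only if both thresholds are met."""
    return (similarity_weighted_score(prototype, candidate) >= threshold1
            and equal_weight_score(prototype, candidate) >= threshold2)


prototype = [4, 0]   # invented prototype fingerprint
candidate = [2, 0]   # invented data element fingerprint
print(equal_weight_score(prototype, candidate))
print(similarity_weighted_score(prototype, candidate))
print(matches_prototype(prototype, candidate, threshold1=80, threshold2=70))
```

In this sketch the similarity-weighted score is never lower than the equal-weight score, which is one way to read the difference between the two thresholds; the actual Data Express calculation may differ.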
A sampling class has successfully been assigned to data elements with similar distributions to the prototype.