Perform an extraction


  1. List of definition files
  2. Extraction details
  3. Performing an extraction
    1. From the CMS
    2. From a URL
    3. Creating a new output
  4. Download results
    1. From the CMS
    2. From a URL

To execute an extraction, you must have the 'Execute extractions' right.

List of definition files

In the Home tab, click on the 'Extractions' button:

The 'Extractions' tool opens, listing all existing definition files. Definition files are located in the WEB-INF/param/extraction directory:

A contextual 'Extractions' tab also appears in the ribbon:

Extraction details

Select a definition file and click on the 'Modify' button from the 'Extractions' tab:

The 'Extraction details' tool opens, displaying the various extraction components in tree form. The tool title is the name of the definition file:

Performing an extraction

From the CMS

In the 'Extractions' tool, select a definition file and click on 'Execute' from the 'Extractions' tab. You can also click on the 'Execute' button in the 'Extraction details' tool.

Certain parameters are required to run an extraction. The extraction execution tool opens, allowing you to view the extraction details and enter these parameters:

  • Output: select the output used to render the results of the extraction. See "Creating a new output" below. A "Default" output is available to run the extraction without a style sheet, in XML format.
  • Clause variables: the definition file can define variables that are then used in query clauses. The value of each variable is either a set of contents that you select here, or a Solr query. See how to add clause variables.
  • Optional columns: the definition file can define variables that control optional columns. For each variable, indicate whether the columns that depend on it are displayed in the result. See how to create variables for optional columns.
  • Email address: when the extraction is complete, an email is sent to the address you specify here. If the extraction failed, the email contains a summary of the error.

In the administration area, you can schedule a 'Run extraction' task. The parameters requested are the same as those shown above. Values for query variables and optional columns must be supplied in JSON format.

From a URL

An extraction can be run from a URL: url-du-cms.ext/_extraction/extract?file=extractionDef.xml&lang=en&maVariable=value&pipeline=pipeline1.xml

  • file: the extraction definition file. This is the only mandatory parameter; the others are optional.
  • lang: the language to be used for the extraction
  • <variables>: the values (content identifiers or Solr query) of the variables used in the query clauses (see previous section)
  • pipeline: the output to be used for rendering results (see previous section). If absent, the default output is used

The URL requires authentication, and token authentication is possible. The logged-in user must have the right to "View and execute extractions".

Query parameter values must be URL-encoded.
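As an illustration of the encoding step, here is a minimal Python sketch that builds the extraction URL from its parameters. The function name build_extraction_url, the https scheme and the sample Solr query value are assumptions made for the example; only the /_extraction/extract path and the file, lang, pipeline and variable parameters come from this page.

```python
from urllib.parse import urlencode


def build_extraction_url(base_url, definition_file, lang=None, pipeline=None, variables=None):
    """Return the extraction URL with every query parameter percent-encoded.

    Hypothetical helper: it only assembles the URL described above.
    """
    params = {"file": definition_file}  # the only mandatory parameter
    if lang is not None:
        params["lang"] = lang
    if variables:
        params.update(variables)  # clause variables: content identifiers or a Solr query
    if pipeline is not None:
        params["pipeline"] = pipeline
    return f"{base_url.rstrip('/')}/_extraction/extract?{urlencode(params)}"


url = build_extraction_url(
    "https://url-du-cms.ext",  # assumed scheme and host placeholder
    "extractionDef.xml",
    lang="en",
    pipeline="pipeline1.xml",
    variables={"maVariable": 'title:"Annual report"'},  # sample value, not from the CMS
)
print(url)
```

Spaces, quotes and colons in the variable value are encoded by urlencode, so the resulting URL can be pasted into a browser or passed to an HTTP client as is.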

Creating a new output

An output is the combination of zero, one or more style sheets (XSLT files) and a result file format (XML, text or PDF).
Style sheets (XSLT files) are used to customize the extraction result file.

If you choose to generate output in PDF format, a style sheet is mandatory.

An output is defined in an XML file located in WEB-INF/param/extraction/config.

It is used to define the output label, style sheets and output format.

Its syntax is as follows:

<pipeline>                    
    <label i18n="true">I18N_KEY</label>                    
    <stylesheets>                    
        <xslt name="foo.xsl"/>                    
        <xslt name="bar.xsl"/>                    
        <xslt name="qux.xsl"/>                    
    </stylesheets>                    
    <out type="text" path="monchemin/vers/${meta1}/mondossier" extension="rtf"/>       

    <extractions>       
        <extraction>extraction_1.xml</extraction>       
        <extraction>extraction_2.xml</extraction>       
    </extractions>                   
</pipeline>                    

The label can be a key in an i18n catalog, or a plain character string.

Style sheets must be located in WEB-INF/param/extraction/stylesheets.

The type attribute on the out tag is mandatory. Valid output types are xml, text and pdf.
The path attribute on the out tag is optional. It is the path of the subfolder or file in which the results of the extraction are placed. The path may contain variables: for example, if you set mypath/to/${title}/mondossier/result.xml, one result file is created in mypath/to/Content1/mondossier/result.xml, another in mypath/to/Content2/mondossier/result.xml, and so on.
The extension attribute on the out tag is optional. It is the extension of the result file(s) when the path attribute designates a folder rather than a file.
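The variable substitution in the path attribute can be pictured with the short Python sketch below. It is only an illustration of the naming rule described above, not the CMS implementation: the function expand_result_paths is hypothetical, and it assumes each content supplies a value for every variable in the path.

```python
from string import Template


def expand_result_paths(path_pattern, contents):
    """Return one result path per content by replacing ${...} variables.

    Illustration only: the CMS resolves these variables itself.
    """
    template = Template(path_pattern)
    return [template.substitute(content) for content in contents]


paths = expand_result_paths(
    "mypath/to/${title}/mondossier/result.xml",
    [{"title": "Content1"}, {"title": "Content2"}],
)
for path in paths:
    print(path)
# mypath/to/Content1/mondossier/result.xml
# mypath/to/Content2/mondossier/result.xml
```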

In addition, if the type is text, the following attributes can be added:

  • encoding: the desired encoding (UTF-8 by default)
  • method: set to text by default

You can list the extractions that this output applies to. If no extraction is listed, the output is offered for all extractions.
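Before deploying an output file, the rules above can be checked with a small script. The following Python sketch is a hypothetical helper, not part of the CMS: it parses an output definition with the standard library and verifies that the type attribute is present and valid, and that a PDF output declares at least one style sheet.

```python
import xml.etree.ElementTree as ET

VALID_TYPES = {"xml", "text", "pdf"}


def read_output_definition(xml_text):
    """Parse an output definition and check the rules described on this page."""
    root = ET.fromstring(xml_text)
    out = root.find("out")
    if out is None or out.get("type") is None:
        raise ValueError("the type attribute on the out tag is mandatory")
    out_type = out.get("type")
    if out_type not in VALID_TYPES:
        raise ValueError(f"invalid output type: {out_type}")
    stylesheets = [xslt.get("name") for xslt in root.findall("stylesheets/xslt")]
    if out_type == "pdf" and not stylesheets:
        raise ValueError("a PDF output needs at least one style sheet")
    extractions = [e.text.strip() for e in root.findall("extractions/extraction") if e.text]
    return {
        "type": out_type,
        "path": out.get("path"),            # optional
        "extension": out.get("extension"),  # optional
        "stylesheets": stylesheets,
        "extractions": extractions,         # empty list: offered for all extractions
    }


SAMPLE = """<pipeline>
    <label i18n="true">I18N_KEY</label>
    <stylesheets>
        <xslt name="foo.xsl"/>
    </stylesheets>
    <out type="text" path="monchemin/vers/${meta1}/mondossier" extension="rtf"/>
    <extractions>
        <extraction>extraction_1.xml</extraction>
    </extractions>
</pipeline>"""

definition = read_output_definition(SAMPLE)
print(definition)
```

The check does not verify that the style sheets exist in WEB-INF/param/extraction/stylesheets; that could be added with a file-system test if the script runs on the server.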


Download results

From the CMS

In the 'Extractions' tab, click on the 'Results' button:

The 'Results' tool opens, containing a list of all existing results files. The name of a results file is constructed from the name of the definition file and the execution date:

To refresh the tool, click on the Refresh button:

To download a results file, select it and click on the Download button.

To delete a results file, select it and click on the Delete button.

From a URL

The results of an extraction run can be downloaded from a URL: url-du-cms.ext/_extraction/download/my/path/to/the/result.xml

my/path/to/the/result.xml is the path to the result file.

The URL requires authentication, and token authentication is possible. The logged-in user must have the right to "View and execute extractions".
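A download can be scripted along the following lines. This Python sketch only prepares the request: the helper build_download_request and the https scheme are assumptions, and because this page does not say how the token is transmitted, any authentication headers are left to the caller.

```python
from urllib.parse import quote
from urllib.request import Request


def build_download_request(base_url, result_path, headers=None):
    """Prepare a request for a result file; the path is percent-encoded.

    Hypothetical helper: authentication headers must be supplied by the caller.
    """
    encoded_path = quote(result_path.lstrip("/"))  # keeps "/" separators, encodes the rest
    url = f"{base_url.rstrip('/')}/_extraction/download/{encoded_path}"
    return Request(url, headers=headers or {})


request = build_download_request("https://url-du-cms.ext", "my/path/to/the/result.xml")
print(request.full_url)
```

The prepared request can then be passed to urllib.request.urlopen and the response body written to a local file.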

Since version 4.4, the URL to access the results is included in the email received when an extraction is run.

In the administration area, you can schedule a "Delete obsolete extractions" task. This takes into account a lifetime parameter: extractions older than this will be deleted. 
