How to ... find and use detailed financial data on thematic investments?

1. What is categorisation data? How to use it? 

How much cohesion policy funding is invested in renewable energy?  What investment is planned in clean urban transport?  How is research infrastructure investment progressing? What about transport ? ... Water investments?
... Education?  ... Labour market support?  
Categorisation data provides the answers. Categorisation data is most detailed financial data we have from national and regional programmes on the specific actions financed with EU cohesion policy funding. The categorisation system is made up of different dimensions looking at different aspects of the investments from actions (intervention fields), to forms of finance (grants, equity, loans, etc.), to territory type (urban, rural, etc.,) and other dimensions. A more complete overview of the categorisation information system is presented in this story.
Our specific focus in this "How to ..." is on this 2014-2020 "categorisation planned vs implemented" dataset, which compares the total planned financial categorisation amounts to the annual financial data reported in cumulative selected projects and their expenditure (decided and spent). In particular we focus on how to extract and/or analyse progress in a specific thematic area. As an example we will look for the (thematic) renewable energy investments, which we can identify using the "intervention field dimension".  
The Intervention Field dimension is particularly important as it provides data on thematic allocations that are key to the tracking of climate action, biodiversity and air quality tracking and, ultimately, for the targeting of the evaluation of investments.

2. First steps

- Understand the data structure 

To use the "categorisation planned vs implemented" data users need to understand what the data is about and how it is structured. 
Tips: 
  1. Read the description of the data on the metadata (landing) page. It provides important pointers to the data structure and a description of the different data columns.
  2. The dataset "Categorisation planned vs implemented" has the following important characteristics:
  • It provides annual snapshots of the cumulative data for the entire period 2014-2020, at the end of each year;
    => Do not add the years together!  Make sure you filter or separate the years in your analysis.
  • The data contains multiple dimensions looking at the same volume of financial data from different points of view.
    => Again, do not aggregate several dimensions. Make sure you filter or separate dimensions in your analysis.
  • Each dimension is made up of "dimension codes" with long and short titles. You will need to look at the lists of codes for the elements that you want to analyse. 
    => for renewable energy we need to look for "InterventionField" codes 009, 010, 011 and 012. 
  • NB: Several dimensions have "planned" and "implemented" values - such as the Intervention Filed dimension.  Some dimensions only have implemented values as the programmes are not asked to make specific plans and allocations in advance for those dimensions.

- Use available filtering and aggregation functions on the platform 

STEP 1: If you are making an ad hoc query, you can download a filtered section of the dataset to explore.
Starting from the metadata page choose "View Data" as below:
STEP 2: On the data page choose "Filter":
STEP 3:  Add filter conditions to the data as needed:
  • Add a filter condition for "Dimension type"; as we are interested in thematic allocations we need to write "InterventionField" under the "Dimension Type" filter condition (NB: the precise spelling and capitalisation must be used); Do not add other dimension types; 
  • Add a second filter condition choosing "Year" from the choice of available variables; We are interested only in the last available year [2019];
    (We could choose several or all years but must remember not to aggregate the yearly values later).
  • We are interested in the renewable energy investments involving codes 009, 010, 011 and 012. We add "Dimension Code" as a filter condition and add the four codes making sure they all have a blue tick.  
  • As we filter, the dataset view to the left of the filter menu changes to show a sample of the data.
  • You might also want to limit the data by EU Fund or by country, or some other combination.

3. Exporting / reuse

The next step is exporting (downloading) open data for analysis and reuse. 

Option 1: Static export for ad hoc local analysis

Once you have filtered the data (or you are ready to export it unfiltered), you can manually export the dataset. 
Clicking on "Export" you are offered a variety of formats. The choice will depend on your own preferences. If you are using Excel in an EU language, we recommend to use the format: "CVS for Excel (Europe)".

Check that the downloaded data is correctly formatted 
The formatting of the downloaded data should be checked to ensure accurate export of text and numeric values. 
Tip: check the aggregated values else where on the platform (see f.i. the categorisation sankey data story , which links to the latest planned and implemented values by fund, built with the planned vs implemented data, and allows you to cross check your queries against the values given in the sankey charts). 
Below you can check the correct planned allocated to the four renewable energy codes 009, 010, 011 and 012 showing the aggregated planned allocations for ERDF and CF as of end 2019.   Using the filter menu (top right of the chart) users can  change the intervention fields / funds / years as needed. 
NB: In the #ESIFopendata user guide annexes there are tips for correcting MS excel settings to address a common problem with punctuation of numeric values. 

Option 2: Dynamic export using ODATA

With an ODATA link you can link to an #ESIFopendata datastet, build your own analysis and later you can refresh the data and your analysis automatically
For instance, connect a new EXCEL© worksheet to the open data dataset using an ODATA connection, then you can make your analysis of the data in excel (i.e. using pivot tables). Later, if you know the original data has been updated, you can refresh your copy of the data and your analysis will update automatically. 
Tips: 
  • Set up the ODATA link to download the data before you start your analysis (it saves time). 
  • Find the ODATA address under the "Export" menu on the data page (see across).  
  • Copy the ODATA link into the software you are planning to connect.  
    In Excel(c) 2016, from the menu follow
    => Data / New Query / From Other Sources / From ODATA Feed 
    => Then paste the complete ODATA V2 / V4 Endpoint.     
ODATA Advantages:
  • Automatic updating of your local data copy from the open data original. This is very useful if you will reuse your analysis in the future. 
  • It is not difficult to set an ODATA link if you use this guide by Socrata.
Disadvantages:
  • ODATA downloads entire datasets.  This may lead to regular downloading of significant MB of data that you may not need. 

Option 3: Dynamic reuse using SODA API queries (advanced)

Each #CohesionOpenData dataset has documentation for making Soda API calls (queries).  An API - application programming interface - is a computing interface that defines interactions between multiple software intermediaries.
Advantages: 
  • Open data content can be published automatically and is available for multiple users and channels; 
  • It allows automated data sharing to be customised by users; 
  • Efficiency and adaptability.
Disadvantages:
  • You need IT coding skills to access and reuse the open data. 

4. More information

For more information on the Categorisation system 
Follow us on Twitter: @RegioEvaluation - #ESIFopenData

Contact