How to properly create a dataset
Before your first datasets are published, the data must be prepared for opening. Some data formats make further processing harder, while others make automated reuse much easier. To provide open data, it is important to publish it in formats that are as open as possible. For example, tabular data in PDF format is easy for humans to read, but difficult for machines to interpret. Read below for information on what to consider when creating datasets and how to edit them properly.
We recommend that you include the technical preparation of datasets in the training of your organization's IT manager. The references to additional sources of information that we provide throughout the text will also help you understand the topic better.
Analysis of the dataset, proposal of the publication method
Before publishing your datasets, it is important to think about the content and structure of the data itself. Of course, it is possible to publish, for example, a simple CSV file with a clear data structure, and it will be a usable resource. Even in this case, however, the dataset should undergo an initial analysis. The analysis is carried out by the dataset's curator, who, in cooperation with the IT specialist, checks the content of the dataset, chooses the degree of openness at which it will be published and, if necessary, chooses the data schema defining its structure.
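As an illustration of what checking the content of a simple CSV file against a data schema might look like, here is a minimal sketch in Python. The column names, the sample rows and the `validate_csv` helper are hypothetical and serve only as an example. A real dataset would normally be checked against the schema chosen during the analysis.

```python
import csv
import io

# Hypothetical schema: column name -> converter that raises ValueError on bad input.
SCHEMA = {
    "id": int,
    "name": str,
    "capacity": int,
}

# Hypothetical sample data with two deliberate errors (an empty cell and a non-numeric id).
SAMPLE = "id,name,capacity\n1,Central Library,120\n2,Town Hall,\nx,Museum,40\n"


def validate_csv(text, schema):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # The header must list exactly the schema columns, in the same order.
    if header != list(schema):
        problems.append(f"header {header} does not match schema {list(schema)}")
        return problems
    # Data rows start on line 2, because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column, convert in schema.items():
            value = row[column]
            if value is None or value.strip() == "":
                problems.append(f"line {line_no}: '{column}' is empty")
                continue
            try:
                convert(value)
            except ValueError:
                problems.append(
                    f"line {line_no}: '{column}' has invalid value {value!r}"
                )
    return problems


for problem in validate_csv(SAMPLE, SCHEMA):
    print(problem)
```

A check of this kind only covers the structure and basic data types. Whether the values are correct in substance still has to be assessed by the curator.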
As a number of public bodies publish similar datasets, uniform data schemas for specific areas are being developed at national level. This is an effort to harmonise the approach of data providers and standardise the published datasets for a given domain, which greatly simplifies subsequent use by data processors. Thus, if an Open Formal Standard already exists for a given dataset, it can be used or extended to define the content of the dataset.
The open formal standards within the meaning of Section 3(9) of Act No. 106/1999 Coll., on free access to information are binding for open data providers who are obliged subjects under Section 4b(1) of Act No. 106/1999 Coll., on free access to information. These are technical recommendations focused on selected datasets that ensure that the same data published by different providers will be interoperable. This makes it easier to use such data, regardless of which provider it comes from. An overview of Open Formal Standards (OFNs), information on their importance and usability can be found here.
Select the degree of openness
To be truly usable, the data you publish should meet basic recommended standards. In practice, this means that the data often needs to be edited (cleaned). For example, when publishing an Excel file, check the structure and integrity of the data against the degree of openness chosen in your publication plan. In the Czech Republic, as in Germany, a definition of 5 degrees of openness is used for this purpose at national level. In both countries this is based on the five-star model of Tim Berners-Lee:
- Level 1 – the dataset is available on the WWW with appropriate conditions for the use of open data
- Level 2 – the dataset is provided in a machine-readable format that allows automated processing
- Level 3 – the dataset is provided in an open format, i.e. a format with a freely available specification
- Level 4 – IRIs are used to identify entities in the dataset
- Level 5 – the dataset meets the Linked Data standard
A detailed description of the levels of openness, including recommended technical standards for each level, can be found at opendata.gov.cz.
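Because each level builds on the previous one, the degree of openness can be read as a series of cumulative checks. The following Python sketch illustrates this idea only. The two format sets and the `openness_level` function are assumptions made for the example, not an official classification, and treating XLS as machine-readable but not open follows the usual illustration of the five-star model.

```python
# Assumed classification of a few common formats; not an official list.
MACHINE_READABLE = {"xls", "csv", "json", "xml", "rdf", "ttl", "jsonld"}
OPEN_FORMATS = {"csv", "json", "xml", "rdf", "ttl", "jsonld"}


def openness_level(file_format, open_terms, uses_iris=False, linked_data=False):
    """Return the highest level (0-5) whose conditions, and all lower ones, hold."""
    fmt = file_format.lower().lstrip(".")
    checks = [
        open_terms,               # Level 1: published with open conditions of use
        fmt in MACHINE_READABLE,  # Level 2: machine-readable format
        fmt in OPEN_FORMATS,      # Level 3: open format
        uses_iris,                # Level 4: entities identified by IRIs
        linked_data,              # Level 5: meets the Linked Data standard
    ]
    level = 0
    for passed in checks:
        if not passed:
            break  # a higher level cannot be reached without the lower ones
        level += 1
    return level


print(openness_level("pdf", open_terms=True))
print(openness_level("csv", open_terms=True))
```

Under these assumptions, a PDF published with open conditions of use stays at level 1, while the same table published as CSV reaches level 3.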
When developing the publication plan, the coordinator, in collaboration with the individual curators, selected the specific datasets to be published. In doing so, they took into account a number of things, such as the compliance of the published datasets with legal regulations and publication standards, the assessment of the need for and method of data transformation, and the quality of the datasets.
Based on this information, a risk analysis needs to be carried out before the actual publication, and a way to address any risks needs to be proposed. If you use the model publication plans, the risks are already identified in them and can be adopted. If you are creating your own publication plan in which you plan to open your own datasets, we recommend following the analysis recommended by the Open Data Portal (Identifying the risks of opening datasets). It covers the following risks:
- Disclosure of data in violation of the law
- Violation of trade secret protection
- Violation of the protection of personal data
- Disclosure of inappropriate data or information
- Misinterpretation of data
- Absence of data consumers
- Overlapping of data
- Threats to the security of the state / property / persons
For more information on risk analysis in the decision to publish datasets, click here.
Determination of conditions of use
When preparing your own datasets, it is important to think about how they will be used. The first thing to do is to clarify whether your organisation actually owns the rights to the data in question and whether you can regulate the reuse of the data by third parties. In the case of data that has been collected by a service provider on behalf of your organisation, for example, there may be contractual rules that restrict the transfer and reuse of the data. In such cases, your organisation would need to negotiate with the rights holders.
Within the Czech Republic, you can use the clear guide on how to set licensing terms for the seamless use of open data, published at opendata.gov.cz.
In the case of Bavaria, we recommend the information published on the German National Portal, which makes it possible to publish datasets both at the national level and in the Open Data Portal Bayern. A list of licenses accepted on GovData.de can be found here.
If your organisation owns the rights to the datasets, you can focus on reuse policies that make it clear to users how they may further process your data. Any additional restrictions on reuse limit users' ability to use your data in a meaningful way. Therefore, the policy should allow for free reuse.