Particle formation is an atmospheric process whereby, at specific spatial locations, aerosol particles form and grow in diameter over the course of a few hours. Particle formation is studied for its role in climate change and human respiratory health.
To study these processes, particle formation must first be detected, i.e. located in space and time. Detected formation processes are then characterized by attributes such as duration and growth rate. The detection and characterization of atmospheric particle formation rely on measurements of particle size distribution, typically obtained with an instrument called a Differential Mobility Particle Sizer (DMPS).
In the context of particle formation research, particle size distribution as measured by a DMPS is observational data – in other words, primary, uninterpreted data. For each day and location, observational data are processed and interpreted to detect and characterize particle formation. Observational data processing and interpretation are carried out by one or more human experts (typically postgraduate students). This constitutes an in silico (i.e., performed on computer) and human-in-the-loop scientific workflow. In the context of particle formation research, the output of such a workflow is information describing (i.e., about) individual particle formation processes.
Information is truthful, meaningful, well-formed data (Floridi, 2011) – in other words secondary, interpreted data. Information is commonly also referred to as “data + meaning” and is also known as “data product.” Meaning is created in workflow execution, in which human experts also ensure that the resulting meaningful well-formed data are truthful. Information describing individual particle formation is further processed into summary statistics, e.g. the average duration. Such summary statistics are ultimately reported in scientific literature.
The use case aims to, primarily, (1) harmonize the information describing particle formation; (2) represent information, specifically the meaning of data, using an appropriate computer language; and (3) acquire and curate information in infrastructure.
| Background | Contact Person | Organization | Contact email |
| --- | --- | --- | --- |
| ICT | Markus Stocker | TIB, PANGAEA | firstname.lastname@example.org |
| RI-Domain | Jaana Bäck | University of Helsinki | email@example.com |
| e-Infrastructure | Yann Le Franc | EUDAT | firstname.lastname@example.org |
| ICT | Robert Huber | UniHB, PANGAEA | email@example.com |
Data Use, Data Acquisition (primarily)
Data Curation (secondarily)
Data Publication (possibly)
Section 1.1 provides a summary of the primary aims of this use case. We begin this section by providing a more detailed description of the aims. Where applicable, we discuss how these aims align with FAIR Principles (Wilkinson, 2016). Aims marked optional will be addressed if time permits.
In detail, the use case aims at the following:
There exist multiple, institutionally and geographically distributed, research groups that perform the scientific task of interpreting particle size distribution observational data to detect and characterize the occurrence of particle formation at determinate spatiotemporal locations. Two groups well-known to the authors of this use case are the Atmospheric Aerosol Physics research group at the University of Eastern Finland and the Aerosol Cloud Climate Interactions research group at the University of Helsinki.
The first objective is to harmonize how these groups describe particle formation – i.e., harmonize the information (as data + meaning) created as a result of observational data interpretation via the scientific workflow. The objective addresses aims 1-4. To catalyze this predominantly community-driven work, we aim to organize a workshop (Q2 2018, Helsinki). The workshop will bring together representatives of the research community, research infrastructures, e-Infrastructures, and ICT specialists. The objective of the workshop is to advance, primarily, Aim 1 and to address, secondarily, Aim 5 and Aim 6. The results of the workshop will inform Aim 4, i.e. the development of an ontology design pattern.
A concrete proposal for such a pattern to build on has already been developed. As part of this work, the Environment Ontology has recently introduced a concept for “formation of particles in an atmosphere”. Figure 1 describes the concept as visualized by the EMBL-EBI Ontology Lookup Service.
Figure 1: The concept "formation of particles in an atmosphere" of the Environment Ontology, as visualized by the EMBL-EBI Ontology Lookup Service. The concept is described as an atmospheric particle formation process that occurs in an atmosphere and has some aerosol as output. The concept is specialized into the formation of liquid droplets and the formation of solid particles.
It is proposed that this use case builds on and extends this concept. This approach ensures that information describing particle formation conforms to the FAIR I1, I2, I3, R1, and R1.3 principles. The approach is thus expected to contribute substantially to improving the interoperability and reusability of information describing particle formation, i.e. data and meaning created by the research community in data interpretation performed using the scientific workflow.
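To illustrate the intended direction, the following is a minimal sketch, in Python with rdflib, of how a single detected particle formation event might be represented using the Environment Ontology concept and a design pattern of the kind envisioned here. The class identifier, the event IRI scheme, and the properties (location, date, duration, growth rate) are illustrative assumptions, not the community-agreed pattern.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Hypothetical namespaces; the actual IRIs depend on the agreed design pattern.
ENVO = Namespace("http://purl.obolibrary.org/obo/ENVO_")   # Environment Ontology
EX = Namespace("http://example.org/particle-formation/")    # placeholder namespace

g = Graph()
event = EX["event/hyytiala/2013-04-04"]

# ENVO_XXXXXXX stands in for the identifier of the concept
# "formation of particles in an atmosphere".
g.add((event, RDF.type, ENVO["XXXXXXX"]))
g.add((event, EX.location, Literal("Hyytiälä, Finland")))
g.add((event, EX.date, Literal("2013-04-04", datatype=XSD.date)))
g.add((event, EX.durationHours, Literal("4.5", datatype=XSD.decimal)))
g.add((event, EX.growthRateNmPerHour, Literal("2.3", datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```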
The second objective is to expose the scientific workflow as a service used by the particle formation research community, thereby connecting the research community to infrastructure. Addressing Aim 5, the objective adopts the solutions proposed in aims 1-4, implements the scientific workflow in Jupyter, and deploys the implementation on e-Infrastructure.
Figure 2: Prototype Jupyter implementation of the scientific workflow. Following the initialization of day and location, particle size distribution observational data are fetched via the SmartSMEAR API and plotted. Such visualization is used by researchers to detect and characterize particle formation at the specified day and location. Information describing particle formation is recorded. Information is represented following the developed ontology design pattern and is acquired and curated by infrastructure (specifically, acquired via a SPARQL endpoint and curated in an RDF database, both deployed on EGI).
Substantial work has already been conducted toward this aim. In fact, a prototype Jupyter Notebook has been implemented and deployed on EGI. Figure 2 provides an overview of the graphical interface that exposes the scientific workflow to the research community.
The workflow provides specialized functions to fetch and plot particle size distribution observational data as well as to acquire information describing particle formation in infrastructure. The details of fetching observational data and converting them into a (Python) native data structure, as well as the details of representing and acquiring information, are taken care of by the infrastructure, also by means of a Python library with specialized functions (e.g., fetchdata; strictly speaking these functions are not necessary, but they simplify the workflow). As such, the research community can focus on the primary task at hand, namely the interpretation of observational data.
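As a rough sketch of what such a specialized fetch-and-plot step might look like, the code below retrieves one day of particle size distribution data via the SmartSMEAR API and renders it as a surface plot. The endpoint URL, table name, query parameters, and column layout are assumptions for illustration; the actual fetchdata function in the library may differ.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import requests

# The endpoint, table name, and query parameters are illustrative assumptions;
# the library's actual call to the SmartSMEAR API may differ.
SMARTSMEAR_URL = "https://avaa.tdata.fi/smear-services/smeardata.jsp"

def fetch_size_distribution(day, table="HYY_DMPS"):
    """Fetch one day of DMPS particle size distribution data as a DataFrame."""
    params = {
        "table": table,
        "from": f"{day} 00:00:00",
        "to": f"{day} 23:59:59",
        "format": "csv",
    }
    response = requests.get(SMARTSMEAR_URL, params=params, timeout=60)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))

def plot_size_distribution(df, metadata_columns=6):
    """Plot the size distribution surface; the size-bin column layout is assumed."""
    bins = df.columns[metadata_columns:]
    plt.pcolormesh(range(len(df)), range(len(bins)),
                   df[bins].T.to_numpy(), shading="auto")
    plt.xlabel("Time step")
    plt.ylabel("Size bin index")
    plt.title("Particle size distribution")
    plt.show()

df = fetch_size_distribution("2013-04-04")
plot_size_distribution(df)
```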
The prototype Jupyter implementation can be extended to address aims 7-9. Curated information describing particle formation can be processed to compute summary statistics. Such computation can be performed by extending the Jupyter Notebook, and can be made easier by extending the Python library with specialized functions.
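A minimal sketch of such a summary computation, assuming information is curated in an RDF database exposed via a SPARQL endpoint and described with a hypothetical durationHours property (endpoint URL and vocabulary are assumptions):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL and property IRI are illustrative assumptions.
sparql = SPARQLWrapper("http://localhost:3030/particle-formation/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/particle-formation/>
    SELECT (AVG(?duration) AS ?avgDuration)
    WHERE { ?event ex:durationHours ?duration . }
""")
results = sparql.query().convert()
print("Average event duration (h):",
      results["results"]["bindings"][0]["avgDuration"]["value"])
```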
Addressing Aim 6, the third objective is for infrastructures (RIs, e-Infrastructures, data centers, or similar) to systematically acquire and curate information describing particle formation. In this regard, it is interesting to determine the kind of infrastructure best equipped to acquire, curate, and possibly publish information describing particle formation. Currently, such information is curated as data (with little or no formal meaning) on the computer hard drives of researchers. Since the information generated by the research community is extremely valuable, e.g. for integrated particle formation analysis, as well as essential for the reproducibility of summary statistics published in literature, its systematic acquisition and long-term curation are clearly important. A further interesting aspect is arguably the mode of information acquisition. In the proposed Jupyter implementation, acquisition occurs via a specialized function each time information describing particle formation is recorded. Technically, recording can occur on any infrastructure, including the researcher’s workstation. The research community should ultimately decide which acquisition (and publication) mode and which infrastructure it wants to operate with.
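For illustration, a hedged sketch of how such per-event acquisition could be implemented as a specialized function performing a SPARQL update against the curating infrastructure; the update endpoint and vocabulary below are assumptions, not the prototype's actual interface.

```python
from SPARQLWrapper import SPARQLWrapper, POST

# The update endpoint and the vocabulary are illustrative assumptions; in the
# prototype, acquisition happens through a specialized library function each
# time an event is recorded.
UPDATE_ENDPOINT = "http://localhost:3030/particle-formation/update"

def record_event(event_iri, duration_hours, growth_rate_nm_per_hour):
    """Insert one recorded particle formation event into the RDF database."""
    sparql = SPARQLWrapper(UPDATE_ENDPOINT)
    sparql.setMethod(POST)
    sparql.setQuery(f"""
        PREFIX ex: <http://example.org/particle-formation/>
        INSERT DATA {{
            <{event_iri}> ex:durationHours {duration_hours} ;
                          ex:growthRateNmPerHour {growth_rate_nm_per_hour} .
        }}
    """)
    sparql.query()

record_event("http://example.org/particle-formation/event/hyytiala/2013-04-04", 4.5, 2.3)
```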
Realizing these objectives will have a couple of interesting impacts. First, harmonized information describing particle formation will result in (more) interoperable and reusable (FAIR) data that can be integrated for further processing, e.g. the spatiotemporal visualization or statistical analysis of particle formation events. This is expected to hold across distributed research groups and is a result of adopting a community-agreed ontology design pattern, the harmonized representation of information, and the exposure of the research groups within the community to a common scientific workflow for particle size distribution observational data interpretation.
The second impact is the possible systematic acquisition and curation of explicit and formal (i.e., machine actionable) meaning of data (in addition to the data themselves). Rather than merely acquiring data products in the form of, e.g., visualizations such as maps or plots (with implicit information content not available to machines), this use case aims to set an example for how infrastructures can systematically acquire and curate truthful, meaningful, well-formed data (i.e., information) whereby meaning is explicit and formal. Furthermore, we expect that harmonized information generated by distributed research groups will be easier for infrastructure to acquire, and thus to curate and possibly publish. As such, the use case contributes to advancing infrastructures from the current data systems to information and knowledge-based systems (Stocker, 2017) that manage information about natural worlds and their phenomena of interest (in addition to information about people, organizations, instrumentation, publications, etc.).
A key challenge is to bring together representatives of the research community studying particle formation and to reach agreement on how to harmonize the information describing particle formation. It is unclear whether such agreement is desired and achievable. At this stage it is also unclear whether the required people can be motivated to attend the planned workshop.
A further challenge is to motivate the use of the scientific workflow implementation across research groups in the community. The greatest benefit of the proposed approach will result from research groups adopting the service, rather than individually implementing their own.
A third difficulty is that it remains unclear whether it is possible for infrastructure to systematically acquire, curate, and potentially publish the information describing particle formation as envisioned in this use case.
In the basic scenario, research groups, and specifically individual researchers, of the atmospheric aerosol particle formation research community are served by a service that implements a scientific workflow for the interpretation of particle size distribution observational data, together with the systematic acquisition, curation, and possible publication of the information describing particle formation that results from this interpretation.
The service should enable researchers, members of distributed research groups, to execute a scientific workflow that fetches and visualizes observational data from one or more (selected) research infrastructures. The workflow should support the detection and characterization of (i.e., extraction of information describing) particle formation. Such description should reflect a community-agreed ontology design pattern.
Furthermore, the scientific workflow should support the further processing of information describing particle formation, e.g., to obtain summary statistics about particle formation at specific spatiotemporal locations.
Advanced scenarios include the possibility for the research community to inspect provenance relating summary statistics published in literature to information describing particle formation and to particle size distribution observational data, as well as to the relevant agents and activities involved.
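The following is a minimal sketch, using the PROV-O vocabulary and placeholder IRIs, of the kind of provenance chain such an advanced scenario would make inspectable: a summary statistic derived from event descriptions, which are in turn derived from observational data and attributed to a researcher. The concrete provenance model is yet to be developed.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

# All IRIs below are placeholders for illustration only.
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/particle-formation/")

g = Graph()
statistic = EX["statistic/average-duration-2013"]
event = EX["event/hyytiala/2013-04-04"]
raw_data = EX["dataset/hyytiala-dmps-2013-04-04"]
analyst = EX["agent/researcher-1"]

g.add((statistic, RDF.type, PROV.Entity))
g.add((statistic, PROV.wasDerivedFrom, event))   # summary statistic from event description
g.add((event, PROV.wasDerivedFrom, raw_data))    # event description from observational data
g.add((event, PROV.wasAttributedTo, analyst))    # interpretation by a researcher

print(g.serialize(format="turtle"))
```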
Of interest to advanced scenarios is also the possibility to openly publish information describing particle formation as well as the support for functionality relevant to data publishing, such as persistent identification and citation of information describing particle formation.
The required components are Jupyter, the implementation of the scientific workflow as a Jupyter Notebook, an RDF database with SPARQL endpoint, as well as a Python library with specialized functions. Figure 2 shows a visualization of the prototype implementation of the scientific workflow. The components are containerized using Docker and can easily be deployed on infrastructures such as EGI. Indeed, this has already been tested with the deployment at http://220.127.116.11:8888. Recently, we have adopted JupyterHub in order to support authentication of multiple users and management of individual notebooks.
Most of these components exist, specifically Jupyter, JupyterHub and the RDF database with SPARQL endpoint. These are all open source projects of high technical readiness.
The scientific workflow should be extended with further functionality. Some extensions are already planned, while others will follow from research community requirements. Prototype functionality exists for the automated machine detection of particle formation, supported by a trained machine learning classifier. We plan to extend this to support the automated machine characterization of particle formation, with functionality designed to extract information about detected particle formation, such as duration and growth rate. This intermediate step of the scientific workflow aims at supporting the research community in observational data interpretation by providing an automated machine extraction of information describing particle formation, which can subsequently be reviewed by researchers. As such, the automated extraction results in semantic content, while the expert review results in semantic information (Floridi, 2011). Semantic information is truthful, in addition to being meaningful well-formed data. Truthfulness is determined by human experts.
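To make the detection step concrete, here is a minimal sketch of training such a classifier with scikit-learn, assuming each day's size distribution has been reduced to a fixed-length feature vector and labeled by experts as event or non-event. The feature files, feature design, and model choice are illustrative assumptions rather than the actual prototype.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Illustrative assumptions: each day is summarized as a fixed-length feature
# vector derived from the size distribution surface, and experts have labeled
# days as event (1) or non-event (0). The .npy files are hypothetical.
X = np.load("daily_features.npy")   # shape: (n_days, n_features)
y = np.load("daily_labels.npy")     # shape: (n_days,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```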
Scientific workflow functionality depends on the specialized functions provided by the Python library. The library, and thus the scientific workflow implementation, is at prototype maturity. It works reliably but would benefit from a thorough redesign based on software engineering principles. The library is publicly available on GitHub.
The use case involves observational data for particle size distribution. Such data are currently obtained via the SmartSMEAR API of the SMEAR research infrastructure. Envisioned functionality may support the selection of observational data sources, i.e. one of possibly several research infrastructures. Such selection could be supported as an additional (configuration) step in the scientific workflow, as sketched below. We are also considering linking this functionality with the ENVRIplus Knowledge Base developed by Theme 2. The Knowledge Base is designed to manage and support the querying of research infrastructure descriptions. Assuming descriptions of relevant infrastructures, such as SMEAR and ACTRIS, are available, a demonstrator may succeed in linking the scientific workflow discussed here to the Knowledge Base to support the selection of observational data sources as well as the automated configuration of the corresponding API call to fetch observational data.
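A sketch of how such a configuration step might look, with the mapping of research infrastructures to API endpoints eventually retrieved from the ENVRIplus Knowledge Base rather than hard-coded; the names and endpoints below are illustrative assumptions.

```python
# Illustrative mapping; in the envisioned setup this information would be
# queried from the ENVRIplus Knowledge Base rather than hard-coded.
DATA_SOURCES = {
    "SMEAR": {
        "endpoint": "https://avaa.tdata.fi/smear-services/smeardata.jsp",
        "default_table": "HYY_DMPS",
    },
    "ACTRIS": {
        "endpoint": "https://example.org/actris-api/size-distribution",  # placeholder
        "default_table": None,
    },
}

def configure_source(name):
    """Return the API configuration for the selected research infrastructure."""
    try:
        return DATA_SOURCES[name]
    except KeyError:
        raise ValueError(f"Unknown data source: {name}. "
                         f"Available: {', '.join(DATA_SOURCES)}")

source = configure_source("SMEAR")
```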
Overall, the use case is arguably already in a fairly advanced stage. While further technical advances are possible, the more critical advancements now rely on collaborative work with the research community, such as achieving agreement on representing information describing particle formation and adoption of the scientific workflow as a service.
We envision the following implementation plan. First, we plan to organize the aforementioned workshop during Q1 2018 and hold the workshop during Q2 2018, possibly in April ahead of the next ENVRIweek, which would allow for presenting results on aims 1-3 during ENVRIweek. The successful execution of the workshop is a milestone for this use case.
During Q1 and Q2 2018, we plan to make further improvements to the Python library of specialized functions and thus to the implementation of the scientific workflow. Specifically, we plan to create a deployment based on JupyterHub; update the information representation to reflect the ontology design pattern currently published by the Environment Ontology; improve the functionality for automated extraction of information about particle formation; and develop a concept for integrating provenance. Each of these steps is a milestone for this use case.
In the second half of 2018, ahead of the Fall ENVRIweek, we plan to address aims 4 and 5. During the ENVRIweek we plan to have a demonstration of the use case.
The 2018 ENVRIweeks will serve to address Aim 6. Relevant research infrastructures for Aim 6 include SMEAR and ACTRIS as well as EUDAT and data centers (e.g. PANGAEA).
Assuming we have achieved the milestones as planned, the remaining time of the ENVRIplus project during 2019 will be used to complete aims 7-9. The first half of 2019 should also lead to concrete results regarding research groups of the community using the proposed service. Furthermore, we will complete the report that results from addressing Aim 6. Depending on the developments with infrastructures, this report may discuss a concrete implementation of information acquisition, curation, and possibly publishing in a sustained infrastructure. The final results and conclusions of the use case will be presented during the Spring ENVRIweek in 2019.
The developed concept for provenance and its implementation may serve as a demonstrator to ENVRIplus Theme 2 WP 8.
By connecting a research community with infrastructure in the data use phase of the research data lifecycle, the use case relates to ENVRIplus Theme 2 WP 7. The results of this use case may serve as a demonstrator to the WP insofar as it supports a concrete research community in data analysis, specifically data interpretation, and connects the output of analysis, i.e. information, with the data acquisition phase of the subsequent iteration of the lifecycle.
Finally, linking the scientific workflow with the ENVRIplus Knowledge Base in order to support selection of observational data sources and, possibly, automated retrieval of data required in workflow execution also relates to Theme 2 activities and the implementation may serve as a demonstrator in this context.
The use case expects the following (primary) outputs:
Floridi, L. (2011). The Philosophy of Information. Oxford University Press.
Perez, F., Granger, B. E. (2007). IPython: A System for Interactive Scientific Computing, in Computing in Science & Engineering, vol. 9, no. 3, pp. 21-29. https://doi.org/10.1109/MCSE.2007.53
Stocker, M. (2017). Advancing the Software Systems of Environmental Knowledge Infrastructures. In Abad Chabbi and Henry W. Loescher (Eds.), Terrestrial Ecosystem Research Infrastructures: Challenges and Opportunities, pp. 399–423. Taylor & Francis Group, CRC Press. ISBN: 9781498751315 https://doi.org/10.1201/9781315368252-16
Wilkinson, M. D., Dumontier, M., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3. https://doi.org/10.1038/sdata.2016.18