Checklist for the Replicability Line of Assessment

This checklist is used in combination with the guidelines for the Replicability Line of Assessment for the ISWC 2020 Reproducibility Track.

Hypothesis and Overall Evaluation Design

  • Hypothesis
  • Overall design of evaluation
  • Methods used to acquire supporting evidence
  • Independent & dependent variables
  • Environment factors to control for
  • (in case of several hypotheses) which (parts of the) experiments refer to which hypothesis

Target and Study User Groups

  • Target user group - demographics, (level of) domain expertise & technological experience, experience with Semantic Web technologies
  • Study user group - demographics, (level of) domain expertise & technological experience, experience with Semantic Web technologies
  • Differences & commonalities between the target and study user groups
  • Recruitment channels & venues
  • Compensation provided for participants

Input Datasets

  • For publicly available datasets - version, retrieval date and location, other metadata (see the metadata sketch after this list)
  • For publicly available datasets - preprocessing scripts & resources
  • For private datasets - characteristics of the dataset and how they relate to the study and tasks
  • For private datasets - example inputs for the different tasks; where possible, sample anonymised data should be provided
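
Dataset provenance is easiest to report (and to check) when it is recorded in a small machine-readable file kept next to the preprocessing scripts. The sketch below is only an illustration; the dataset name, URL and field names are hypothetical, not a schema prescribed by this checklist.

```python
# dataset_metadata.py -- illustrative provenance record for a public input dataset.
# Field names and values are examples only, not a schema prescribed by this checklist.
import json

dataset_metadata = {
    "name": "example-benchmark",          # hypothetical dataset name
    "version": "2.1",                     # exact release used in the study
    "retrieved_on": "2020-05-14",         # retrieval date
    "retrieved_from": "https://example.org/datasets/example-benchmark",  # placeholder URL
    "license": "CC BY 4.0",
    "preprocessing_scripts": ["scripts/clean.py", "scripts/sample_tasks.py"],
    "notes": "Duplicates removed; 200 task inputs sampled per condition.",
}

if __name__ == "__main__":
    # Keep the record under version control next to the preprocessing scripts.
    with open("dataset_metadata.json", "w") as f:
        json.dump(dataset_metadata, f, indent=2)
```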

Study Tasks

  • Study conditions (within-subject vs between-subject studies) and number of participants
  • Task assignment and balancing for ordering effects (see the counterbalancing sketch after this list)
  • Tasks and input data per task for each condition (all combinations of tasks if participants have received different (sets of) tasks)
  • Tasks related to users’ (level of) expertise
  • Common tasks vs rare tasks
  • Solution(s) to the tasks, highlighting where these differ significantly between users
  • Success criteria - binary or continuous & how they are determined
  • Unexpected results (interim and final) and how these contribute to findings and future work
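
For within-subject designs, a common way to balance task order across participants is a Latin-square rotation, so that each task appears in each position about equally often. The sketch below is a minimal illustration assuming four hypothetical tasks and eight participants; other counterbalancing schemes are equally valid, and whichever scheme is used should be reported.

```python
# task_ordering.py -- illustrative Latin-square counterbalancing of task order.
# Task labels and the number of participants are hypothetical.
TASKS = ["T1", "T2", "T3", "T4"]

def latin_square_orders(tasks):
    """Return one rotated task order per row of a cyclic Latin square."""
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_orders(num_participants, tasks=TASKS):
    """Cycle through the Latin-square rows so each order is used about equally often."""
    orders = latin_square_orders(tasks)
    return [orders[i % len(orders)] for i in range(num_participants)]

if __name__ == "__main__":
    for participant, order in enumerate(assign_orders(8), start=1):
        print(f"P{participant:02d}: {' -> '.join(order)}")
```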

Experimental Settings

Setup

  • Hardware configuration and special purpose hardware (for instance, eye-tracking cameras)
  • Software environment and special purpose software (for instance, screen recording applications); see the environment-capture sketch after this list
  • Surrounding environment & any special environmental conditions under which the system/method is intended to be used
  • Interaction context (for instance, touch interaction, joystick, large & high resolution displays, etc.)
  • Presence of observers/members of the evaluation team and their role
  • Level of expertise/experience of each member of the evaluation team (one of: novice, some experience, experienced, very experienced)
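
Software versions are easy to misreport from memory; capturing them programmatically at study time avoids this. The sketch below (Python 3.8+) is one possible way to do so; the listed packages are placeholders for the study's real dependencies.

```python
# capture_environment.py -- illustrative snapshot of the software environment used in a study.
# The packages listed are placeholders; record whatever the study software actually depends on.
import json
import platform
import sys
from importlib import metadata

PACKAGES = ["numpy", "pandas"]  # replace with the study's real dependencies

def snapshot():
    packages = {}
    for pkg in PACKAGES:
        try:
            packages[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            packages[pkg] = "not installed"
    return {
        "python": sys.version,
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```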

Procedure

  • Provide a timeline of all evaluation phases
  • Motivate your choice in cases where several options are available (for instance, the selection of questionnaires)
  • Implementation details of think-aloud protocols or similar

Analysis of Collected Data

Anonymized raw data

  • Measurements of dependent variables (see the example record layout after this list)
  • Overall time & time per task for each participant, against expected times (average, minimum and maximum length of each session, whether one-off or longitudinal)
  • Answers to standardized or custom questionnaires
  • Observer notes, if an observer was present
  • Results per group (in cases where the study participants have different backgrounds)
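
A flat, anonymised table with one row per participant and task keeps measurements, per-task times and group membership reviewable in one place. The layout below is illustrative; the column names, identifiers and values are made up.

```python
# raw_data_format.py -- illustrative layout for anonymised per-task measurements.
# Column names and values are examples; adapt them to the study's dependent variables.
import csv

FIELDS = ["participant_id", "group", "condition", "task",
          "success", "time_seconds", "expected_time_seconds"]

rows = [
    {"participant_id": "P01", "group": "domain-expert", "condition": "tool-A",
     "task": "T1", "success": 1, "time_seconds": 142, "expected_time_seconds": 120},
    {"participant_id": "P01", "group": "domain-expert", "condition": "tool-B",
     "task": "T2", "success": 0, "time_seconds": 311, "expected_time_seconds": 240},
]

with open("raw_measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```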

Analysis

  • Motivate the data analysis method and statistical tests (see the example analysis script after this list)
  • Relevant scripts, libraries & other resources for analysis and generating the figures
  • Potential biases and threats to validity
  • Data from pilot studies, together with the key changes made before the final study and explanations for these changes
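
As an example of shipping analysis scripts alongside the data: a within-subject comparison of task times between two conditions could be tested with a non-parametric paired test. The sketch below assumes the hypothetical CSV layout and condition names from the raw-data example above and uses a Wilcoxon signed-rank test; it illustrates the kind of script expected, not the test that must be used.

```python
# analysis_example.py -- illustrative paired test on the anonymised raw data.
# Assumes the CSV layout of the raw-data sketch above, with two hypothetical
# conditions "tool-A" and "tool-B"; the choice of test must still be motivated
# for the actual data (distribution, sample size, corrections, ...).
import pandas as pd
from scipy.stats import wilcoxon

df = pd.read_csv("raw_measurements.csv")

# One row per (participant, task), with the task time under each condition.
paired = (df.pivot_table(index=["participant_id", "task"],
                         columns="condition",
                         values="time_seconds")
            .dropna())

# Paired, non-parametric comparison of task times between the two conditions.
stat, p_value = wilcoxon(paired["tool-A"], paired["tool-B"])
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p_value:.4f}")
```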

Acknowledgements: These guidelines are inspired by Valentina’s experience of participating in the organization of the VOILA! Workshop series, as well as co-authoring the paper A Framework to Conduct and Report on Empirical User Studies in Semantic Web Contexts. We thank Aba-Sah Dadzie, Catia Pesquita and Patrick Lambrix for sharing their experience, feedback and suggestions while refining this document.