Checklist for the Replicability Line of Assessment

This checklist is used in combination with the guidelines for the Replicability Line of Assessment for the ISWC 2020 Reproducibility Track.

Hypothesis and Overall Evaluation Design

  • Hypothesis
  • Overall design of evaluation
  • Methods used to acquire supporting evidence
  • Independent & dependent variables
  • Environment factors to control for
  • (in case of several hypotheses) which (parts of the) experiments refer to which hypothesis

Target and Study User Groups

  • Target user group - demographics, (level of) domain expertise & technological experience, experience with Semantic Web technologies
  • Study user group - demographics, (level of) domain expertise & technological experience, experience with Semantic Web technologies
  • Differences & commonalities between the target and study user groups
  • Recruitment channels & venues
  • Compensation provided for participants

Input Datasets

  • For publicly available datasets - version, retrieval date and location, other metadata (see the metadata sketch after this list)
  • For publicly available datasets - preprocessing scripts & resources
  • For private datasets - characteristics of the dataset and how they relate to the study and tasks
  • For private datasets - example inputs for the different tasks; where possible, sample anonymised data should be provided
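
Dataset provenance is easiest to report (and to check) when it is recorded in a small machine-readable file kept next to the preprocessing scripts. The sketch below is only an illustration; the dataset name, URL and field names are hypothetical, not a schema prescribed by this checklist.

```python
# dataset_metadata.py -- illustrative provenance record for a public input dataset.
# Field names and values are examples only, not a schema prescribed by this checklist.
import json

dataset_metadata = {
    "name": "example-benchmark",          # hypothetical dataset name
    "version": "2.1",                     # exact release used in the study
    "retrieved_on": "2020-05-14",         # retrieval date
    "retrieved_from": "https://example.org/datasets/example-benchmark",  # placeholder URL
    "license": "CC BY 4.0",
    "preprocessing_scripts": ["scripts/clean.py", "scripts/sample_tasks.py"],
    "notes": "Duplicates removed; 200 task inputs sampled per condition.",
}

if __name__ == "__main__":
    # Keep the record under version control next to the preprocessing scripts.
    with open("dataset_metadata.json", "w") as f:
        json.dump(dataset_metadata, f, indent=2)
```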

Study Tasks

  • Study conditions (within-subject vs between-subject studies) and number of participants
  • Task assignment and balancing for ordering effects (see the counterbalancing sketch after this list)
  • Tasks and input data per task for each condition (all combinations of tasks if participants have received different (sets of) tasks)
  • Tasks related to users’ (level of) expertise
  • Common tasks vs rare tasks
  • Solution(s) to the tasks, highlighting where these differ significantly between users
  • Success criteria - binary or continuous & how they are determined
  • Unexpected results (interim and final) and how these contribute to findings and future work
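
For within-subject designs, a common way to balance task order across participants is a Latin-square rotation, so that each task appears in each position about equally often. The sketch below is a minimal illustration assuming four hypothetical tasks and eight participants; other counterbalancing schemes are equally valid, and whichever scheme is used should be reported.

```python
# task_ordering.py -- illustrative Latin-square counterbalancing of task order.
# Task labels and the number of participants are hypothetical.
TASKS = ["T1", "T2", "T3", "T4"]

def latin_square_orders(tasks):
    """Return one rotated task order per row of a cyclic Latin square."""
    n = len(tasks)
    return [[tasks[(row + col) % n] for col in range(n)] for row in range(n)]

def assign_orders(num_participants, tasks=TASKS):
    """Cycle through the Latin-square rows so each order is used about equally often."""
    orders = latin_square_orders(tasks)
    return [orders[i % len(orders)] for i in range(num_participants)]

if __name__ == "__main__":
    for participant, order in enumerate(assign_orders(8), start=1):
        print(f"P{participant:02d}: {' -> '.join(order)}")
```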

Experimental Settings

Setup

  • Hardware configuration and special purpose hardware (for instance, eye-tracking cameras)
  • Software environment and special purpose software (for instance, screen recording applications); see the environment-capture sketch after this list
  • Surrounding environment & any special environmental conditions under which the system/method is intended to be used
  • Interaction context (for instance, touch interaction, joystick, large & high resolution displays, etc.)
  • Presence of observers/members of the evaluation team and their role
  • Level of expertise/experience of each member of the evaluation team (one of: novice, some experience, experienced, very experienced)
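
Software versions are easy to misreport from memory; capturing them programmatically at study time avoids this. The sketch below (Python 3.8+) is one possible way to do so; the listed packages are placeholders for the study's real dependencies.

```python
# capture_environment.py -- illustrative snapshot of the software environment used in a study.
# The packages listed are placeholders; record whatever the study software actually depends on.
import json
import platform
import sys
from importlib import metadata

PACKAGES = ["numpy", "pandas"]  # replace with the study's real dependencies

def snapshot():
    packages = {}
    for pkg in PACKAGES:
        try:
            packages[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            packages[pkg] = "not installed"
    return {
        "python": sys.version,
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```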

Procedure

  • Provide a timeline of all evaluation phases
  • Motivate your choice in cases where several options are available (for instance, the selection of questionnaires)
  • Implementation details of think-aloud protocols or similar

Analysis of Collected Data

Anonymized raw data

  • Measurements of dependent variables (see the example record layout after this list)
  • Overall time & time per task for each participant, against expected times (average, minimum and maximum length of each session, whether one-off or longitudinal)
  • Answers to standardized or custom questionnaires
  • Observer notes, if an observer was present
  • Results per group (in cases where the study participants have different backgrounds)
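
A flat, anonymised table with one row per participant and task keeps measurements, per-task times and group membership reviewable in one place. The layout below is illustrative; the column names, identifiers and values are made up.

```python
# raw_data_format.py -- illustrative layout for anonymised per-task measurements.
# Column names and values are examples; adapt them to the study's dependent variables.
import csv

FIELDS = ["participant_id", "group", "condition", "task",
          "success", "time_seconds", "expected_time_seconds"]

rows = [
    {"participant_id": "P01", "group": "domain-expert", "condition": "tool-A",
     "task": "T1", "success": 1, "time_seconds": 142, "expected_time_seconds": 120},
    {"participant_id": "P01", "group": "domain-expert", "condition": "tool-B",
     "task": "T2", "success": 0, "time_seconds": 311, "expected_time_seconds": 240},
]

with open("raw_measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```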

Analysis

  • Motivate the data analysis method and statistical tests (see the example analysis script after this list)
  • Relevant scripts, libraries & other resources for analysis and generating the figures
  • Potential biases and threats to validity
  • Data from pilot studies, together with the key changes made before the final study and explanations for these changes
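
As an example of shipping analysis scripts alongside the data: a within-subject comparison of task times between two conditions could be tested with a non-parametric paired test. The sketch below assumes the hypothetical CSV layout and condition names from the raw-data example above and uses a Wilcoxon signed-rank test; it illustrates the kind of script expected, not the test that must be used.

```python
# analysis_example.py -- illustrative paired test on the anonymised raw data.
# Assumes the CSV layout of the raw-data sketch above, with two hypothetical
# conditions "tool-A" and "tool-B"; the choice of test must still be motivated
# for the actual data (distribution, sample size, corrections, ...).
import pandas as pd
from scipy.stats import wilcoxon

df = pd.read_csv("raw_measurements.csv")

# One row per (participant, task), with the task time under each condition.
paired = (df.pivot_table(index=["participant_id", "task"],
                         columns="condition",
                         values="time_seconds")
            .dropna())

# Paired, non-parametric comparison of task times between the two conditions.
stat, p_value = wilcoxon(paired["tool-A"], paired["tool-B"])
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p_value:.4f}")
```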

Acknowledgements: These guidelines are inspired by Valentina’s experience of participating in the organization of the VOILA! Workshop series, as well as co-authoring the paper A Framework to Conduct and Report on Empirical User Studies in Semantic Web Contexts. We thank Aba-Sah Dadzie, Catia Pesquita and Patrick Lambrix for sharing their experience, feedback and suggestions while refining this document.