When we talk about “usability”, we usually have in mind the ease of using user interfaces, ranging from desktop to mobile applications. In software development, however, it is also worth asking whether programming languages themselves are usable enough to let developers work efficiently and effectively.
Can we really measure the usability of programming languages? And if so, how?
The short answer is yes, but evaluating a programming language from a usability point of view is much more complex than evaluating a classical user interface. Several aspects need to be considered:
The test users need to be familiar with the language first, especially if it is a new language. This can be time-consuming depending on the complexity of the language and thus makes it more difficult to find test subjects for the evaluation.
We faced exactly these challenges when we set out to evaluate the VistraQ language – a query and analysis language for efficiently recording and using traceability information.
We decided to compare VistraQ with other similar query languages, because a comparison would give us more reliable and complete results. We chose Cypher and SPARQL as the two comparison languages, since both are established query languages that are conceptually close to VistraQ.
The question now is: what do we do with these three languages? How can we say that one of them is more usable than another? In general, following the classic definition of usability, three aspects should be taken into account: effectiveness, efficiency and satisfaction.
But how can we measure that for a programming language?
A common method for measuring the parameters above is to conduct a usability test with real users.
Since classic usability tests usually take a lot of time and require considerable resources, we agreed to run our evaluation as an online questionnaire. This saves us a significant amount of time, simplifies our process and helps us reach as many participants as possible.
For the analysis of the survey results, in our opinion the most important criterion is the time a user spends completing the questionnaire, as well as the time needed to answer each individual question. These measurements can serve as an indicator of how quickly users learn the language (learnability).
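To give a concrete idea of how such an analysis could look, here is a minimal sketch in Python. It assumes a hypothetical CSV export of the survey tool with columns named language, question_id and seconds_spent – the actual export format of our tool may differ.

```python
import csv
from collections import defaultdict
from statistics import mean

# Hypothetical CSV export of the survey tool: one row per answered question,
# with columns "language", "question_id" and "seconds_spent" (assumed names).
def mean_time_per_language(path):
    times = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            times[row["language"]].append(float(row["seconds_spent"]))
    # A lower average answering time is read as a hint towards better learnability.
    return {lang: mean(values) for lang, values in times.items()}

print(mean_time_per_language("survey_times.csv"))
```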
Moreover, a crucial aspect of the analysis is the definition of correct and wrong answers. Questions like “What counts as an error?”, “Is a syntax error as severe as a semantic error?” or “If the correct answer is ‘LINKED TO’ and the user writes ‘Linked to’, how correct or wrong is that?” have to be examined and analysed in depth. Together with the time measurements, the analysis of correct and wrong answers is an indicator of the understandability of the language.
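As an illustration of how such decisions could be operationalised, the following Python sketch scores a given answer against the expected query. The normalisation rules and the partial-credit weights are purely illustrative assumptions, not the final scoring scheme of our evaluation.

```python
import re

def normalize(query: str) -> str:
    # Collapse whitespace and ignore keyword casing, so that "Linked to"
    # and "LINKED TO" count as the same answer.
    return re.sub(r"\s+", " ", query).strip().upper()

def score_answer(given: str, expected: str) -> float:
    # Illustrative rubric (the weights are assumptions): full credit for a
    # match after normalisation, partial credit if only spacing differs,
    # no credit otherwise.
    if normalize(given) == normalize(expected):
        return 1.0
    if normalize(given).replace(" ", "") == normalize(expected).replace(" ", ""):
        return 0.5
    return 0.0

print(score_answer("Linked to", "LINKED TO"))  # -> 1.0
```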
Through these measurements, we want to answer the following questions:
To build our survey, we created different types of questions, based on collected user requirements and needs, to address the points mentioned above. The different types add variety to the survey: users do not always face the same question pattern, which reduces the risk of getting random answers after a while. They also yield a variety of results, which helps us measure different evaluation criteria, e.g. types of errors, understandability and time. The question types we use in our questionnaire are:
Our online survey also includes the System Usability Scale (SUS), in which the participants subjectively assess their experience with each of the three languages. SUS is a cheap and quick method of gathering valid statistical data about user satisfaction, which is why we use it to evaluate this part of the experience.
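For reference, the SUS score is computed from ten answers on a 1-to-5 Likert scale: positively worded (odd-numbered) items contribute their value minus one, negatively worded (even-numbered) items contribute five minus their value, and the sum is multiplied by 2.5, yielding a score between 0 and 100. A small Python helper illustrating this standard calculation:

```python
def sus_score(responses):
    """Compute the System Usability Scale score (0-100) from ten answers
    on a 1-5 Likert scale, given in the standard SUS item order."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd-numbered (positively worded) items contribute r - 1,
        # even-numbered (negatively worded) items contribute 5 - r.
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Example: a fairly positive rating pattern
print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))  # -> 80.0
```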
What we have learned (and are still learning) from the whole process is that evaluating the usability of a programming language (or, in our case, a query language) is a challenging task that differs from the conventional usability evaluation of digital user interfaces.
First of all, difficulties arise because of the complex nature of programming languages and the lack of established usability evaluation methods for them.
It also requires a lot of effort to decide which elements of the language should be tested and to examine whether these are actually representative of the language as a whole. Here, the support of a technical expert is helpful for making such decisions.
Such support also helps when defining the correct answers in the questionnaire: as usability engineers with standard usability expertise, it is hard for us to work out and test the correct query for each question – this takes a lot of time, but it also deepens our technical knowledge of the query languages.
The work is still in progress, and we are excited to see the results of our evaluation. We hope to come to meaningful conclusions about the usability of query languages and the potential to improve them with regard to their “ease of use”.
Stay tuned!