Summary of "From Testing to Evaluation of NLP and LLM Systems"
This work compares academic research in evaluation with practitioners' questions on community forums.
A recent preprint, titled “From Testing to Evaluation of NLP and LLM Systems: An Analysis of Researchers and Practitioners Perspectives through Systematic Literature Review and Developers’ Community Platforms Mining”, presents a comprehensive quantitative analysis of how software systems based on large language models (LLMs) and natural language processing (NLP) are evaluated in academic research, and examines potential misalignments with practitioner needs. The study employs a rigorous dual methodology: a systematic literature review of academic papers and an analysis of practitioner discussions in online forums.
Scope
The researchers conducted their systematic literature review with careful attention to methodological rigour. They began by querying Scopus for papers whose abstract contained “test”, “software”, and at least one among a set of keywords about LLMs or NLP. The search was further restricted to computer science or engineering papers published after 2013 and before the end of January 2024 (when the query was issued). This returned a total of 782 papers, which were filtered down to 39 according to inclusion criteria detailed in the paper. Finally, the set was supplemented by a search on Google Scholar, reaching a total of 77 papers.
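The inclusion criteria described above can be sketched as a simple filter. This is an illustrative reconstruction, not the authors' actual pipeline: the record fields and keyword list are assumptions made for the example.

```python
from datetime import date

# Hypothetical keyword set standing in for the paper's LLM/NLP terms.
LLM_NLP_KEYWORDS = {"llm", "large language model", "nlp", "natural language processing"}

def matches_query(paper: dict) -> bool:
    """Mimic the Scopus query: the abstract must mention 'test', 'software',
    and at least one LLM/NLP keyword; only computer science or engineering
    papers from 2014 through January 2024 are kept."""
    abstract = paper["abstract"].lower()
    if "test" not in abstract or "software" not in abstract:
        return False
    if not any(kw in abstract for kw in LLM_NLP_KEYWORDS):
        return False
    if paper["field"] not in {"computer science", "engineering"}:
        return False
    return date(2014, 1, 1) <= paper["published"] <= date(2024, 1, 31)

papers = [
    {"abstract": "We test software built on a large language model.",
     "field": "computer science", "published": date(2023, 6, 1)},
    {"abstract": "A study of NLP evaluation benchmarks.",  # lacks "test" and "software"
     "field": "computer science", "published": date(2023, 6, 1)},
]
print([matches_query(p) for p in papers])  # → [True, False]
```

The second record is excluded despite being squarely about NLP evaluation, which illustrates the concern raised below about requiring “software” in the abstract.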
The choice of requiring abstracts to include the word “software” is peculiar: I expect many papers on NLP evaluation to omit it, which likely explains the relatively small number of papers found. The reason is probably that, as the authors mention, they originally aimed to cover software-engineering-style “testing” and only later expanded the analysis to “evaluation” as well.
Analysis Framework
Using a rigorous process, the study derived a taxonomy along five key dimensions from the selected papers and annotated each paper accordingly. First, they looked at evaluation objectives; this taxonomy is displayed below.

This was complemented by analyses of the models being evaluated, the specific tasks being performed, the datasets employed, and the evaluation methods utilized. The latter category encompassed various approaches including automatic test input generation, evaluation benchmarks, human evaluation, fine-tuned evaluators, and statistical analysis tools.
Research-Practice Gap
One of the study's most valuable contributions is its examination of how these different dimensions correlate with each other and, more importantly, how they align (or don't) with practitioner concerns. This analysis revealed several areas where academic research focus doesn't fully match practitioners' needs. For instance, they find that fairness and robustness are significantly more important to researchers than to practitioners, while the opposite holds for efficiency, syntactic correctness and factual correctness. The key findings are summarised below.
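The comparison underlying these findings can be sketched as a difference in proportions between the two sources. The counts below are made up for illustration; they are not the paper's data, and the topic labels are only a subset of the actual taxonomy.

```python
# Illustrative (fabricated) counts of papers and forum posts per quality attribute.
paper_counts = {"fairness": 12, "robustness": 15, "efficiency": 4,
                "factual correctness": 6}
post_counts = {"fairness": 3, "robustness": 5, "efficiency": 20,
               "factual correctness": 18}

def proportions(counts: dict) -> dict:
    """Normalise raw counts into each topic's share of total attention."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

research = proportions(paper_counts)
practice = proportions(post_counts)

# A positive gap means the topic receives proportionally more research
# attention than practitioner attention; negative means the reverse.
for topic in paper_counts:
    gap = research[topic] - practice[topic]
    side = "research-heavy" if gap > 0 else "practice-heavy"
    print(f"{topic}: {gap:+.2f} ({side})")
```

Comparing normalised shares rather than raw counts matters here, since the corpus of forum posts and the corpus of papers differ greatly in size.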

Limitations and Critical Analysis
While the paper makes important contributions, there are some methodological limitations worth considering. First, paper counts may be a poor proxy for research effort: they do not capture the actual impact or depth of research in different areas, and directly comparing them with the proportion of practitioner forum posts may oversimplify the relationship between research attention and practical needs. Still, I feel this is a good starting point.
Additionally, the relatively small sample size of 77 papers, especially given the large number excluded through filtering criteria, raises questions about the statistical significance of the correlations and findings. A sample restricted by strict inclusion criteria might miss important developments or trends.
Implications and Future Directions
The study effectively demonstrates the importance of maintaining an ongoing dialogue between academic researchers and practitioners. Future work might benefit from exploring alternative methodologies for measuring research impact and practitioner needs, perhaps incorporating more qualitative analysis or different metrics for research efforts.
This paper serves as a valuable compass for the field, pointing researchers toward areas that might benefit from increased attention while highlighting the importance of considering practical applications in academic research.