TimeStress

TimeStress is a dataset for evaluating the temporal representation of facts in large language models (LLMs). It assesses their ability to distinguish correct from incorrect factual statements that are contextualized with a date and phrased as questions, such as “In 2011, who was the president of the USA? Barack Obama”. The evaluation principle is that the probability the model assigns to the correct answer should be higher when the date is accurate than when it is not.
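This evaluation criterion can be sketched as follows. The function names and the toy scores below are illustrative, not part of the TimeStress code; a real evaluation would obtain the answer's log-probability from an actual LLM.

```python
def is_fact_known(logprob, question_correct_date, questions_wrong_dates, answer):
    """The model is judged to know the fact if the answer's log-probability
    under the correctly dated question exceeds its log-probability under
    every incorrectly dated variant."""
    correct_score = logprob(question_correct_date, answer)
    return all(correct_score > logprob(q, answer) for q in questions_wrong_dates)

# Toy scorer standing in for a real LLM (hypothetical values).
scores = {
    "In 2011, who was the president of the USA?": -1.2,
    "In 1950, who was the president of the USA?": -4.5,
}
toy_logprob = lambda question, answer: scores[question]

print(is_fact_known(
    toy_logprob,
    "In 2011, who was the president of the USA?",
    ["In 1950, who was the president of the USA?"],
    "Barack Obama",
))  # True with these toy scores
```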

TimeStress includes numerous correct and incorrect statements, with each date expressed in three different precisions. This allows for the evaluation of LLMs along two dimensions: by varying the date on the timeline and by adjusting the precision.
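The three precisions can be illustrated as follows; the exact date formats used in TimeStress may differ, and this template is only an example.

```python
from datetime import date

def precision_variants(d: date) -> list[str]:
    # Express one date at three precisions: year, year-month, and full day.
    return [f"{d.year}", d.strftime("%B %Y"), d.strftime("%B %d, %Y")]

template = "In {date}, who was the president of the USA?"
for variant in precision_variants(date(2011, 5, 1)):
    print(template.format(date=variant))
```

Varying the date itself (e.g., 2011 vs. 1950) probes knowledge along the timeline; varying the precision (e.g., "2011" vs. "May 2011") probes whether that knowledge is consistent across granularities.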

Findings highlight several limitations of LLMs, including their difficulty in achieving comprehensive knowledge of popular temporal facts and their struggle to transfer knowledge across date precisions; for instance, an LLM may recognize a fact when the question is contextualized with a year (e.g., 2020) but fail when it is framed with a specific month (e.g., March 2020).

This research has been published as a preprint on arXiv, and this repository allows reproducing its experiments.

It contains the code needed to:

  • Regenerate TimeStress from scratch using a Wikidata dump and GPT-4o
  • Collect predictions from 18 studied LLMs on TimeStress
  • Analyze the behavior of LLMs to draw conclusions about the consistency of their temporal representation of facts. The figures and tables from the article are generated in this step

The source code is available on GitHub under the MIT license.