ACM Web Conference 2025: Marcelo Mendoza presents AI method that improves information extraction in news – Millennium Institute Foundational Research on Data

In Sydney, Australia, Marcelo Mendoza, an academic at DCC UC, principal researcher at CENIA, and associate researcher at IMFD, presented two new techniques that allow language models such as GPT-4o, Claude, and Gemini to extract key data from news articles and to formulate and answer questions as a human reader would. The study was developed in collaboration with Hans Löbel, adjunct professor at DCC – Transporte UC and CENIA researcher; Brian Keith of the Catholic University of the North; and doctoral student Carlos Muñoz.

In journalism, extracting key information from news articles, organized around the questions "Who," "What," "When," "Where," "Why," and "How" (5W1H), has been a fundamental strategy for enhancing search systems. With the rise of large language models (LLMs) such as GPT (OpenAI), Gemini (Google), and Claude (Anthropic), there has been renewed interest in their potential to perform information extraction tasks more effectively.

Marcelo Mendoza presented the research paper "Imitating Human Reasoning to Extract 5W1H in News" at The ACM Web Conference 2025, held from April 28 to May 2 in Sydney, Australia. The paper proposes an approach to improve the automatic extraction of 5W1H information from news text using language models, focusing particularly on their ability to imitate human reasoning.

The research introduces two new Chain-of-Thought (CoT) techniques that lead AI models to imitate human reasoning when performing complex tasks. The first, extractive reasoning, directs the language model (LLM) to identify and highlight relevant details directly in the text. The second, question-level reasoning, guides the model to formulate and answer questions as a human reader would.
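As a rough illustration only, the two ideas could be expressed as prompt templates along the following lines. This is a minimal sketch, not the prompts used in the paper: the wording of every instruction, the names QUESTIONS, build_extractive_prompt, build_question_level_prompt, and parse_answers, and the 'LABEL: answer' output format are all assumptions made for this example. No model is called here; the prompt text would be sent to whichever LLM is being evaluated.

```python
# Hypothetical 5W1H question wording; the paper's actual phrasing is not given in this article.
QUESTIONS = {
    "who": "Who is involved?",
    "what": "What happened?",
    "when": "When did it happen?",
    "where": "Where did it happen?",
    "why": "Why did it happen?",
    "how": "How did it happen?",
}


def build_extractive_prompt(article: str) -> str:
    """Extractive reasoning (assumed form): quote relevant spans first, then answer."""
    labels = ", ".join(key.upper() for key in QUESTIONS)
    steps = [
        "Step 1: Quote, word for word, the spans of the article that are "
        "relevant to each 5W1H element.",
        "Step 2: Using only those quoted spans, give the final answer for each element.",
        "Finish with one line per element, formatted as 'LABEL: answer', "
        f"using the labels {labels}.",
    ]
    return f"Article:\n{article}\n\n" + "\n".join(steps)


def build_question_level_prompt(article: str) -> str:
    """Question-level reasoning (assumed form): ask and answer one question at a time."""
    lines = [
        f"Article:\n{article}\n",
        "Read the article as a human reader would, asking and answering "
        "one question at a time:",
    ]
    for key, question in QUESTIONS.items():
        lines.append(
            f"- {question} Reason briefly, then answer on a line "
            f"formatted as '{key.upper()}: answer'."
        )
    return "\n".join(lines)


def parse_answers(response: str) -> dict:
    """Collect 'LABEL: answer' lines from a model response; other lines are ignored."""
    answers = {}
    for line in response.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in QUESTIONS:
            answers[key] = value.strip()
    return answers
```

In this sketch the two builders differ only in where the intermediate reasoning is anchored: the first asks the model to ground its answers in quoted text, while the second walks through the six questions in turn. A single parser can then read the final answers from either style of response.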

Experiments conducted with state-of-the-art LLMs showed that the proposed CoT techniques significantly outperform traditional extraction methods.

In a statement published on the National Artificial Intelligence Center (CENIA) website, Mendoza said: "The results of this study have the potential to transform the way automated systems process news, facilitating more accurate searches and better organization of information on the web."

Source: https://dcc.ing.uc.cl/acm-web-conference-2025-marcelo-mendoza-presenta-metodo-de-ia-que-mejora-extraccion-de-informacion-en-noticias/