
Natural Language Processing

Natural language processing (NLP) is ubiquitous in today’s world. As a field, NLP lies at the intersection of computer science, artificial intelligence, and linguistics, with a focus on enabling machines to understand, generate, and interact with human language. NLP techniques range from rule-based models and statistical methods to, more recently, deep learning approaches such as large language models (LLMs), and they underpin most modern language technologies: translators, search engines, and the virtual assistants powering our smart home devices, code editors, and more. NLP research in our section is headed by Akhil Arora and his group CLAN (AI Research on Language and Networks). CLAN focuses on making language technologies more human-centered, robust, trustworthy, accessible, and impactful in real-world settings. Their work falls broadly into the following five directions.

  1. Knowledge-seeking Agents
    • Devise models to understand how people search for information online, in particular to simulate user browsing behavior on Wikipedia in a privacy-preserving way
    • Build LLM-based knowledge-seeking agents
  2. Identifying and Bridging Knowledge Gaps
    • Explore underrepresented and neglected Wikipedia pages, uncovering structural and editorial knowledge gaps
    • Devise generative LLMs to support the creation and maintenance of knowledge resources like Wikipedia
  3. Enhancing Reliability and Efficiency of LLM-based Reasoning
    • Devise reasoning frameworks that integrate LLMs with source retrieval and reading recommendations, designed to reduce misinformation and improve verifiability
    • Devise algorithms to improve the cost-quality trade-off of LLM reasoning frameworks
  4. Grounded and Situated Language Understanding
    • Explore how structured (e.g., knowledge graphs) and unstructured knowledge interact
    • Develop models that link language to structured knowledge, such as graphs and ontologies
    • Work on entity linking, relation extraction, and knowledge base completion
  5. Multilingual and Cross-lingual NLP
    • Identify and mitigate biases in NLP systems, especially those affecting underrepresented groups, regions, and low-resource languages
    • Run projects on cross-lingual transfer and multilingual embeddings
    • Collaborate with the Wikimedia Foundation to understand geographic and cultural biases in online knowledge

Social impact

CLAN’s NLP research has significant social and practical impact by addressing real-world challenges in how people access, share, and contribute to knowledge online. Socially, their work empowers millions of users—especially on platforms like Wikipedia—by making information more accessible, reliable, and inclusive. For example, their models help identify and bridge knowledge gaps (e.g., neglected “orphan articles”), promote fact-checked content through LLM reasoning tools, and respect user privacy through synthetic behavior modeling. These advances support information equity, especially in underserved languages and topics. Practically, CLAN’s tools and frameworks improve the design of human-in-the-loop AI systems, such as intelligent assistants that can explain sources or guide users through complex information spaces. Their collaborations with Wikipedia and focus on verifiable LLM outputs also contribute to combating misinformation—an increasingly critical concern. Furthermore, their research makes AI technologies more sustainable by improving the cost-quality trade-off of LLM-based reasoning frameworks. Overall, their research bridges the gap between state-of-the-art language models and public-good applications, making NLP not just powerful, but meaningfully useful.
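One common way to improve such a cost-quality trade-off is a model cascade: answer easy queries with a cheap model and escalate only low-confidence cases to an expensive one. The sketch below illustrates that routing pattern with stand-in models and a made-up confidence rule; it is not CLAN's actual framework.

```python
# Illustrative model cascade: a cheap model answers when confident,
# otherwise the query escalates to an expensive model.
# Both "models" and the confidence rule are placeholders.

def cheap_model(query: str) -> tuple[str, float]:
    # Stand-in: pretend the cheap model is confident only on short queries.
    confidence = 0.95 if len(query) < 20 else 0.40
    return "cheap-answer", confidence

def expensive_model(query: str) -> str:
    return "expensive-answer"

def cascade(query: str, threshold: float = 0.8) -> tuple[str, int]:
    """Return (answer, cost), with cost in notional cents:
    1 for a cheap call, 100 more if we had to escalate."""
    answer, confidence = cheap_model(query)
    if confidence >= threshold:
        return answer, 1                     # cheap call only
    return expensive_model(query), 1 + 100   # cheap call + escalation

print(cascade("short query"))                                 # handled cheaply
print(cascade("a much longer and harder query than before"))  # escalated
```

The trade-off is tuned through the confidence threshold: raising it improves answer quality at the price of more escalations, and vice versa.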

Key publications

Lars Klein, Nearchos Potamitis, R. Aydin, Robert West, Caglar Gulcehre, Akhil Arora
Fleet of Agents: Coordinated Problem Solving with Large Language Models (ICML 2025) 

Amalie Brogaard Pauli, Isabelle Augenstein, Ira Assent
Measuring and Benchmarking Large Language Models’ Capabilities to Generate Persuasive Language (NAACL 2025) 

Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West
Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia (EMNLP 2024) 

Marko Čuljak, Andreas Spitz, Robert West, Akhil Arora
Strong Heuristics for Named Entity Linking (NAACL-SRW 2022)

Akhil Arora, Alberto Garcia-Duran, Robert West
Low-Rank Subspaces for Unsupervised Entity Linking (EMNLP 2021)

Researchers

Akhil Arora

Assistant professor
CMU Advanced NLP Spring 2025: Introduction to NLP

Current Projects and Labs