A new study asks a deceptively simple question: when you ask a large language model to suggest references for a paper, does it cite like a scientist?
The authors probe this by prompting GPT-4o to generate reference lists and then comparing those suggestions to the real citations of 10,000 papers sampled from SciSciNet, a large open data lake for the science of science (Lin et al., 2023). The work appears as a preprint (Algaba et al., 2025) and is accompanied by open data and code.
The headline result is that the GPT-4o suggestions that correspond to real papers are strongly skewed toward already highly cited work, reinforcing the Matthew effect. Yet the model also exhibits human-like topical alignment and even reproduces structural features of citation networks. That duality matters as LLMs move from drafting assistants to embedded agents in research search and synthesis tools.
Key Takeaways
- Across 274,951 AI-suggested references, GPT-4o consistently favors highly cited papers among those that match real records, reinforcing the Matthew effect (Figure 3).
- Existence rates (the share of generated references that truly exist) are roughly 40–50% overall and vary by field, with higher rates in the humanities and social sciences (Figure 3a, Figure A3).
- Generated references tend to be more recent, carry shorter titles, and list fewer authors than ground truth citations, with the title-brevity and small-team preferences strongest among the existing subset (Figure 4).
- Content relevance is strong: cosine similarity between AI-suggested references and focal paper abstracts is on par with real references and far above random baselines (Figure 5; Figure A8).
- Local citation graphs built from existing AI-suggested references mirror human citation network structure and differ clearly from random baselines (Figure 6; Figures A10–A11).
- Self-citation is lower among LLM suggestions than in ground truth references, hinting at a different bias profile (Figure A9).
 
Figure 1: Overview of our experiment comparing the characteristics of human citations and LLM-generated references, when tasked to suggest references based on the title, authors, year, venue, and abstract of a paper. We sample 10,000 focal papers from all SciSciNet [58] papers which are published in Q1 journals between 1999 and 2021, have between 3 and 54 references, and have at least one citation (n=17,538,900). We prompt GPT-4o to generate suggestions of references based on the title, authors, year, venue, and abstract of a focal paper, where the number of requested generated references corresponds to the ground truth number of references made in the focal paper, which amounts to a total of 274,951 references. We verify the existence of the generated references via the SciSciNet [58] database and compare the characteristics, such as title length, publication year, venue, number of authors, and semantic embeddings, of the existing and non-existent generated references with the ground truth. For the existing generated references, we also compare additional characteristics, such as the number of citations and references, and analyze the properties of their citation networks. Credit: Algaba et al.
How The Study Works
The authors assemble 10,000 focal papers from SciSciNet, restricted to Q1 journals between 1999 and 2021, each with 3–54 references and at least one citation. For each focal paper, GPT-4o receives only the title, authors, year, venue, and abstract and is asked to propose the same number of references as the paper’s actual reference list.
This process yielded 274,951 suggested references. The team then checked whether each suggestion exists in SciSciNet using Elasticsearch retrieval plus strict fuzzy matching on titles and authors, classifying matches as “existing” and the rest as “non-existent” (details in the paper's appendix).
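To make the matching step concrete, here is a minimal sketch of an existence check of this kind, using fuzzy matching on titles and authors. It assumes candidate records have already been retrieved (for example, from an Elasticsearch index); the `rapidfuzz` scorers and the thresholds are illustrative choices, not the paper's exact matching rules.

```python
# Minimal sketch of an existence check: fuzzy-match a generated reference
# against candidate records retrieved from a bibliographic index.
# Thresholds and scorers are illustrative assumptions, not the paper's rules.
from rapidfuzz import fuzz


def is_existing(generated: dict, candidates: list[dict], threshold: float = 90.0) -> bool:
    """Return True if any candidate record matches the generated reference."""
    for cand in candidates:
        title_score = fuzz.token_sort_ratio(generated["title"].lower(), cand["title"].lower())
        author_score = fuzz.token_set_ratio(
            " ".join(generated["authors"]).lower(), " ".join(cand["authors"]).lower()
        )
        # Require strong agreement on the title and reasonable agreement on authors.
        if title_score >= threshold and author_score >= 70.0:
            return True
    return False


# Example: one suggested reference checked against two index hits.
suggested = {"title": "Attention Is All You Need", "authors": ["Vaswani", "Shazeer"]}
hits = [
    {"title": "Attention is all you need", "authors": ["A. Vaswani", "N. Shazeer"]},
    {"title": "Neural Machine Translation", "authors": ["D. Bahdanau"]},
]
print(is_existing(suggested, hits))  # True
```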
Crucially, the analysis does not stop at existence. The authors compare the bibliometric and textual characteristics of four groups:
- Ground truth references
- All generated references
- Existing generated references
- Non-existent generated references
 
They examine publication year, number of authors, and title length; they analyze journal patterns; and they compute semantic alignment via cosine similarity between textual embeddings of focal paper abstracts and reference titles. They also construct local citation graphs for each focal paper to assess whether structural patterns produced by AI suggestions resemble those of human-curated references.
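As a concrete illustration of the semantic-alignment step, the sketch below embeds a focal paper's title and abstract alongside a reference title and computes their cosine similarity. It uses the OpenAI text-embedding-3-large model named in the Figure 5 caption, but batching, caching, and error handling are omitted; it is a simplified stand-in for the authors' pipeline, not a reproduction of it.

```python
# Sketch of the semantic-alignment measurement: embed the focal paper's
# title+abstract and a reference title, then take the cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in resp.data])


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


focal = "Title and abstract of the focal paper ..."
ref_title = "Title of a suggested reference ..."
focal_vec, ref_vec = embed([focal, ref_title])
print(f"cosine similarity: {cosine(focal_vec, ref_vec):.3f}")
```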
By design, the setup isolates the model’s parametric knowledge, without external search or retrieval. That choice makes the results a clean probe of what the model has internalized about scientific literature and citation practice, while acknowledging that interactive, retrieval-augmented workflows could look different in practice (Lewis et al., 2020).
Why This Matters
LLMs are rapidly entering research workflows, including literature triage, systematic review screening, hypothesis ideation, and even autonomous discovery assistants. If models preferentially surface papers that are already highly cited, recent, short-titled, and published in high-profile venues, they could amplify existing inequities in attention and visibility.
The study shows these biases are not random. They are stable across years and many fields, even as existence rates vary by domain. Tool designers should account for these tendencies and consider countermeasures (for example, calibrated diversification or explicit long-tail exploration) when using LLMs for citation recommendation.
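As one hypothetical illustration (not taken from the paper), a citation-recommendation tool could re-rank candidates by relevance while applying a log-scaled citation penalty, so that long-tail papers are not crowded out by blockbusters. The scoring function and the `lam` weight below are assumptions made for the sketch.

```python
# Illustrative countermeasure: re-rank candidate references by relevance
# minus a log-scaled citation penalty, giving the long tail a chance to surface.
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    title: str
    relevance: float   # e.g., cosine similarity to the focal abstract, in [0, 1]
    citations: int


def rerank(candidates: list[Candidate], lam: float = 0.15) -> list[Candidate]:
    """Sort by relevance minus a normalized, log-scaled citation penalty."""
    def score(c: Candidate) -> float:
        return c.relevance - lam * math.log1p(c.citations) / math.log1p(100_000)
    return sorted(candidates, key=score, reverse=True)


pool = [
    Candidate("Blockbuster review, 40k citations", 0.82, 40_000),
    Candidate("Niche but on-topic paper, 35 citations", 0.80, 35),
]
for c in rerank(pool):
    print(c.title)  # the niche paper now ranks first
```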
At the same time, the strong topical alignment and network realism are promising. The model’s suggestions are semantically appropriate on average and produce citation graphs that look like real ones, not random shuffles.
That suggests there is usable signal in LLM-only reference generation that, if coupled with retrieval and verification, could help researchers surface relevant but not-yet-cited connections. The lower self-citation rate among LLM suggestions is also notable: it may reduce certain human biases, even while reinforcing others.
What The Figures Show

Figure 2: Descriptive statistics of the focal paper sample. This figure summarizes key characteristics of the focal paper sample (n=10,000) across fields. a, The distribution of focal papers by field highlights strong representation in the exact sciences (biology, chemistry, computer science, environmental science, engineering, geography, geology, materials science, mathematics, medicine, and physics), with comparatively fewer papers in the humanities (art, history, and philosophy) and the social sciences (business, economics, political science, psychology, and sociology) (Appendix Table A1). b, The temporal trend in the number of focal papers exhibits linear growth from 1999 until 2021, which aligns with the full SciSciNet [58] database for this period. c,d, Both the median number of references cited per focal paper and the median team size are increasing over time. This pattern is clearer in the fields with a larger number of focal papers (e.g., biology, chemistry, and medicine). The color intensity represents the magnitude of the values: darker shades indicate higher numbers, while lighter shades represent lower values. Hatched cells indicate no data available for a given year and field. Credit: Algaba et al.
Figure 2 characterizes the 10,000 focal papers: predominance of exact sciences (medicine, biology, chemistry), linear growth in sampled papers over 1999–2021 consistent with SciSciNet coverage, and rising median team sizes and reference counts over time. These set the stage and help interpret field-level differences downstream.
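A minimal sketch of how such descriptive statistics can be tabulated is shown below; the file name and column names are assumptions about the data layout, not the released SciSciNet schema.

```python
# Sketch (assumed columns) of the Figure 2 summaries: focal papers per field,
# plus median reference count and median team size by field and year.
import pandas as pd

focal = pd.read_parquet("focal_papers.parquet")  # hypothetical file layout

by_field = focal["field"].value_counts()
medians = (
    focal.groupby(["field", "year"])[["n_references", "n_authors"]]
    .median()
    .rename(columns={"n_references": "median_refs", "n_authors": "median_team_size"})
)
print(by_field.head())
print(medians.head())
```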

Figure 3: Existing generated references reinforce the Matthew effect in citations. This figure displays the existence rate of generated references (gray, n=274,497) and the citation characteristics of the ground truth (blue, n=274,951) and existing generated (orange, n=116,939) references across fields and time. Error bars and shaded bands represent 95% confidence intervals. a, The existence rate of generated references by field of the focal paper shows significantly lower values in the exact sciences compared to the humanities and the social sciences. b, Median citation counts reveal that existing generated references tend to have higher citation counts across all fields, suggesting a preference toward already highly cited works. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms that the existing generated references have a statistically significantly higher median citation count for all fields (history, p=0.003; philosophy, p=0.022; all other fields, p<0.001). c, The median reference counts tend to be more similar for many fields, with only political science showing existing generated references to have a lower median reference count. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level shows that the existing generated references have a statistically significantly higher median reference count for biology (p<0.001), chemistry (p<0.001), environmental science (p<0.001), geography (p=0.007), materials science (p<0.001), mathematics (p<0.001), medicine (p<0.001), psychology (p<0.001), and sociology (p=0.002). All other fields show no statistically significant difference (p>0.05). d, Temporal trends at the focal paper level show that existing generated references consistently exhibit higher median citation counts compared to ground truth references, further emphasizing the reinforcement of the Matthew effect in citations. e, The overall existence rate of generated references remains consistent across the publication year of the focal paper, fluctuating between 40% and 50%. Credit: Algaba et al.
Figure 3 is central. Panel 3a shows field-level existence rates for AI suggestions: lower in exact sciences than in humanities and social sciences, but fairly stable over focal paper years (panel 3e). The existence-rate variation persists under subsampling controls (Figure A1) and declines for very recent references (Figure A3a). Panels 3b and 3d show that existing AI-suggested references have higher median citation counts than ground truth across fields and over time, a robust Matthew effect signal. The gap remains even when matching the number of top-cited ground truth references per focal paper (Figure A4) and across normalized and time-windowed citation metrics (Figure A5). Panel 3c shows reference counts are similar, so the effect is not due to longer bibliographies.
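The caption's paired test can be reproduced in outline as follows: aggregate citation counts to a per-focal-paper median for each group, then run a two-sided Wilcoxon signed-rank test on the paired values. The column names and the synthetic data below are illustrative assumptions, not the paper's released files.

```python
# Sketch of the paired test behind panels 3b-3d: per-focal-paper median
# citation counts for ground-truth vs existing generated references,
# compared with a two-sided Wilcoxon signed-rank test.
import numpy as np
import pandas as pd
from scipy.stats import wilcoxon


def paired_wilcoxon(refs: pd.DataFrame) -> float:
    """refs has columns: focal_id, group ('truth' | 'existing'), citations."""
    medians = (
        refs.groupby(["focal_id", "group"])["citations"]
        .median()
        .unstack("group")
        .dropna()
    )
    stat, p_value = wilcoxon(medians["existing"], medians["truth"], alternative="two-sided")
    return p_value


# Tiny synthetic example with an upward shift for the 'existing' group.
rng = np.random.default_rng(0)
rows = []
for focal in range(200):
    for _ in range(5):
        rows.append((focal, "truth", rng.poisson(50)))
        rows.append((focal, "existing", rng.poisson(80)))
df = pd.DataFrame(rows, columns=["focal_id", "group", "citations"])
print(f"p-value: {paired_wilcoxon(df):.2e}")
```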

Figure 4: Generated references exhibit a systematic preference for more recent references with shorter titles and fewer authors. This figure summarizes key characteristics for ground truth (blue, n=274,951), generated (green, n=274,497), existing generated (orange, n=116,939), and non-existing generated (red, n=157,558) references. a, The relative frequency of publication years within each reference group, with median publication years indicated by vertical lines, shows that generated references are generally more recent than the ground truth. This recency bias is driven by non-existent generated references, which disproportionately cite more recent publications. Existing generated references show a more complex pattern, tempering the overall recency bias in the generated references. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in median publication year between ground truth and generated references (p<0.001). b, The distribution of the number of authors shows that generated references tend to favor documents with fewer authors, with a peak around 2–3 authors (1–3 for existing generated references), compared to 2–6 authors for ground truth references. A small proportion of generated references are labeled as “et al.” (3%), with higher rates in non-existent (4%) than existing generated references (1.5%). The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in the median number of authors between ground truth and generated references (p<0.001). c, The distribution of the title length shows that generated references tend to favor documents with shorter titles. This effect is most pronounced for the existing generated references. The pairwise two-sided Wilcoxon signed-rank test at the focal paper level confirms the statistically significant difference in the median title length between ground truth and generated references (p<0.001). d, The journal rankings show the top 10 journals across different reference groups. The size of each dot represents how relatively frequently that journal appears within its reference group. Journals are connected by solid lines when appearing in all three groups’ top 10, dotted lines when appearing in two groups’ top 10, and shown in italic font when appearing in only one group’s top 10, highlighting the distinct citation patterns across reference types. Credit: Algaba et al.
Figure 4 documents systematic preferences in generated references. Panel 4a shows a recency skew, especially pronounced in non-existent suggestions. Panel 4b shows fewer authors: AI suggestions concentrate on 1–3 authors, while human references spread over larger teams. Panel 4c shows shorter titles, most strongly among existing suggestions. Panel 4d highlights venue patterns: strong concentration in top multidisciplinary and field-leading journals, with some notable differences from ground truth.

Figure 5: Generated references exhibit a level of cosine similarity to focal paper titles and abstracts on par with ground truth references, surpassing that of a random ground truth reference list from the same field. This figure displays the distributions of the pairwise cosine similarity between OpenAI text-embedding-3-large vector embeddings (size=3,072) of the titles of the ground truth (blue, n=274,951), generated (green, n=274,497), existing generated (orange, n=116,939), and non-existing generated (red, n=157,558) references with the title and abstract of their corresponding focal paper (n=10,000). As a benchmark, we also compute for each focal paper the pairwise cosine similarity with the references from a random ground truth reference list from the same field (gray, n=274,951). Credit: Algaba et al.
Figure 5 turns to content relevance. Cosine similarity distributions between focal abstracts and references show AI-suggested lists are as aligned as ground truth and far above random baselines. This holds across embedding choices and input scopes (Figure A8).

Figure 6: The citation graphs of the generated references are structurally similar to the citation graphs of the ground truth references, deviating significantly from random baselines. This figure shows that the local citation graphs corresponding to the references generated by GPT-4o exhibit structural similarities with the local citation graphs corresponding to the ground truth references across several graph-based measures; see the appendix for detailed definitions of all the involved notions. Notably, these similarities cannot be found in local citation graphs corresponding to randomly reshuffled ground truth reference lists. For each of the 10,000 focal papers, three local citation graphs were constructed. The graphs were based on the ground truth references of the focal paper, the GPT-4o generated references that exist (adding necessary edges to maintain connectivity for references not originally cited by the focal paper), and a random baseline created by reshuffling the ground truth references (while preserving the field of study of the focal paper). Since approximately 50% of the GPT-4o-generated references exist (see Fig. 3a), the initial GPT-4o-based graphs had fewer nodes. To ensure a fair comparison, a random subset of nodes was removed from both the ground truth and random graphs, yielding three graphs of equal size per focal paper. All of these graphs were then converted to undirected graphs for analytical simplicity. A detailed description of the pipeline can be found in Appendix Figure A10. a, The resulting node counts are identical for the GPT-4o (blue), ground truth (red), and random (gray) graphs at the focal paper level. b, The number of edges in GPT-4o-generated graphs closely aligns with human citation patterns, deviating significantly from the random baseline. c, GPT-4o-based citation graphs also mirror the distribution of average shortest path length in human citation networks, whereas the random graphs deviate substantially. d–h, The analogous observation holds for the mean degree centrality, density, average clustering coefficient, average closeness centrality, and standard deviation of the eigenvector centrality, highlighting GPT-4o’s strong internalization of human citation patterns and second-order graph structures. Credit: Algaba et al.
Figure 6 moves beyond first-order statistics to network structure. Local citation graphs formed from existing AI-suggested references mirror ground truth across average shortest path length, degree centrality, density, clustering, closeness, and eigenvector centrality dispersion, and deviate clearly from random shuffles. Appendix Figure A10 details the graph-construction pipeline; Figure A11 shows AI-suggested nodes connect meaningfully to ground truth nodes rather than forming isolated clusters.
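For readers who want to probe these comparisons themselves, the sketch below computes the caption's graph-level statistics for a single local citation graph with networkx. It covers the metrics in panels b–h but not the paper's node-matching and subsampling pipeline, and the toy edge list is purely illustrative.

```python
# Sketch of the Figure 6 comparison: build an undirected local citation graph
# and compute the summary statistics reported in panels b-h.
import networkx as nx
import numpy as np


def citation_graph_stats(edges: list[tuple[str, str]]) -> dict[str, float]:
    G = nx.Graph()
    G.add_edges_from(edges)
    eig = nx.eigenvector_centrality_numpy(G)
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "avg_shortest_path": nx.average_shortest_path_length(G),  # requires a connected graph
        "mean_degree_centrality": float(np.mean(list(nx.degree_centrality(G).values()))),
        "density": nx.density(G),
        "avg_clustering": nx.average_clustering(G),
        "avg_closeness": float(np.mean(list(nx.closeness_centrality(G).values()))),
        "std_eigenvector_centrality": float(np.std(list(eig.values()))),
    }


# Toy local graph: focal paper F citing three references that also cite each other.
edges = [("F", "A"), ("F", "B"), ("F", "C"), ("A", "B"), ("B", "C")]
print(citation_graph_stats(edges))
```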
Interpretation And Limitations
The reinforcement of the Matthew effect likely reflects two forces. First, the model’s training data skews toward widely read, frequently cited works. Second, high-impact papers tend to be broadly relevant and semantically central, nudging the model toward them when optimizing for topical fit.
The observed preferences for shorter titles and smaller teams may stem from learnability and memorability constraints in parametric models, as the authors note. Field differences in existence rates plausibly arise from discipline-specific publication formats, time-to-indexing, and varying coverage in bibliometric sources.
The study is careful to isolate parametric knowledge, which is a strength for measurement and a limitation for deployment realism. In real tools, retrieval, metadata verification, and re-ranking are common.
Those steps will likely raise existence rates and could mitigate some biases. Still, the core results here should inform system design: model-internal priors can amplify attention to already prominent work unless explicitly counterbalanced. The positive result on network realism suggests that combining model priors with retrieval may surface non-obvious but legitimate connections.
Reproducibility And Resources
The paper’s code repository is public, and the accompanying dataset builds on SciSciNet. The authors retrieved abstracts using the CrossRef and Semantic Scholar APIs and validated existence via Elasticsearch and fuzzy matching. For readers and tool builders, these resources are good anchors for replication and extension.
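A minimal sketch of metadata retrieval against those public APIs might look like the following; rate limits, retries, and API keys are omitted, and abstract availability varies by record (especially on CrossRef).

```python
# Sketch of title-based lookups against the public CrossRef and
# Semantic Scholar APIs mentioned above.
import requests


def crossref_lookup(title: str) -> dict | None:
    """Return the top CrossRef work record for a title query, if any."""
    r = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=30,
    )
    items = r.json().get("message", {}).get("items", [])
    return items[0] if items else None


def semantic_scholar_abstract(title: str) -> str | None:
    """Return the abstract of the top Semantic Scholar search hit, if available."""
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "fields": "title,abstract", "limit": 1},
        timeout=30,
    )
    papers = r.json().get("data", [])
    return papers[0].get("abstract") if papers else None


print(semantic_scholar_abstract(
    "SciSciNet: A large-scale open data lake for the science of science research"
))
```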
Primary sources:
(1) The paper: (Algaba et al., 2025)
(2) Code repository: (Algaba et al., 2025)
(3) SciSciNet dataset DOI: (Lin et al., 2023)
(4) CrossRef: (CrossRef, 2024)
(5) Semantic Scholar Open Data Platform: (Kinney et al., 2023)
