Your Scattered Public Data Can Be Assembled Into a Digital Twin of You — Researchers Just Proved It

Privacy advice tends to focus on the obvious: don’t post your home address, don’t share your birthday with strangers, use a pseudonym if you are worried about stalkers. What it rarely addresses is the aggregation problem — the fact that dozens of small, seemingly harmless disclosures scattered across multiple platforms can be computationally assembled into a profile far more revealing than any single one of them.

New research published by Springer Nature makes that problem concrete. Researchers built digital twin representations of real individuals by harvesting fragmented personal data from multiple social media platforms and feeding it into a pipeline of large language models, generative adversarial networks, and vision-language models. The result: detailed personal profiles constructed entirely from public data, with cross-platform identity matching that reached a 91% F1 score and 96% precision.

The paper is a technical demonstration of something the data broker industry has known and monetised for a decade. But seeing it constructed from scratch in an academic setting, with documented methods and reproducible results, offers a clearer view of what is actually possible.

What the Researchers Built

The study used a dataset of 540,830 Twitter links harvested from Linktree — a service people use to aggregate their social media profiles in one place. Linktree links are public and intentionally shareable, making the dataset a naturalistic record of which platforms real users connect to their public identities. Instagram and Twitter showed the highest cross-platform linking rates, followed by YouTube, TikTok, and LinkedIn.

From that starting dataset, the researchers constructed cross-platform profiles: for each individual with multiple linked accounts, they aggregated text posts, images, self-descriptions, interaction patterns, and biographical fragments from each platform. No platform provided complete information on its own. Each one added a few more pieces.
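The aggregation step can be pictured with a minimal Python sketch. Everything in it is hypothetical: the field names, the sample fragments, and the `merge_fragments` helper are illustrative assumptions, not the researchers' code. It only shows how fragments that each say little can be folded into one record that tracks which platform supplied what.

```python
from collections import defaultdict


def merge_fragments(fragments):
    """Combine per-platform fragments into one profile.

    Each fragment is a (platform, field, value) tuple. The merged profile
    maps each field to the set of values seen and the set of platforms
    that supplied them.
    """
    profile = defaultdict(lambda: {"values": set(), "sources": set()})
    for platform, field, value in fragments:
        profile[field]["values"].add(value)
        profile[field]["sources"].add(platform)
    return dict(profile)


# Hypothetical fragments for one person (invented for illustration).
fragments = [
    ("instagram", "city", "Leeds"),
    ("linkedin", "employer", "Example Ltd"),
    ("twitter", "interest", "cycling"),
    ("linkedin", "city", "Leeds"),
]

profile = merge_fragments(fragments)
for field in sorted(profile):
    entry = profile[field]
    print(field, sorted(entry["values"]), sorted(entry["sources"]))
```

No single platform in this toy example holds the city, the employer, and the interest together; only the merged record does.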

The assembly pipeline had several stages. Language models extracted biographical signals from text — interests, location references, relationship status hints, professional background fragments. Vision-language models analysed images for consistent visual signals — the same face at different ages, consistent environments, identifiable background features. GANs generated synthetic versions of extracted images to enable matching without retaining the originals directly.

The cross-platform image matching achieved an F1 score of 91% and precision of 96%. In practical terms, when the system declared that an Instagram account and a Twitter account belonged to the same physical person, it was right about 96% of the time, even when the accounts used different usernames, profile photos, and writing styles.
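The two reported figures also imply a third. F1 is the harmonic mean of precision and recall, so recall can be recovered from the other two. The sketch below assumes the standard definition of F1 and assumes that both reported numbers come from the same evaluation; the derived recall is not a figure quoted from the paper.

```python
def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for recall R."""
    return f1 * precision / (2 * precision - f1)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.96  # reported precision
f1 = 0.91         # reported F1 score

recall = recall_from_f1(f1, precision)
print(f"implied recall: {recall:.3f}")
```

Under those assumptions the implied recall is roughly 0.865: the system rarely declares a false match, but it misses a noticeable share of the true ones.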

What “Digital Twin” Means Here

The researchers define a digital twin in this context as a virtual representation that captures an individual’s vulnerability to cyber threats — not a real-time simulation, but a constructed model of a person derived from their public data footprint. The twin encodes patterns: where you go, what you care about, who you interact with, what you look like across different contexts.

Once constructed, such a model can answer questions about you that you have never answered anywhere directly. What is the probable neighbourhood you live in? What times are you typically away from home? What family members can be identified? What are your political or religious views? What are potential entry points for social engineering?

The paper frames the exercise as privacy risk identification: build the twin to find vulnerabilities before an adversary does. That framing is legitimate — it is how penetration testers think about personal threat models. But the same methodology is available to anyone with the technical skills to run it, not just defenders.

The Aggregation Problem at Scale

The underlying issue is one that data protection law has struggled to address adequately. Individual data points collected under consent frameworks may be low-sensitivity in isolation. Your Instagram username is not sensitive. A Linktree link is not sensitive. A LinkedIn job title is not sensitive. A geotagged photo of a restaurant is not sensitive.

The combination is different in kind from any individual element. Knowing your neighbourhood, your workplace, your appearance across ages, your family structure, your daily schedule, and your social connections is a profile that enables targeted fraud, physical surveillance, stalking, and manipulation. None of those pieces individually would have triggered a meaningful consent decision.
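A toy calculation shows why the combination behaves differently from its parts. The population below is entirely synthetic and unrealistically uniform (every combination of four coarse attributes appears exactly once); the attribute names and the `matches` helper are assumptions made for illustration. The point is only how quickly the set of candidates shrinks as attributes are combined.

```python
from itertools import product

# Synthetic population: 10 neighbourhoods x 8 employers x 5 age bands
# x 4 hobbies = 1600 people, one per combination.
neighbourhoods = [f"n{i}" for i in range(10)]
employers = [f"e{i}" for i in range(8)]
age_bands = [f"a{i}" for i in range(5)]
hobbies = [f"h{i}" for i in range(4)]

population = [
    {"neighbourhood": n, "employer": e, "age_band": a, "hobby": h}
    for n, e, a, h in product(neighbourhoods, employers, age_bands, hobbies)
]


def matches(people, **known):
    """Return everyone consistent with the attributes known so far."""
    return [p for p in people if all(p[k] == v for k, v in known.items())]


target = population[0]
known = {}
for attr in ("neighbourhood", "employer", "age_band", "hobby"):
    known[attr] = target[attr]
    print(f"after learning {attr}: {len(matches(population, **known))} candidates")
```

In this setup, each attribute alone leaves between 160 and 400 candidates, while all four together leave exactly one. Real populations are not uniform, so the actual numbers would differ, but the narrowing effect is the same.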

GDPR addresses this indirectly through the concept of “indirect identification”: data that does not identify someone on its own, but could reasonably be combined with other data to identify them, should be treated as personal data. But enforcement of that principle in the context of cross-platform aggregation has been limited. Data controllers typically argue that they only hold one platform’s slice of data and cannot be responsible for what third parties construct from combinations.

That argument is becoming harder to sustain as the technical tools for aggregation become more accessible. The pipeline used in this study runs on commodity hardware and open-source models. No specialised resources are required.

What This Means for the Concept of “Public” Data

The most consequential finding in this research is not the technical accuracy rate — it is the implication for what “public” means. When a user posts on Instagram under their own name, they make a decision about disclosure on that platform, in that context. The contextual norm is: followers of this account see this. The contextual norm is not: any actor with API access can aggregate this with eleven other public sources to construct a detailed profile.

Privacy theorists call this contextual integrity: the idea that information flows are appropriate when they match the norms of the context in which the information was shared. Mass automated aggregation across platforms violates contextual integrity systematically, regardless of whether each individual data point was technically public.

The research paper demonstrates that this violation is not hypothetical or marginal. It is achievable at scale with current technology.

For individuals, the practical implication is uncomfortable: reducing your aggregated digital footprint requires consistent action across every platform simultaneously, not just caution on any single one. Pseudonymity on one platform while using your real name on another may provide less protection than assumed if visual similarity matching can bridge the gap.

For regulators, the paper argues implicitly for an aggregation doctrine in privacy law: obligations that attach not to individual data points but to the act of combining them across contexts. Some legal frameworks gesture toward this. None have implemented it in a form that reaches the technical reality the researchers document.

Research: “Mirroring Privacy Risks with Digital Twins: When Pieces of Personal Data Suddenly Fit Together,” Springer Nature, DOI: 10.1007/s42979-024-03413-z.
