AI really can be used to tell fortunes, scientifically?! Danish scientists used public data on 6 million people across the country to train a Transformer-based model that successfully predicted early deaths and personality traits.
Researchers at the Technical University of Denmark (DTU) claim to have designed an artificial intelligence model that can predict major events and outcomes in people's lives, including the approximate time of each person's death. The paper appeared two days ago in Nature Computational Science, a sister journal of Nature.
Author Sune Lehmann said, “We use this model to address a fundamental question: To what extent can we predict future events based on past conditions and events?”
It seems the purpose of the authors' research is nothing less: they really do want to use AI to help everyone tell their fortunes.
Researchers characterize people's life trajectories as a time-ordered sequence of life events. This representation method has structural similarities with natural language.
Using the representation learning capability of the Transformer model, the semantic space of life events can be learned and a compact vector representation of individual life sequences can be generated.
The researchers used health and labor force data of approximately 6 million people in Denmark to build a Transformer-based model “life2vec”.
The model's input data are an individual's birth time, location, education, health status, occupation, salary, and so on, while its outputs include quantities closely tied to personal life, such as “early death” and “subtle differences in personality”.
The research team predicted individual life events based on life sequences, and the model performed significantly better than other current methods.
Compared with other methods, the life2vec model also delivers better predictions of subtle differences in personality.
The researchers further pointed out in the paper that both the conceptual space and the individual representation space of the model are meaningful and interpretable and can be used to generate new hypotheses and provide the possibility for individualized intervention.
Human life may be predictable
The era of predicting human lives that we are now entering rests on two developments: the emergence of massive data sets and of powerful machine learning algorithms.
Over the past decade, machine learning has revolutionized image and text processing, as access to ever-larger data sets has made ever-more-complex models feasible.
Language processing is evolving extremely rapidly, and Transformer architectures have proven successful in capturing complex patterns in large sequences of unstructured words.
Although these models have their origins in natural language processing, their ability to capture structure in human language generalizes to other sequences that share similar properties to language.
However, for lack of large-scale data, Transformer models had not yet been applied to multi-modal socioeconomic data outside of industry.
The researchers' data set changes that. The sheer size of their dataset allowed the research team to build sequence-level representations of individual life trajectories, detailing how each person moved through time.
Researchers can observe how individuals' lives evolve over different types of events (information about a heart attack is mixed with information about a pay rise or a move from the city to the countryside).
The temporal resolution within each sequence, and the total number of sequences, are large enough that transformer-based models can be meaningfully applied to predict the outcomes of life events.
This means that representation learning can be applied to an entirely new domain to develop new understandings of the evolution and predictability of human life.
Specifically, the researchers employed a BERT-like architecture to predict two very different aspects of human life: time to death and personality nuances.
The researchers found that their model predicts these outcomes accurately; in the case of early death, it outperforms current state-of-the-art methods by ∼11%.
To make these accurate predictions, the model relies on a single common embedding space for all the events in a life trajectory.
Just as studying the embedding spaces of language models yields new understanding of human language, studying this concept embedding space can reveal non-trivial interactions between life events.
Below, the researchers provide insights into the resulting conceptual space of life events and demonstrate the robustness and interpretability of this space and the model itself.
Transformer-based models also produce embeddings for individuals (the analogy in language modeling is a vector summarizing an entire text). Using interpretability tools such as saliency maps and concept activation vectors (TCAV), the researchers show that these individual summaries are also meaningful and could serve as behavioral phenotypes that improve other individual-level prediction tasks, such as augmenting medical image analysis.
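To make the concept-activation-vector idea concrete, below is a minimal sketch of the generic technique: train a linear classifier to separate model activations for examples that express a concept from activations for random examples, and take the resulting direction as the concept vector. The arrays, dimensions, and classifier choice here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder activations of some model layer: 100 examples expressing a
# concept (e.g., a health-related attribute) versus 100 random examples.
concept_acts = np.random.rand(100, 280)
random_acts = np.random.rand(100, 280)

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)  # 1 = concept example

# The normal of the separating hyperplane serves as the concept direction.
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # unit-norm concept activation vector
```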
Model prediction results
Researchers encode rich data using a simple symbolic language.
Raw data streams of complex multi-source temporal data pose significant methodological challenges, such as irregular sampling rates, data sparsity, complex interactions between features, and a large number of dimensions.
Classic methods for time series analysis (e.g., support vector machines, ARIMA) (42, 43) become cumbersome here: they scale poorly, are inflexible, and require extensive data preprocessing to extract useful features.
Using transformer methods allows the researchers to avoid handcrafted features and instead encode the data in a way that exploits its similarities to language. Specifically, each category of discrete feature, and each discretized continuous feature, forms a vocabulary.
This vocabulary, along with an encoding of time, lets each life event (including its detailed qualifying information) be characterized as a sentence composed of concept tokens.
The researchers attach two time metrics to each event: one specifies the individual's age at the time of the event, and the other captures absolute time (see the image below).
The researchers' synthetic language could therefore capture information like: “In September 2020, Francesco received 20,000 DKK while working as a guard at a castle in Elsinore.”
Or “In her third year at boarding school, Hermione took five electives.” In this sense, a person's life process is characterized as a string of such sentences, which together constitute the individual's life sequence.
This approach allowed the researchers to encode extensive details about the events in an individual's life without sacrificing the content and structure of the original data.
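As an illustration of this synthetic language, here is a minimal sketch of how a single life event might be encoded as a “sentence” of concept tokens, prefixed with the two time metrics. The token names and event schema are invented for the example; the paper's actual vocabulary differs.

```python
from dataclasses import dataclass

@dataclass
class LifeEvent:
    age: int        # the individual's age at the time of the event
    abs_time: str   # absolute-time bucket, e.g. "2020-09"
    tokens: list    # concept tokens qualifying the event

def encode_event(event: LifeEvent) -> list:
    """Turn one life event into a 'sentence' of concept tokens, prefixed
    with the two time metrics attached to every event."""
    return [f"AGE_{event.age}", f"TIME_{event.abs_time}"] + event.tokens

# "In September 2020, Francesco received 20,000 DKK while working as a
# guard at a castle in Elsinore."
event = LifeEvent(age=38, abs_time="2020-09",
                  tokens=["INCOME_20K_DKK", "OCCUPATION_GUARD", "CITY_ELSINORE"])
print(encode_event(event))
# ['AGE_38', 'TIME_2020-09', 'INCOME_20K_DKK', 'OCCUPATION_GUARD', 'CITY_ELSINORE']
```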
life2vec model
The researchers use transformer models to form compact representations of individual lives; they call their deep learning model life2vec.
life2vec is based on the transformer architecture, which is well suited to characterizing life sequences thanks to its ability to compress contextual information while taking temporal and positional information into account.
The training of life2vec is divided into two stages. First, the model is pre-trained with two tasks:
(1) a masked language model (MLM) task, which forces the model to use token representations and contextual information;
(2) a sequence order prediction (SOP) task, which focuses on the temporal coherence of sequences. Pre-training creates a concept space and teaches the model the patterns in the sequence structure.
Next, in order to create a compact representation of an individual's life sequence, the model performed a classification task. The individual summary the model learns in this final step depends on the classification task; it identifies and condenses patterns that maximize certainty for a given downstream task.
For example, when researchers ask a model to predict a person's personality nuances, the person embedding space will be built around the key dimensions that contribute to personality.
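A rough sketch of the two pre-training objectives is below. The masking probability, the half-and-half split, and the token format are assumptions for illustration, not the paper's exact setup.

```python
import random

MASK = "[MASK]"

def mask_tokens(sequence, p=0.15):
    """Masked language model (MLM) task: hide random concept tokens;
    the originals become the prediction targets."""
    inputs, targets = [], []
    for tok in sequence:
        if random.random() < p:
            inputs.append(MASK)
            targets.append(tok)    # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets

def sop_example(first_half, second_half):
    """Sequence order prediction (SOP) task: label whether the two halves
    of a life sequence appear in their true temporal order."""
    if random.random() < 0.5:
        return first_half + second_half, 1   # correct temporal order
    return second_half + first_half, 0       # halves swapped

seq = ["AGE_30", "JOB_TEACHER", "AGE_31", "DIAG_FLU", "AGE_32", "MOVE_RURAL"]
print(mask_tokens(seq))
print(sop_example(seq[:3], seq[3:]))
```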
Accurate forecasting across domains
The first test of any model is predictive performance. life2vec not only surpasses the existing state of the art (SOTA), but can also make classification predictions in very different fields. The researchers tested their framework on two different tasks.
Predicting early mortality
The researchers estimated a person's likelihood of surviving the four years after January 1, 2016. This is a commonly used task in statistical modeling. Moreover, mortality prediction is closely related to other health prediction tasks, so to predict the correct outcomes, life2vec must model the development of individuals' health sequences as well as their labor histories.
Specifically, given a sequence representation, life2vec infers the likelihood that a person will survive for four years after the sequence ends (on January 1, 2016).
The researchers focused on forecasting younger age groups, including individuals aged 30 to 55, for whom mortality is difficult to predict.
The researchers demonstrate model performance using a corrected Matthews correlation coefficient, C-MCC [61], which adjusts the MCC for the presence of unlabeled samples.
Life2vec outperforms the baseline by 11%. Note that increasing the size of RNN models does not improve their performance.
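For reference, the standard (uncorrected) MCC can be computed as below with toy labels; the paper's C-MCC additionally corrects for unlabeled samples, a correction not reproduced in this sketch.

```python
from sklearn.metrics import matthews_corrcoef

# Toy labels for illustration: 1 = died within the four-year window.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # ranges over [-1, 1]; 0 is chance level
```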
Figure 2D below also breaks down performance across subgroups: intersectional groups based on age and gender, and groups based on sequence length.
Predicting the nuances of personality
Death as a prediction target is well defined and highly measurable.
To test the versatility of life2vec, the researchers turned to predicting “personality nuances”, an outcome at the other end of the measurement spectrum: something internal to an individual, typically measured with questionnaires.
Although difficult to measure, personality is an important characteristic that shapes people's thoughts, emotions, and behaviors and predicts life outcomes. Specifically, the researchers focused on personality nuances along the introversion-extroversion dimension (extroversion, for short, below), since extroversion is part of nearly all comprehensive models of basic personality structure that have emerged over the last century (in the Western world).
As their data set, the researchers used data collected on a large, representative group of individuals in the Danish Personality and Social Behavior Panel (POSAP) study.
The researchers randomly selected one item (personality nuance) for each extroversion facet and predicted individual-level answers.
The figure above shows that applying life2vec to life sequences not only allows the researchers to predict early mortality, but is also general enough to capture the nuances of personality.
life2vec scores higher than the RNN on all items, but the difference is statistically significant only for items 2 and 3. The fact that an RNN trained for this specific task could also extract personality signals highlights that, despite the power of the transformer model, a large part of what makes life2vec so versatile is the dataset itself.
Concept space: understanding the relationships between concepts
The novelty of the researchers' approach is that the algorithm learns a single joint multidimensional space that contains all events that may occur in human life. The researchers' exploration of this space began with visualization.
Global view
In the image above, PaCMAP is used to project the original 280-dimensional concept space onto a two-dimensional map that preserves both the local and global structure of the high-dimensional space.
Here, each concept is colored according to its type.
This coloring makes it clear that the overall structure is organized according to the key concept types of the synthetic language (health, job type, and so on), while interesting details emerge for birth year, income, social status, and other key demographic information. The structure of this space is highly robust, reproducing reliably across a range of conditions.
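A minimal sketch of such a projection is below, assuming the concept embeddings are available as a NumPy array and using the open-source `pacmap` package; the array contents are placeholders for the real embeddings.

```python
import numpy as np
import pacmap  # pip install pacmap

# Placeholder for the real (n_concepts, 280) concept-embedding matrix.
concept_vectors = np.random.rand(1000, 280)

# PaCMAP is designed to preserve both local and global structure.
reducer = pacmap.PaCMAP(n_components=2)
coords_2d = reducer.fit_transform(concept_vectors)
print(coords_2d.shape)  # (1000, 2): ready for a scatter plot colored by concept type
```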
The fine structure of conceptual space is meaningful. Digging deeper into the global layout, the researchers found that the model learned intricate connections between nearby concepts.
The researchers studied these local structures through neighbor analysis, which exploits the cosine distance between concepts in the original high-dimensional representation as a similarity measure.
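Under the assumption that the concept vectors are rows of a matrix, this neighbor analysis amounts to ranking all concepts by cosine similarity to a query concept, as in the sketch below. The vocabulary and vectors here are placeholders.

```python
import numpy as np

def nearest_concepts(query_vec, vectors, vocab, k=5):
    """Rank concepts by cosine similarity to the query vector in the
    original high-dimensional space (the query itself ranks first)."""
    sims = vectors @ query_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]

# Placeholder vocabulary and embeddings for illustration.
vocab = ["DIAG_HEART_ATTACK", "DIAG_STROKE", "JOB_NURSE", "INCOME_RAISE"]
vectors = np.random.rand(len(vocab), 280)
print(nearest_concepts(vectors[0], vectors, vocab, k=3))
```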
Person summaries
A summary is a single vector that summarizes essential aspects of a person's entire sequence of life events.
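As a hedged sketch, such a summary could be pooled from a sequence of event-token embeddings as below. Whether life2vec uses a dedicated summary token or another pooling scheme is an assumption here; mean pooling simply stands in for the model's actual mechanism.

```python
import numpy as np

def person_summary(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (seq_len, dim) matrix of event-token embeddings into a
    single person-summary vector, here via simple mean pooling."""
    return token_embeddings.mean(axis=0)

tokens = np.random.rand(512, 280)  # one person's encoded life sequence (placeholder)
summary = person_summary(tokens)   # a (280,) vector: the person embedding
print(summary.shape)
```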
Person summaries span the person-embedding space. To form a person summary, the model determines which aspects are relevant to the task at hand; in this sense, person summaries are conditioned on a specific prediction task. Below, the researchers focus on person summaries conditioned on the likelihood of death.
The image above visualizes the person-summary space.
Relative to mortality predictions, the model organizes individuals on a continuum from low to high estimated mortality (the point cloud in panel D).
In the figure, the researchers show true deaths through red diamonds, while the confidence of predictions is represented by the radius of the points (e.g., points with small radii are low-confidence predictions).
Additionally, a color map from yellow to green is used to display the estimated probabilities.
The researchers saw that Region 2, though populated mostly by older adults, still held a large proportion of young adults (Figure 5E), and it contained a small fraction of the real targets (Figure 5F).
Area B has a largely opposite structure, with mostly young people but also a significant number of older people (Figure 5E), and only one person actually deceased (Figure 5F).
When the researchers looked at actual deaths in the low-probability areas, they found that the causes of death for the five individuals closest to area 1 were as follows: two accidents, a brain malignancy, a cervical malignancy, and a myocardial infarction.
References:
https://arxiv.org/abs/2306.03009