Natural Language Clinical Pathways for Automated Coding of Topography and Morphology in Cancer Registries: Leveraging Healthcare Dataflows through the LN-PDTA Algorithm

Adele Zanfino, Carlotta Buzzoni, Antonio Giampiero Russo

Background

Cancer registries (CRs) are a key tool for research and public health in the field of oncology, as they collect, analyse, and archive, in a structured and systematic manner, information on cancer cases occurring within a specific population or geographical area. Indicators such as incidence, prevalence, survival, and mortality are essential for describing epidemiology, evaluating organised screening programmes, guiding healthcare planning and monitoring the impact of preventive and therapeutic measures.¹
To ensure the validity and comparability of data, it is crucial that clinical and pathological information is coded accurately and in accordance with international standards. The guidelines developed by the International Agency for Research on Cancer (IARC) and the European Commission’s Joint Research Centre (JRC) set out the standard procedures for cancer registration.^2-4 These guidelines, based on a structured system of rules and classifications, guide the work of the CRs in the detailed assessment of each individual case, setting out the criteria for determining the date of onset, the anatomical location of the tumour (topography), the histological type (morphology), the histopathological grade, and the stage at diagnosis. Today, this process is managed primarily by manual means, carried out by cancer registrars: in particular, the topography and morphology of each tumour are determined by interpreting the information contained in histopathological reports and medical records, using standard coding systems in accordance with ICD-O-3.⁵ Manual coding ensures control and verification, but requires considerable time and expert staff; furthermore, it is subject to intra- and inter-operator variability, which stems from individual interpretations of detailed reports and can only be mitigated through ongoing training and peer review.
Although international standards set strict time limits – such as the 23-month deadline for finalising annual incidence data suggested by the North American Association of Central Cancer Registries⁶ or the three-year deadline stipulated by the National Cancer Plan for regions to submit data to the National Cancer Registry⁷ – the growing complexity of the data and the increasing workload often make it difficult to meet these deadlines. To manage the increase in data volumes, speed up the registration process, lighten the workload of data collectors and preserve the completeness and accuracy of the data collected, several cancer registries have introduced forms of automated coding. In a hybrid approach, software analyses the free-text sections of pathology reports to suggest the anatomical site and morphology; a validation operator then checks the suggested matches, correcting them where necessary.^8-10 Notable examples of this approach can be found in the US “Surveillance, Epidemiology, and End Results” (SEER) Program, which has been using a system (SEER*DMS) for the pre-processing of pathology reports for several years, and in the Netherlands Cancer Registry, where a semi-automated module has been shown to reduce coding times by up to 30-40%. Similar, albeit more recent, experiences have been reported by the National Cancer Registration and Analysis Service (NCRAS) in the United Kingdom, where the proportion of reports successfully coded automatically has stood at around 80-90% for the most common cancers.^11-13 In Italy, regional pilot projects have reported concordance rates with fully manual coding of over 70-80%.^14-16 However, these algorithms are limited to predicting the topography, leaving the morphology to manual coding.^11,16
The incorrect classification of tumour types as well as the presence of missed cases are often due to inaccurate data sources or patients without hospital admissions; incorporating additional administrative data sources can improve the sensitivity and specificity of the process.⁹
In the context of these automated coding approaches, machine learning (ML) – one of the applications of artificial intelligence (AI) – is transforming medicine by providing advanced tools for the analysis and management of healthcare data, from diagnostics to risk prediction. In oncology, ML models can automate coding: supervised learning and deep learning integrate data from electronic medical records, hospital discharge records (HDR) and other dataflows, detecting complex patterns useful for identifying topography and morphology.
This paper introduces the concept of the natural language diagnostic-therapeutic-care pathway (LN-PDTA): a concise, chronological, and semantic sequence of the care events that describe the clinical pathway of each cancer patient. The main objectives are the development and validation of an algorithm based on long short-term memory (LSTM) recurrent neural networks which, by analysing the LN-PDTA string generated for each incident case, automatically assigns the topography-morphology combination. This approach tackles data heterogeneity, is scalable, reproducible and learns from real data. The model is designed to integrate into the routine processes of cancer registries, speeding up closure times and expanding analytical capacity to support clinical and epidemiological decisions.

Methods

Definition of the study population and inclusion criteria

The study was conducted on new cancer cases recorded by the Cancer Registry (CR) of the Agency for Health Protection of the Metropolitan Area of Milan (ATS Milano) between 1 January 2017 and 31 December 2018, involving residents within the ATS Milano area, which comprises the provinces of Milan and Lodi, with a total population of approximately 3.5 million. Malignant tumours were selected, excluding non-melanoma skin cancers and individuals with multiple tumours.
This project, relating to the consolidation and optimisation of the registration process, is included among the institutional objectives of the Cancer Registry (CR) of the ATS Milano. The legal basis for data processing is defined by law (Lombardy Regional Council Decree No. XI/6818 of 02.08.2022).

Data sources and time frame

Using deterministic record linkage based on a unique anonymised code, each case was linked to demographic information (sex and age), healthcare services from hospital discharge records (HDR), outpatient services (28/SAN), outpatient and inpatient drug dispensations, the Nominal Register of Causes of Death (ReNCaM) and coded pathological anatomy reports (PA). Events occurring within 180 days before or after the date of incidence (the earliest of: date of hospital admission with a diagnosis of neoplasm, date of the pathological examination, date of clinical/instrumental diagnosis, date of death from cancer) were considered. From the data sources, only information directly related to the diagnosis and treatment of the tumour was selected, as detailed below.
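The selection rule above can be illustrated with a minimal sketch. It assumes that the four candidate dates are available per case (with None where a source is absent) and that the 180-day window is inclusive at both ends; the function names and the example dates are hypothetical and do not come from the registry's code.

```python
from datetime import date

WINDOW_DAYS = 180  # window stated in the text; inclusive bounds are an assumption


def incidence_date(candidate_dates):
    """Return the earliest available candidate date (None marks a missing source)."""
    known = [d for d in candidate_dates if d is not None]
    if not known:
        raise ValueError("no candidate date available for this case")
    return min(known)


def in_window(event_date, incidence, days=WINDOW_DAYS):
    """True if the event falls within `days` before or after the incidence date."""
    return abs((event_date - incidence).days) <= days


# Hypothetical case: admission, pathology, clinical diagnosis, death from cancer
inc = incidence_date([date(2017, 3, 10), date(2017, 2, 20), None, None])
events = [
    ("hdr", date(2017, 3, 10)),
    ("chemo", date(2017, 6, 15)),
    ("radio", date(2017, 9, 30)),
]
kept = [name for name, d in events if in_window(d, inc)]
```

In this example the incidence date is the pathology date, and the radiotherapy event is discarded because it falls more than 180 days later.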

Extraction of tumour diagnoses and surgical procedures from the hospital admission dataflow

Within the hospital admission data flow (including inter-regional transfers), diagnosis codes referring to malignant tumours with a defined site (ICD-9-CM: 140*-194*; 200*-208*) were searched for; these codes identify potential admissions for solid and haematological neoplasms, excluding malignant tumours of other or ill-defined sites. Only the principal diagnosis and the first secondary diagnosis were considered, selecting the oncological one or, if both were oncological, the principal one. In the case of multiple admissions, the one closest to the date of incidence was considered.
With regard to the procedure, ICD-9-CM procedure codes were searched for in the ‘main procedure’ field within a predefined list. The selection was made by considering, depending on the procedure group, the first three or four digits of the code, in order to identify specific types of procedure (Table S1, online supplementary materials). In the case of multiple hospital admissions containing eligible procedures, the procedure closest to the date of incidence was considered.
Furthermore, for each subject, the identification code of the hospital was recorded; in the case of multiple hospital admissions, priority was given to the hospital where the selected surgical procedure was performed.

Cancer treatment definition

The provision of non-surgical treatments was reconstructed using data flows relating to hospital admissions, pharmaceuticals, and outpatient services, regardless of source. For each treatment, specific procedure codes and drugs were identified (Table S2, online supplementary materials). The procedures were recoded into three categories: radiotherapy, chemotherapy and hormone therapy.
A distinction was also made between pre- and post-operative chemotherapy; where no surgery was performed, palliative chemotherapy was included in the ‘pre-operative chemo’ category. For each patient and treatment, the event closest to the date of incidence was considered, taking into account only the first cycle in the case of repeated treatments.

Identification of drugs targeting specific cancers

From File F (specialist and/or high-cost pharmaceutical dispensing provided directly by inpatient facilities), all prescriptions with an ATC code belonging to group L01* (antineoplastics) or L02* (hormone therapy) were included, taking into account the full ATC code (7 characters). If the patient had received multiple dispensations with the same ATC code, only the one closest to the date of incidence was considered.

Association with cancer as the primary cause of death

For each deceased individual in the cohort, the cause of cancer-related death (ICD-10 codes C00–D48) was recorded, taking into account only the primary cause.

Standardisation and selection of pathological reports

From the PA data flow, tumour reports with a morphology coded as M8* or M9* were selected, excluding benign, uncertain, and metastatic tumours (where the last character was not 3). The SNOMED topography and morphology fields were normalised (removing spaces and special characters) and converted to ICD-O-3. For topography, only the first three characters of the code were considered, except in cases where the second character was A, C, D, F, Y, or X: in these cases, the first four characters were considered. In codes where the second and third characters were EA, the entire code was considered.
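The truncation rule for the topography field can be sketched as follows. This is only an illustration of the rule as worded above: the conversion from SNOMED to ICD-O-3 is not reproduced, upper-casing during normalisation is an assumption, and the example codes are made-up strings rather than real SNOMED codes.

```python
import re

# Second characters that trigger the four-character rule (from the text)
FOUR_CHAR_SECOND = set("ACDFYX")


def normalise(code):
    """Remove spaces and special characters; upper-casing is an assumption."""
    return re.sub(r"[^A-Za-z0-9]", "", code).upper()


def truncate_topography(code):
    """Apply the three/four/full-length rule to a normalised topography code."""
    c = normalise(code)
    if c[1:3] == "EA":
        return c        # second and third characters are EA: keep the whole code
    if len(c) > 1 and c[1] in FOUR_CHAR_SECOND:
        return c[:4]    # second character in A, C, D, F, Y, X: first four characters
    return c[:3]        # default: first three characters


# Hypothetical inputs, for illustration only
examples = ["T-04 000", "TC4-000", "TEA 123"]
truncated = [truncate_topography(e) for e in examples]
```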

Structuring information for prediction

Each case was assigned the topography-morphology string defined by the cancer registrars, consisting of the first 3 and 5 characters of topography and morphology respectively, according to the ICD-O-3 classification. For brevity, the CR-coded pair will hereinafter be referred to as ‘topo-morpho’. To allow for analysis at a more aggregated level, individual morphologies were also grouped into broader categories (morphological groups) according to the correspondences detailed in Table S3 (online supplementary materials).
Each event detected in the data flows was transformed into a descriptive token and incorporated into an alphanumeric string that includes: sex and age, admission codes (ICD-9-CM for diagnosis and procedure, hospital identifier), administration of hormone therapy, radiotherapy, and chemotherapy (before or after the date of diagnosis), ATC codes for antineoplastic drugs, cause of death, and the topography-morphology pair derived from pathological anatomy, selected according to the rules described above. The tokens were ordered chronologically, separated by the symbol |. Any missing items in the sequence (for example, a patient who did not receive radiotherapy) were replaced by a placeholder m (missing).
The maximum observed length of the LN-PDTA string in the analysed data was 16 tokens, but this value may vary depending on the completeness and clinical complexity of individual cases.
In the case of procedures performed on the same date, the order established by convention was as follows: chemotherapy, radiotherapy, hormone therapy, topography-morphology pair, ATC code of the specific drug, cause of death, cause of admission. If, on the same date, there were multiple procedures of the same type, these were listed in alphabetical order.
Table 1 shows two examples illustrating how the concatenation of events tracked from the dataflows in chronological order contributes to generating the LN-PDTA alphanumeric string.
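A minimal sketch of how such a string could be assembled is given below. The chronological ordering, the same-day order, the alphabetical tie-break, the | separator, and the m placeholder follow the description above; the token spellings, the position of sex and age at the start, and the choice to append placeholders at the end of the string are assumptions made purely for illustration.

```python
from datetime import date

# Same-day ordering convention described in the text
SAME_DAY_ORDER = ["chemo", "radio", "hormone", "topo_morpho", "atc", "death", "admission"]
PRIORITY = {kind: i for i, kind in enumerate(SAME_DAY_ORDER)}


def build_ln_pdta(sex, age, events, sep="|", missing="m"):
    """Build an LN-PDTA-like string.

    events: list of (event_date, kind, token) tuples, kind taken from SAME_DAY_ORDER.
    """
    # Sort by date, then by the conventional same-day order, then alphabetically
    ordered = sorted(events, key=lambda e: (e[0], PRIORITY[e[1]], e[2]))
    tokens = [sex, str(age)] + [token for _, _, token in ordered]
    # Assumption: one placeholder per absent event type, appended at the end
    present = {kind for _, kind, _ in events}
    tokens += [missing for kind in SAME_DAY_ORDER if kind not in present]
    return sep.join(tokens)


# Hypothetical patient with illustrative tokens
events = [
    (date(2017, 3, 1), "admission", "1749"),
    (date(2017, 3, 1), "topo_morpho", "C50-85003"),
    (date(2017, 4, 10), "atc", "L01XC03"),
    (date(2017, 4, 10), "chemo", "chemo_post"),
]
ln_pdta = build_ln_pdta("F", 62, events)
```

Here the two events dated 1 March are ordered with the topography-morphology pair before the cause of admission, and the three event types that never occur (radiotherapy, hormone therapy, death) are each represented by m.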

Development and application of the predictive model

For the application of the algorithm, the dataset was divided into a training set (80%) and a test set (20%). The division was stratified by topography, in order to ensure adequate representation of the most common cancer sites in both sets. Each LN-PDTA in the training set was matched with the CR topo-morpho (gold standard).
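A stratified 80/20 split of this kind can be sketched in plain Python as shown below. The random seed, the rounding rule within each stratum, and the field names are assumptions; the same result could be obtained with scikit-learn's train_test_split and its stratify argument.

```python
import random
from collections import defaultdict


def stratified_split(cases, key, test_fraction=0.2, seed=42):
    """Split cases into train/test, keeping the proportion of each stratum."""
    rng = random.Random(seed)  # seed is an arbitrary, hypothetical choice
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[key(case)].append(case)
    train, test = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical cohort: 60 breast (C50) and 40 lung (C34) cases
cases = [{"id": i, "topo": "C50" if i < 60 else "C34"} for i in range(100)]
train, test = stratified_split(cases, key=lambda c: c["topo"])
```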
The prediction of the topography-morphology combination was carried out using an LSTM-based deep learning model¹⁷, designed to learn semantic and sequential relationships within the natural language textual representation of diagnostic-therapeutic-care pathways. The LSTM architecture consists of recursively connected memory blocks, each containing self-connected memory cells and three multiplicative gates (input, output, and forget gates). These gates act as write, read, and reset mechanisms, enabling the model to effectively manage relevant information along the sequence and avoid loss of context. The use of an LSTM architecture, already well-established in the literature, both in other oncology applications¹⁸ and in studies based on information flows and/or electronic health records¹⁹, was deemed methodologically appropriate for the analysis of sequential data and for the management of long-term dependencies thanks to the gating mechanism (in particular the forget gate), which allows the retention and updating of information to be modulated over time.
For each patient, as a first step, the LN-PDTA text was tokenised and encoded into numerical sequences using a vocabulary constructed from the entire corpus, in which each token was assigned a unique index.
The encoded dataset was encapsulated in a custom class, containing the sequence of tokens and the numerical labels corresponding to topography and morphology.
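The encoding step might look like the sketch below. It assumes that index 0 is reserved for padding and that sequences are padded to a fixed length; the class and function names are hypothetical. The class implements only `__len__` and `__getitem__`, which is the interface a PyTorch Dataset would also expose.

```python
def build_vocab(strings, sep="|"):
    """Assign a unique index to every token in the corpus (0 reserved for padding)."""
    vocab = {"<pad>": 0}
    for s in strings:
        for token in s.split(sep):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(string, vocab, max_len=16, sep="|"):
    """Map tokens to indices and pad (or truncate) to max_len."""
    ids = [vocab[token] for token in string.split(sep)][:max_len]
    return ids + [0] * (max_len - len(ids))


class PathwayDataset:
    """Holds encoded sequences with numeric topography and morphology labels."""

    def __init__(self, strings, topo_labels, morpho_labels, vocab, max_len=16):
        self.items = [
            (encode(s, vocab, max_len), t, m)
            for s, t, m in zip(strings, topo_labels, morpho_labels)
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


# Hypothetical two-case corpus
corpus = ["F|62|C50-85003|m", "M|70|C34-80703|m"]
vocab = build_vocab(corpus)
dataset = PathwayDataset(corpus, topo_labels=[0, 1], morpho_labels=[0, 1], vocab=vocab, max_len=6)
```

Because the vocabulary is built from the entire corpus, as stated above, every token in the test set has an index; a vocabulary built from the training set alone would need an explicit unknown-token entry.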
The model (TumorModel) consists of:

one embedding layer that projects the tokens into a dense 64-dimensional space;
one LSTM layer with 128 hidden units that processes the sequence;
two final fully connected layers: one dedicated to topography classification and one to morphology classification, each with an output dimension equal to the number of classes.

During the forward pass, the token sequence is transformed via embedding and processed by the LSTM; finally, the vector of the last hidden state is used to produce the two predictions (topography and morphology) in parallel.
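The architecture described above could be written in PyTorch roughly as follows. This is a sketch under stated assumptions, not the authors' code: a single unidirectional LSTM layer, a padding index of 0, and the vocabulary size and class counts used in the example are all hypothetical.

```python
import torch
import torch.nn as nn


class TumorModel(nn.Module):
    """Embedding -> LSTM -> two parallel classification heads."""

    def __init__(self, vocab_size, n_topo, n_morpho, emb_dim=64, hidden_dim=128):
        super().__init__()
        # padding_idx=0 is an assumption about how padding was handled
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc_topo = nn.Linear(hidden_dim, n_topo)
        self.fc_morpho = nn.Linear(hidden_dim, n_morpho)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (num_layers, batch, hidden_dim)
        last_hidden = h_n[-1]                     # last hidden state of the top layer
        return self.fc_topo(last_hidden), self.fc_morpho(last_hidden)


# Hypothetical sizes: 500 tokens, 60 topography classes, 120 morphology classes
model = TumorModel(vocab_size=500, n_topo=60, n_morpho=120)
batch = torch.randint(1, 500, (4, 16))            # 4 sequences of 16 token indices
topo_logits, morpho_logits = model(batch)
```

Both heads read the same last hidden state, so the two predictions share the sequence representation and differ only in their final linear layer.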
Optimisation was carried out using the Adam algorithm with a learning rate of 0.001. The loss function used is the sum of two CrossEntropyLoss functions, one for each of the two labels. The model was trained for 10 epochs.²⁰ During each epoch, the model was set to training mode and, for each batch from the train_loader, the following operations were performed:

resetting the gradient;
predicting the ‘topo’ and ‘morpho’ classes;
calculating the total loss as the sum of the two cross-entropies;
backpropagation and updating the model weights.
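The four steps listed above correspond to a standard PyTorch training loop, sketched here on synthetic data. The compact two-head model, the batch size, the random seed, and the data dimensions are assumptions introduced only to make the example self-contained; the optimiser, learning rate, loss, and number of epochs follow the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class TwoHeadLSTM(nn.Module):
    """Compact stand-in for the two-head model described in the text."""

    def __init__(self, vocab_size, n_topo, n_morpho):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 64)
        self.lstm = nn.LSTM(64, 128, batch_first=True)
        self.fc_topo = nn.Linear(128, n_topo)
        self.fc_morpho = nn.Linear(128, n_morpho)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return self.fc_topo(h_n[-1]), self.fc_morpho(h_n[-1])


# Synthetic data: 64 sequences of 16 tokens, 5 topography and 7 morphology classes
x = torch.randint(1, 50, (64, 16))
y_topo = torch.randint(0, 5, (64,))
y_morpho = torch.randint(0, 7, (64,))
train_loader = DataLoader(TensorDataset(x, y_topo, y_morpho), batch_size=16, shuffle=True)

model = TwoHeadLSTM(vocab_size=50, n_topo=5, n_morpho=7)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(10):
    model.train()
    running = 0.0
    for tokens, topo, morpho in train_loader:
        optimizer.zero_grad()                              # reset the gradient
        topo_logits, morpho_logits = model(tokens)         # predict both classes
        loss = criterion(topo_logits, topo) + criterion(morpho_logits, morpho)
        loss.backward()                                    # backpropagation
        optimizer.step()                                   # update the weights
        running += loss.item()
    epoch_losses.append(running)                           # aggregate loss per epoch
```

Summing the two cross-entropies gives both tasks equal weight; a weighted sum would be a possible variation, but it is not what the text describes.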

The training involved monitoring the aggregate loss at each epoch in order to verify the stability of the learning process. The evaluation phase was conducted separately on the test set, using the CR labels as the gold standard.
The analyses were performed using Python (version 3.10) and the PyTorch (version 2.1.0)²¹ and scikit-learn (version 1.3.1)²² libraries.

Performance evaluation metrics

The performance of the LSTM model was evaluated on the test set for both prediction classes: topography and morphology. For each batch of the test set, the gold standard labels and the model’s predictions for topography and morphology were collected. Accuracy – defined as the percentage of correct predictions out of the total number of observations – was calculated both separately for the two components, reflecting the model’s multi-task nature, and for the prediction of the entire topography-morphology combination, the main objective of the study. In addition, aggregate metrics of precision, recall and F1-score were calculated using the relevant functions from the scikit-learn library.²² For each topo-morpho combination, a contingency table was constructed by measuring:

precision (positive predictive value): the ability to make correct predictions whilst avoiding false positives (i.e., subjects incorrectly assigned to that topo-morpho class instead of their true one);
recall (sensitivity): the ability to identify all instances belonging to a class;
F1-score: the harmonic mean of precision and recall, which provides a balanced indicator of performance for each class.

For an overall assessment, averages of these metrics were calculated by assigning the same weight to all individual classes (of topography and morphology), regardless of their frequency. Furthermore, 95% confidence intervals were calculated using the bootstrap method.
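The macro-averaged metrics and the bootstrap intervals can be illustrated with the sketch below, written in plain Python rather than with scikit-learn. Several details are assumptions: the percentile variant of the bootstrap, the number of resamples, the seed, scoring an undefined precision or recall as 0, and averaging over the union of gold and predicted classes.

```python
import random


def macro_prf(gold, pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    classes = sorted(set(gold) | set(pred))
    p_sum = r_sum = f_sum = 0.0
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        p_sum, r_sum, f_sum = p_sum + precision, r_sum + recall, f_sum + f1
    n = len(classes)
    return p_sum / n, r_sum / n, f_sum / n


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def bootstrap_ci(gold, pred, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a statistic computed on (gold, pred)."""
    rng = random.Random(seed)
    n = len(gold)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample cases with replacement
        values.append(stat([gold[i] for i in idx], [pred[i] for i in idx]))
    values.sort()
    lower = values[int((alpha / 2) * n_boot)]
    upper = values[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical labels
gold = ["A", "A", "B", "B", "C"]
pred = ["A", "B", "B", "B", "C"]
precision, recall, f1 = macro_prf(gold, pred)
ci_low, ci_high = bootstrap_ci(gold, pred, accuracy)
```

Note that the macro F1 is the mean of the per-class F1 values, which in general differs from the harmonic mean of the macro precision and macro recall.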

Analysis of token saliency based on topography and morphology

To understand which tokens have the greatest influence on predictions, a token-level saliency analysis was carried out separately for each topography and morphology class, using a function based on hooks applied to the model. The average saliency of each token per class was normalised between 0 and 1 to facilitate comparison. This process identified the 20 most influential tokens for each topography and morphology class, highlighting the words or textual symbols most relevant to the model’s decision-making process.
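The text does not specify how the hook-based saliency function was implemented. One common option, shown here purely as an assumption, is gradient-based saliency: a forward hook captures the embedding output, the logit of the class of interest is backpropagated, and the gradient norm at each position is min-max normalised to the interval from 0 to 1. The small single-head model and its dimensions are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Small single-head stand-in for the model (hypothetical dimensions)."""

    def __init__(self, vocab_size, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8)
        self.lstm = nn.LSTM(8, 16, batch_first=True)
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return self.head(h_n[-1])


def token_saliency(model, token_ids, target_class):
    """Gradient norm of the target logit with respect to each token embedding."""
    captured = {}

    def hook(module, inputs, output):
        output.retain_grad()          # keep the gradient of this non-leaf tensor
        captured["embedded"] = output

    handle = model.embedding.register_forward_hook(hook)
    try:
        model.zero_grad()
        logits = model(token_ids.unsqueeze(0))   # add a batch dimension
        logits[0, target_class].backward()
    finally:
        handle.remove()

    scores = captured["embedded"].grad[0].norm(dim=1)   # one score per position
    span = scores.max() - scores.min()
    if span > 0:
        return (scores - scores.min()) / span           # min-max normalisation
    return torch.zeros_like(scores)


model = TinyClassifier(vocab_size=30, n_classes=4)
sequence = torch.tensor([3, 7, 12, 5, 9, 1])
scores = token_saliency(model, sequence, target_class=2)
```

Averaging such per-position scores over all cases of a class, token by token, would then give a per-class ranking of the most influential tokens.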

Comparison with a deterministic approach

The performance of the proposed algorithm was compared with that of a static, deterministic approach,²³ according to which, in the test set, the predicted topo-morpho string is assigned only if there is a match with labelled strings present in the training set. In this scenario, the algorithm does not consider any form of partial or approximate similarity and only recognises exact matches. This comparison allowed us to assess the effectiveness of the proposed model in handling the heterogeneity and variability present in real-world data, highlighting the advantages of the approach adopted.
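The deterministic comparator amounts to a lookup table, as in the following sketch. How the original approach resolved a training string associated with more than one label is not stated; choosing the most frequent label is an assumption, as are the function names and the example strings.

```python
from collections import Counter, defaultdict


def fit_exact_match(train_strings, train_labels):
    """Map each training string to its most frequent topo-morpho label."""
    counts = defaultdict(Counter)
    for string, label in zip(train_strings, train_labels):
        counts[string][label] += 1
    # Assumption: ties between labels are resolved by frequency
    return {string: c.most_common(1)[0][0] for string, c in counts.items()}


def predict_exact_match(lookup, strings):
    """Return the stored label for an exact match, or None when there is none."""
    return [lookup.get(string) for string in strings]


# Hypothetical strings and labels
train_strings = ["F|62|a", "F|62|a", "M|70|b"]
train_labels = ["C50-85003", "C50-85003", "C34-80703"]
lookup = fit_exact_match(train_strings, train_labels)

predictions = predict_exact_match(lookup, ["F|62|a", "F|80|z"])
unassigned = sum(p is None for p in predictions) / len(predictions)
```

The share of None predictions is the proportion of test cases left without a code, which is the quantity the comparison with the LSTM model focuses on.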

Results

Use and coverage of healthcare sources in the cohort

Between 1 January 2017 and 31 December 2018, 34,168 new cases were identified among residents of the ATS Milano area who met the previously defined inclusion criteria. Of these, 49% (N. 16,723) were male and 51% (N. 17,445) were female. The distribution by age group was as follows: 0.9% (N. 290) under the age of 18, 2.0% (N. 691) aged between 18 and 34, 16.4% (N. 5,608) aged between 35 and 54, 45.1% (N. 15,412) aged between 55 and 74, and 35.6% (N. 12,167) aged 75 or over.
Table 2 shows access to different types of healthcare services. In the 180 days preceding or following the date of diagnosis, 79% of the cohort (N. 26,859) were hospitalised at least once for cancer-related reasons; of these, 82% (65% of the total, N. 22,110) underwent surgery; as regards non-surgical treatments, 54% of patients (N. 18,389) underwent chemotherapy (11% before, 50% after surgery), 17% (N. 5,668) underwent hormone therapy and 27% (N. 9,087) underwent radiotherapy. For 12% (N. 4,119), dispensing of specific anti-cancer drugs was recorded in the pharmaceutical dataflows; 16% (N. 5,598) died from cancer; 66% (N. 22,481) had a pathology report containing the topography code; 56% (N. 19,139) had a pathology report containing the morphology code.
Each case in the cohort was thus associated with the topo-morpho string as assigned by the cancer registrars (gold standard) and the healthcare services string (LN-PDTA), comprising 16 tokens, corresponding to the maximum number of healthcare services recorded among the patients in the cohort.

Model performance and comparison of cancer sites

The dataset was divided into a training set, comprising 80% of the cases in the cohort (N. 27,424), and a validation set comprising the remaining 20% (N. 6,744).
Table 3 shows the performance of the proposed algorithm on the validation dataset and for specific sites (breast, colon and rectum, prostate, lung, and biliary tract). The proposed algorithm correctly predicts the topography in 89% of cases and the morphology in 59%; in 56% of cases, the combination of topography and morphology is correct. When the analysis is restricted to breast cancers, the algorithm performs better: it correctly predicts the topography in almost all cases (98.5%), and manages to predict both topography and morphology simultaneously in 73% of cases. For lung cancers, the second most common type, the topography is correct in 94% of cases, the morphology in 58%, and both in 56%. For prostate cancers, the model’s performance is 54.7% overall, with a more accurate prediction of topography (97.8%) and a prediction of morphology of 55.2%. For bile duct cancers, however, the algorithm performs less well than on the entire validation dataset, with correct prediction of topography in 52%, morphology in 61%, and both in 43%.
Compared to the proposed algorithm, the static and deterministic approach – which assigns a topo-morpho string based on exact matches – performs less well and, in 26% of cases, fails to assign a topography and morphology to the instance, as the strings for these individuals are not present in the training set. These cases are, however, recovered by the proposed algorithm, which uses an LSTM network to predict topography and morphology even in the absence of exact matches.

Table 4 shows the model’s performance in terms of average recall (sensitivity), precision, and F1-score. The proposed algorithm achieves an average sensitivity (across all topo-morpho classes) of 56%, with a precision of 56%, resulting in an average F1-score of 56%. For breast cancers, the algorithm shows an increase in both sensitivity and precision. In cases of colorectal and lung cancers, the results are generally more satisfactory compared to the entire validation dataset. For prostate and bile duct cancers, sensitivity and precision are in line with the values observed across the entire dataset, with no significant deviations.
The performance of the static algorithm is generally lower in terms of F1-score, with an average sensitivity showing a significant reduction to 8% and a precision of 34%.

Performance evaluation based on the probability of a match

The final series of analyses aims to answer the following question: if only the topo-morpho labels whose predicted probability exceeds a threshold of x% were considered valid, how would the model’s performance change?
Table 5 shows the algorithm’s performance for increasing probability thresholds: 25%, 50%, 75%, and 90%. When a 25% probability threshold is applied, the metrics are almost entirely consistent with those of the base model, indicating that the majority of predictions are made with probabilities in this range. By increasing the minimum threshold to 75% probability, the number of cases being assigned a label is 1,961 (29% of the validation dataset), with an average sensitivity of 23%, a precision of 82%, and an F1-score of 36%. Finally, by further restricting the analysis to strings with at least a 90% probability of a correct prediction, predictions are obtained for 472 subjects (7% of the validation dataset). Precision reaches a value of over 90%, indicating a low rate of false positives, at the cost of a significant drop in recall (6%), resulting in an F1-score of 11%.
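The thresholding logic can be sketched as below. The text does not state how the probability of a combined topo-morpho prediction was obtained, so the sketch simply takes one probability per case as given; treating the threshold as inclusive is also an assumption, and the example values are invented.

```python
def evaluate_at_threshold(gold, pred, prob, threshold):
    """Keep predictions whose probability reaches the threshold.

    Returns the number of retained cases, their share of all cases (coverage),
    and the proportion of retained predictions that are correct.
    """
    kept = [(g, p) for g, p, q in zip(gold, pred, prob) if q >= threshold]
    coverage = len(kept) / len(gold)
    correct = sum(g == p for g, p in kept) / len(kept) if kept else float("nan")
    return len(kept), coverage, correct


# Hypothetical predictions with their associated probabilities
gold = ["A", "A", "B", "B", "C"]
pred = ["A", "B", "B", "B", "C"]
prob = [0.95, 0.40, 0.80, 0.60, 0.30]

low = evaluate_at_threshold(gold, pred, prob, 0.25)
high = evaluate_at_threshold(gold, pred, prob, 0.75)
```

Raising the threshold retains fewer cases but a larger share of correct ones, which is the same trade-off between coverage and precision that Table 5 reports.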

Token saliency

Finally, the saliency analysis made it possible to identify the most informative sources in the composition of the LN-PDTA sequence (Figure 1): the three most relevant dataflows, based on the normalised average importance of the tokens derived from them, were found to be PA, HDR, and ReNCaM for both topography and morphology prediction.

Discussion

The main contribution of this work is the introduction of an automated coding algorithm capable of jointly predicting the topography and morphology of tumours by integrating information from healthcare dataflows. In the test set, the model achieved an overall accuracy of 56% for the topography-morphology combination, a significant result in the context of a multi-class classification problem with over 400 possible categories. Although limitations are observed in the prediction for less frequent and more heterogeneous sites, such as the bile ducts and other low-incidence sites, for which the proportion of correct classifications drops to 43%, superior performance is achieved for certain high-incidence cancer sites, such as the breast and colorectal region (73% and 61%, respectively).
Given the complexity of the cancer coding process, which has traditionally been entrusted to expert coders, these results represent a potential step forward compared with the algorithms described in the literature to date:^11,14,16,24 these focus on predicting topography, achieving sensitivity rates ranging from 58% to 78%, and rely on manual classification to assign the histological diagnosis. Some studies have explored the automatic classification of morphology based on pathology reports, adopting rule-based approaches or supervised models, often focused on single languages or national contexts and on relatively homogeneous datasets, limiting themselves to a subset of sites and morphologies.²⁵ Other studies propose coding support systems that suggest possible ICD-O codes for topography and morphology, without, however, explicitly addressing the problem of joint prediction as a high-dimensional multi-class task.²⁶
The LN-PDTA string, by leveraging the sequential nature and context of healthcare data flows, summarises each patient’s clinical journey in a ‘phrase’ of tokens that describes the key events (admissions, treatments, PA reports, death, etc.) in chronological order. This allows for the simultaneous integration of different types of information – diagnostic, therapeutic, and follow-up – going beyond the use of isolated individual codes typical of many previous algorithms.
On the other hand, the strengths associated with using the string in an LSTM network are highlighted by comparison with a deterministic approach, which requires an exact match between healthcare service strings: whilst the deterministic algorithm fails to assign any code in around a quarter of cases (26%), the LSTM network is always able to propose a prediction, even in the presence of partial or inconsistent information, with better overall performance metrics (recall 56%, precision 56%).
An analysis stratified by the probability associated with each prediction suggests a potential strategy for the practical implementation of the algorithm in the day-to-day operations of cancer registries. By limiting the use of the model to cases with a medium-to-high probability of correct prediction (≥75%), a precision of 82% is achieved, suggesting a targeted use of the model to support manual coding rather than as a fully autonomous tool. Although manual case detection remains essential, a reduction of even just 20-30% in manual work, particularly during the stages of data submission to national and international agencies, can have a significant positive impact. This would allow registries to focus resources on less frequent cases requiring more complex coding, thereby improving the overall efficiency of the process.
The timeliness of cancer registry coding is a common priority across all countries, particularly in Western nations, as it enables the effective monitoring and surveillance of cancer. According to a recently published study, based on US SEER data, it takes approximately 28 months from the end of the reference diagnosis year until data publication.¹¹ The Centers for Disease Control and Prevention (CDC) have highlighted the need for faster detection of new cancer cases and are promoting initiatives aimed at modernising and accelerating the entire registration process.²⁷
This is a time when big data and artificial intelligence models are revolutionising every aspect of daily life; the current structure of cancer registries, which relies on the manual work of expert data collectors, seems to be a perfect example of where these technological advancements could lead to a significant improvement in performance.
The model proposed in this study has demonstrated that the integrated and automated use of various healthcare data sources (including pathology reports, hospital discharge records, outpatient and pharmaceutical data flows, as well as the cause-of-death registry) can significantly help to reduce the workload of cancer registries, whilst also speeding up the automated coding of cases. A further advantage lies in the system’s flexibility: periodic retraining can be carried out to reflect changes in clinical practice, such as the introduction of new drugs (for example, molecularly targeted therapies or recently marketed immunotherapies) or new radiotherapy protocols, as well as revisions or updates to guidelines or registration rules.
The use of clinical pathways in natural language (LN-PDTA), combined with a neural network, would allow, by determining the appropriate course of action at different reliability thresholds, the automation of a portion of topo-morpho coding, significantly reducing processing times and lightening the workload of coders. In a risk-based operational framework, high-confidence predictions can be implemented directly, whilst complex cases remain under human supervision, achieving a more efficient balance between productivity and quality.
The increase in efficiency serves two strategic and mutually complementary objectives: extending the coverage of registries to areas that are currently only partially (or not at all) covered, and improving the timeliness of registration in areas where there is a demand for real-time data – for example, those under high environmental pressure, where rapid and updatable estimates are needed to monitor the health status of the population. In areas exposed to high emission levels or specific industrial sources, the availability of an automated coding system would provide frequent incidence estimates by location and, where appropriate, by demographic group; this would allow for the calculation of indicators stratified by territory and period, enabling near-continuous surveillance and the timely identification of deviations from expected trends, thereby guiding prevention and mitigation measures.
This tool can be adapted to different contexts, as it is based on the local mapping of dataflows (inpatient admissions, outpatient care, pharmacy, pathology, mortality) and on a scheduled re-training process that takes into account regional characteristics, clinical and organisational changes, and the introduction of new drugs or protocols. The roll-out process can be progressive, starting with established registries to refine rules and tools, moving on to assisted transfer to registries in the start-up or consolidation phase, then to priority roll-out in environmentally critical areas, and finally to the evolutionary maintenance of the model.
To maximise public benefit, the LN-PDTA can be enriched with imaging and laboratory data, which are already used in the assessment of clinical care pathways alongside CR data,²⁸ or with address geocoding and links to environmental layers (air quality indicators, proximity to emission sources, vulnerability measures).^29,30
Actual effectiveness can be assessed by examining the expansion of geographical coverage and of the population covered, the reduction in the delay between the date of occurrence and the availability of usable estimates, the agreement between automated coding and the gold standard, and the gradual reduction in the gaps in timeliness and completeness between priority and non-priority areas. These results require unified governance of coding rules, the training of data collectors and data managers in the use of algorithmic triage, and standardised reporting to decision-makers that transparently communicates the quality, limitations, and uncertainties of the estimates produced.
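Two of these indicators, agreement with the gold standard and registration delay, can be sketched in a few lines. This is an illustration under stated assumptions rather than the evaluation protocol of the study: the choice of Cohen's kappa as the agreement measure, the function names and all example values are hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import median


def percent_agreement(auto, gold):
    """Share of cases where the automated code equals the manual code."""
    return sum(a == g for a, g in zip(auto, gold)) / len(gold)


def cohen_kappa(auto, gold):
    """Chance-corrected agreement between automated and manual coding."""
    n = len(gold)
    po = percent_agreement(auto, gold)
    auto_freq, gold_freq = Counter(auto), Counter(gold)
    # Expected agreement if the two codings were independent.
    pe = sum(auto_freq[c] * gold_freq[c] for c in gold_freq) / (n * n)
    # Convention used here: if both codings use one identical code
    # throughout (pe == 1), report perfect agreement.
    return (po - pe) / (1 - pe) if pe < 1 else 1.0


def median_delay_days(pairs):
    """Median days between date of occurrence and availability of the record."""
    return median((available - occurred).days for occurred, available in pairs)


# Made-up codes and dates, for illustration only.
auto_codes = ["C50.9", "C50.9", "C34.1", "C18.7"]
gold_codes = ["C50.9", "C50.4", "C34.1", "C18.7"]
delays = [
    (date(2024, 1, 10), date(2024, 3, 10)),
    (date(2024, 2, 1), date(2024, 2, 21)),
    (date(2024, 5, 1), date(2024, 5, 31)),
]
print(percent_agreement(auto_codes, gold_codes))
print(cohen_kappa(auto_codes, gold_codes))
print(median_delay_days(delays))
```

Computing the same quantities separately for priority and non-priority areas would give the gap measures mentioned above.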
However, certain limitations remain. Rare cancers follow highly heterogeneous and non-reproducible pathways, and few examples are available for training: the model therefore struggles to learn robust representations, and performance suffers. The quality and completeness of the dataflows are crucial: fragmented or missing data hinder the construction of coherent LN-PDTAs and reduce predictive ability. Validation was conducted in a single geographical area, that of ATS Milano; multicentre verification is therefore required, on populations with different sociodemographic compositions and within differently organised healthcare systems, to confirm generalisability and identify any necessary adaptations. Furthermore, performance depends on the period analysed: changes in clinical practice, organisational pathways and the availability and use of drugs may alter the observed distributions and, consequently, the stability of the model. To maintain accuracy over time, an iterative maintenance programme involving retraining on cohorts from different years is essential, so as to account for changes in dataflows and in the current clinical and care landscape.

Conclusions

This study provides an operational framework that is immediately scalable and transferable to different geographical areas: the natural language representation of clinical care pathways (LN-PDTA), combined with a neural network, correctly assigns over half of the topography and morphology codes. When integrated into a risk-based decision-making process with increasing reliability thresholds, it can semi-automate or automate a substantial proportion of topography and morphology coding in cancer registries, significantly reducing registration times and freeing up specialist resources for activities with greater added value.
This tool, which can be refined through progressive data enrichment, enables public health practitioners to expand cancer registries, accelerate registration and provide rapid and reliable estimates. This results in strengthened surveillance and a greater capacity to promptly guide health protection policies, whilst maintaining high standards of quality, transparency and reproducibility.

Conflicts of interest: none declared.

Funding: this study was funded by the Italian Ministry of Health as part of the National Plan for Complementary Investments (PNC), under the ‘Health, Environment, Biodiversity and Climate’ scheme (Project Code E19I23001260001). Associated template: 231000. Lead Partner/Proponent: Puglia Region through AReSS Puglia (Regional Council Decision 1199/2023). Principal Investigator: Lucia Bisceglia.
The funding body (Ministry of Health) played no role in the design and conduct of the study, nor in the collection, management, analysis and interpretation of the data.

Authorship: study conception and design: Antonio Giampiero Russo; data collection, analysis, or interpretation: all authors; drafting of the manuscript: all authors; critical review of the manuscript for significant intellectual content: all authors; statistical analysis: Adele Zanfino, Carlotta Buzzoni; funding: Antonio Giampiero Russo; administrative, technical, or material support: Adele Zanfino, Carlotta Buzzoni; supervision: Antonio Giampiero Russo, Carlotta Buzzoni.

References

Bouchardy C, Rapiti E, Benhamou S. Cancer registries can provide evidence-based data to improve quality of care and prevent cancer deaths. Ecancermedicalscience 2014;8:413. doi: 10.3332/ecancer.2014.413
Esteban D, Whelan S, Laudico A, Parkin DM (eds). Manual for Cancer Registry Personnel. IARC Technical Report No. 10. Lyon, IARC, 1995. Available from: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Technical-Publications/Manual-For-Cancer-Registry-Personnel-1995 (last accessed: 06.10.2025).
Parkin DM, Chen VW, Ferlay J, Galceran J, Storm HH, Whelan SL (eds). Comparability and Quality Control in Cancer Registration. IARC Technical Report No. 19. Lyon, IARC, 1994. Available from: https://publications.iarc.fr/Book-And-Report-Series/Iarc-Technical-Publications/Comparability-And-Quality-Control-In-Cancer-Registration-1994 (last accessed: 06.10.2025).
Martos C, Giusti F, Van Eycken E, Visser O. A common data quality check procedure for European cancer registries. European Commission. Ispra, JRC, 2023. Available from: https://www.encr.eu/sites/default/files/Recommendations/JRC132486_cancer_data_quality_checks_procedure_report_2.0.pdf
World Health Organization. International classification of diseases for oncology, 3rd Edition (ICD-O-3). Geneva, WHO, 2013. Available from: https://www.who.int/standards/classifications/other-classifications/international-classification-of-diseases-for-oncology (last accessed: 06.10.2025).
Hofferkamp J (ed). Standards for Cancer Registries Volume III. Standards for Completeness, Quality, Analysis, Management, Security and Confidentiality of Data. Springfield (IL), North American Association of Central Cancer Registries, 2008. Available from: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/ssr-anapath/Standards%20for%20Cancer%20Registries,%20Volume%20III.pdf
Ministero della Salute. Piano Oncologico Nazionale: documento di pianificazione e indirizzo per la prevenzione e il contrasto del cancro 2023-2027. Available from: https://www.osservatorionazionalescreening.it/sites/default/files/allegati/PON%202023-2027.pdf (last accessed: 06.10.2025).
Nguyen AN, Moore J, O’Dwyer J, Philpot S. Automated Cancer Registry Notifications: Validation of a Medical Text Analytics System for Identifying Patients with Cancer from a State-Wide Pathology Repository. AMIA Annu Symp Proc 2017;2016:964-73.
Simonato L, Canova C, Corrao G, Costa G, Tessari R. Ricerca e sviluppo di algoritmi: la definizione di alcune patologie neoplastiche. Epidemiol Prev 2008;32(3) Suppl:94-96.
Ferretti S, Guzzinati S, Zambon P, et al. Stima dell’incidenza del carcinoma mammario attraverso il flusso dei ricoveri ospedalieri: confronto con i dati dei Registri tumori. Epidemiol Prev 2009;33(4-5):147-53.
Chen HS, Negoita S, Schwartz S, et al. Toward real-time reporting of cancer incidence: methodology, pilot study, and SEER Program implementation. J Natl Cancer Inst Monogr 2024;2024(65):123-31. doi: 10.1093/jncimonographs/lgae024
Langhout SAM, Hermans SJF, Smit AJT, et al. Real-time data in cancer registries: Validation of an automated data extraction system. iScience 2025;28(8):113056. doi: 10.1016/j.isci.2025.113056
Nguyen AN, Moore J, O’Dwyer J, Philpot S. Assessing the Utility of Automatic Cancer Registry Notifications Data Extraction from Free-Text Pathology Reports. AMIA Annu Symp Proc 2015;2015:953-62.
Martina S, Ventura L, Frasconi P. Classification of Cancer Pathology Reports: A Large-Scale Comparative Study. IEEE J Biomed Health Inform 2020;24(11):3085-94. doi: 10.1109/JBHI.2020.3005016
Tagliabue G, Maghini A, Fabiano S, et al. Consistency and accuracy of diagnostic cancer codes generated by automated registration: comparison with manual registration. Popul Health Metr 2006;4:10. doi: 10.1186/1478-7954-4-10
Tognazzo S, Andolfo A, Bovo E, et al. Quality control of automatically defined cancer cases by the automated registration system of the Venetian Tumour Registry. Quality control of cancer cases automatically registered. Eur J Public Health 2005;15(6):657-64. doi: 10.1093/eurpub/cki035
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735-80. doi: 10.1162/neco.1997.9.8.1735
Kaddes M, Ayid YM, Elshewey AM, Fouad Y. Breast cancer classification based on hybrid CNN with LSTM model. Sci Rep 2025;15(1):4409. doi: 10.1038/s41598-025-88459-6
Kiser AC, Shi J, Bucher BT. An explainable long short-term memory network for surgical site infection identification. Surgery 2024;176(1):24-31. doi: 10.1016/j.surg.2024.03.006
Zhang Z, Sabuncu MR. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv Neural Inf Process Syst 2018;31:8792-802.
Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 2019;32:8024-35.
Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825-30.
Sayers A, Ben-Shlomo Y, Blom AW, Steele F. Probabilistic record linkage. Int J Epidemiol 2016;45(3):954-64. doi: 10.1093/ije/dyv322
Qiu JX, Yoon HJ, Fearn PA, Tourassi GD. Deep Learning for Automated Extraction of Primary Sites from Cancer Pathology Reports. IEEE J Biomed Health Inform 2018;22(1):244-51. doi: 10.1109/JBHI.2017.2700722
Hammami L, Paglialonga A, Pruneri G, et al. Automated classification of cancer morphology from Italian pathology reports using Natural Language Processing techniques: A rule-based approach. J Biomed Inform 2021;116:103712. doi: 10.1016/j.jbi.2021.103712
Villena F, Báez P, Peñafiel S, Rojas M, Paredes I, Dunstan J. Developing and Validating an Automatic Support System for Tumor Coding in Pathology Reports in Spanish. JCO Clin Cancer Inform 2025;9:e2400124. doi: 10.1200/CCI.24.00124
US Centers for Disease Control and Prevention. National Program of Cancer Registries. Data Modernization. 2024. Available from: https://www.cdc.gov/national-program-cancer-registries/data-modernization/index.html (last accessed: 06.10.2025).
Frammartino B, Crocetti E, Buzzoni C, Cereda D, Russo AG. Valutazione dell’appropriatezza della prescrizione del PSA come test di screening opportunistico del carcinoma prostatico: i dati dell’Agenzia di Tutela della Salute della Città Metropolitana di Milano. Epidemiol Prev 2025;49(5-6):415-23. doi: 10.19191/EP25.5-6.001
Murtas R, Andreano A, Greco MT, Tunesi S, Russo AG. Cancer incidence and congenital anomalies evaluation in the contaminated sites of Sesto San Giovanni – the SENTIERI Project. Ann Ist Super Sanita 2019;55(4):345-50. doi: 10.4415/ANN_19_04_07
Tunesi S, Bergamaschi W, Russo AG. Estimated number of deaths attributable to NO2, PM10, and PM2.5 pollution in the Municipality of Milan in 2019. Epidemiol Prev 2024;48(1):12-23. doi: 10.19191/EP24.1.A660.001. Erratum in: Epidemiol Prev 2024;48(4-5):388.
