Poster Presentation Clinical Oncology Society of Australia Annual Scientific Meeting 2022

Evaluating automated data extraction for lung cancer pathology reports in NSW Cancer Registry (#274)

Hanyu Chen 1, Sheena Lawrance 1, Claire Cooke-Yarborough 1
  1. Cancer Institute NSW, Sydney, NSW, Australia

Aims

The New South Wales Cancer Registry (NSWCR) continues to evaluate artificial intelligence (AI) software that uses natural language processing and complex algorithms to extract data from electronic pathology reports. This project aimed to evaluate the software's ability to extract selected lung cancer data variables, and to develop a semi-automatic data extraction approach for lung cancer reports.

Methods

Twenty-two standard data items in the Structured Reporting Protocol for Lung Cancer (2nd edition) were mapped to data items in the software. The items evaluated included macroscopic features (e.g. laterality), microscopic features (e.g. grade), ancillary markers (e.g. immunohistochemistry and biomarkers) and TNM stage. Automatic extractions were compared with manual extractions (the gold standard), and discrepancies were evaluated.
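The comparison described above can be illustrated with a minimal sketch. The abstract does not state how agreement scores were calculated, so this assumes simple percent agreement per data item: the share of reports in which the automatic value equals the manual value. The function name `agreement_score` and the sample laterality values are hypothetical.

```python
from typing import Optional, Sequence


def agreement_score(
    automatic: Sequence[Optional[str]],
    manual: Sequence[Optional[str]],
) -> float:
    """Return the fraction of reports where the automatic value matches
    the manual (gold standard) value for one data item.

    Assumption: simple percent agreement; the abstract does not specify
    the formula actually used.
    """
    if len(automatic) != len(manual):
        raise ValueError("Both lists must cover the same reports.")
    if not manual:
        raise ValueError("At least one report is required.")
    # A null (None) automatic value counts as a mismatch unless the
    # manual value is also null.
    matches = sum(a == m for a, m in zip(automatic, manual))
    return matches / len(manual)


# Hypothetical laterality values for four reports.
auto_values = ["left", "right", None, "left"]
manual_values = ["left", "right", "left", "left"]
score = agreement_score(auto_values, manual_values)
```

In this made-up example, three of the four reports agree, so the score is 0.75. Treating a null automatic value as a mismatch is a design choice of the sketch, not a rule taken from the study.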

Results

Agreement scores were high when the data domain was straightforward, e.g. 89% agreement for a ‘positive’/‘negative’ result for Napsin A staining. Data items with specific titles, such as T stage and TTF-1, also had higher agreement scores (88% and 87% respectively). Where there were discrepancies or multiple values extracted, a null value resulted. The AI software could not recognise some text and symbols, such as ‘washing’/‘brushing’ as a procedure and ‘+’/‘-’ in the results of ancillary markers. In contrast to earlier evaluations (breast, colorectal and prostate), N stage had an agreement score of only 38%, as lymph nodes denoted as ‘N1’ or ‘N2’ were mistakenly extracted as N stage values.
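The N stage result can be illustrated with a small sketch of how such an error may arise. The abstract does not describe the software's internals, so everything here is an assumption: a hypothetical pattern `N_TOKEN` treats any standalone N0 to N3 token as an N stage value, and a hypothetical function `extract_n_stage` returns a null value when more than one distinct value is found, mirroring the null-value behaviour reported above. The report snippets are invented.

```python
import re
from typing import Optional

# Hypothetical naive pattern: any standalone N0-N3 token, with an
# optional p/c prefix, is treated as an N stage value.
N_TOKEN = re.compile(r"\b[pc]?N([0-3])\b")


def extract_n_stage(report_text: str) -> Optional[str]:
    """Return the single N stage value found in the text, or None when
    no value or several distinct values are found (assumed rule)."""
    values = {f"N{m.group(1)}" for m in N_TOKEN.finditer(report_text)}
    return values.pop() if len(values) == 1 else None


# Invented snippets: a lymph node label "N1"/"N2" looks like a stage.
conflicting = "Lymph node N1: reactive changes only. Pathological stage: pT2a pN0."
label_only = "Lymph node N2: no metastatic carcinoma identified."
stage_only = "Pathological stage: pT1b pN1."

result_conflicting = extract_n_stage(conflicting)  # label and stage disagree
result_label_only = extract_n_stage(label_only)    # label taken as a stage
result_stage_only = extract_n_stage(stage_only)    # genuine stage value
```

Under these assumptions, the first snippet yields a null value because the node label and the true stage conflict, the second wrongly returns ‘N2’ from a node label alone, and only the third returns the intended stage.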

Conclusions

The results revealed the impact of variability in report structure and descriptive phraseology, complicated by the co-existence of several diagnoses and test results. Based on the results, the AI engine is being further improved. Where data items can be reliably extracted to pre-populate data fields, especially those of specific interest such as staging and biomarker status, there will be an added advantage for reporting at a population level.