Breast arterial calcifications (BAC) are increasingly recognized as markers of cardiovascular disease (CVD). In this study, we manually annotated BAC areas on 3,330 mammograms, forming the foundational dataset for developing a deep learning model to automate BAC assessment. Using this annotated data, we propose a semi-supervised deep learning approach to analyze unannotated mammography images, leveraging both labeled and unlabeled data to improve BAC segmentation accuracy. Our approach combines the U-Net architecture, a well-established deep learning method for medical image segmentation, with a semi-supervised learning technique. We retrieved mammographic examinations of 6,000 women (3,000 with confirmed CVD and 3,000 without) from the screening archive to allow for a focused study. Using our trained deep learning model, we accurately detected and measured the severity of BAC in these mammograms. Additionally, we examined the time between mammogram screenings and the occurrence of CVD events. Our study indicates that both the presence and severity (grade) of BAC, identified and measured using deep learning-based automated segmentation, are crucial for primary CVD prevention. These findings underscore the value of technology in understanding the link between BAC in mammograms and cardiovascular disease, shaping future screening and prevention strategies for women's health.
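The abstract does not give implementation details, so the following is only a minimal sketch of one common way to combine labeled and unlabeled mammograms in a semi-supervised U-Net: pseudo-labeling of confident pixels. The confidence threshold, loss weighting, and the use of the segmentation_models_pytorch library are assumptions, not details from the paper.

```python
import torch
import segmentation_models_pytorch as smp

# U-Net with a ResNet encoder; single-channel mammograms, binary BAC mask.
model = smp.Unet(encoder_name="resnet34", in_channels=1, classes=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
dice_loss = smp.losses.DiceLoss(mode="binary")  # supervised segmentation loss

def train_step(labeled_batch, unlabeled_images, conf_thresh=0.9):
    """One combined step: supervised Dice loss on annotated BAC masks plus a
    pseudo-label loss on confident pixels of unannotated mammograms."""
    images, masks = labeled_batch
    loss_sup = dice_loss(model(images), masks)

    with torch.no_grad():                              # generate pseudo-masks
        probs = torch.sigmoid(model(unlabeled_images))
    pseudo = (probs > conf_thresh).float()
    keep = (probs > conf_thresh) | (probs < 1 - conf_thresh)  # confident pixels only

    loss_unsup = torch.tensor(0.0)
    if keep.any():
        logits_u = model(unlabeled_images)
        loss_unsup = torch.nn.functional.binary_cross_entropy_with_logits(
            logits_u[keep], pseudo[keep])

    loss = loss_sup + 0.5 * loss_unsup                 # unsupervised weight is illustrative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```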
Silicosis is an occupational lung disease (pneumoconiosis) that results from the inhalation of crystalline silica dust and can lead to fatal respiratory conditions. This study aims to develop an online platform and benchmark radiologists' performance in diagnosing silicosis. Fifty readers (33 radiologists and 17 radiology trainees) interpreted a test set of 15 HRCT cases. The median AUROC for all readers combined was 0.92 (0.93 for radiologists and 0.91 for trainees). No statistically significant difference in performance was observed between radiologists and trainees. Moderate agreement was recorded among readers for the correct diagnosis of silicosis (κ=0.57); however, there was considerable variability (κ<0.2) in the accurate detection of irregular opacities and ground-glass opacities. Our online platform shows promise for providing tailored education to clinicians and for facilitating future long-term observer studies and the development of educational solutions to enhance the diagnostic accuracy of silicosis detection.
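As a brief illustration of the reader-study metrics reported above, the sketch below computes a per-reader AUROC from confidence scores and Cohen's kappa for the binary silicosis call between two readers. The variable names and toy arrays are illustrative, not study data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

truth   = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = silicosis case
scores  = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])   # reader A confidence
calls_a = (scores >= 0.5).astype(int)                           # reader A's diagnosis
calls_b = np.array([1, 0, 1, 0, 1, 0, 1, 0])                    # reader B's diagnosis

print("AUROC (reader A):", roc_auc_score(truth, scores))
print("kappa (A vs B):  ", cohen_kappa_score(calls_a, calls_b))
```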
The global radiomic signature extracted from a mammogram can indicate whether malignancy appearances are present within the image. This study focuses on a set of 129 screen-detected breast malignancies that were also visible on the prior screening examinations (i.e., missed cancers based on the priors). All cancer signs on the prior examinations were actionable based on the opinion of a panel of three experienced radiologists, who retrospectively interpreted the prior examinations (knowing that a later screening round had revealed a cancer). We investigated whether the global radiomic signature could differentiate between screening rounds: the round when the cancer was detected ("identified cancers") versus the round immediately before ("missed cancers"). Both "identified cancers" and "missed cancers" were collected using a single vendor technology. A set of "normals", matched based on mammography units, was also retrieved from a screening archive. We extracted a global radiomic signature containing first- and second-order statistical features. Three classification tasks were considered: (1) "identified cancers" vs "missed cancers", (2) "identified cancers" vs "normals", and (3) "missed cancers" vs "normals". To train and validate the models, leave-one-case-out cross-validation was used. The classifier achieved an AUC of 0.66 (95%CI=0.60-0.73, P<0.05) for "missed cancers" vs "identified cancers" and an AUC of 0.65 (95%CI=0.60-0.69, P<0.05) for "normals" vs "identified cancers". However, the AUC of the classifier for differentiating "normals" from "missed cancers" was at chance level (AUC=0.53, 95%CI=0.48-0.58, P=0.23). Therefore, eliminating some of these "missed" cancers in clinical practice would be very challenging, as the global signals of the malignancy that could help with a diagnosis are, at best, weak.
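The exact feature set and classifier are not specified above, so the sketch below is one plausible reading of a "global radiomic signature": first-order intensity statistics plus GLCM-based second-order texture, evaluated with leave-one-out cross-validation. The GLCM settings, the logistic-regression classifier, and the toy images are assumptions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

def global_signature(img_u8):
    """First-order stats plus a few GLCM texture features for one image."""
    first_order = [img_u8.mean(), img_u8.std(), np.percentile(img_u8, 90)]
    glcm = graycomatrix(img_u8, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    second_order = [graycoprops(glcm, p)[0, 0]
                    for p in ("contrast", "homogeneity", "energy", "correlation")]
    return np.array(first_order + second_order)

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(30, 64, 64), dtype=np.uint8)  # toy stand-ins
labels = np.array([0, 1] * 15)                                    # e.g. missed vs identified

X = np.vstack([global_signature(im) for im in images])
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                          cv=LeaveOneOut(), method="predict_proba")[:, 1]
print("leave-one-out AUC:", roc_auc_score(labels, probs))
```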
This study aimed to review the prior mammograms of screen-detected breast cancers found on full-field digital mammograms based on independent double reading with arbitration. The prior mammograms of 607 women diagnosed with breast cancer during routine breast cancer screening were categorized into "Missed", "Prior Vis", and "Prior Invis". The prior mammograms of "Missed" and "Prior Vis" cases showed actionable and non-actionable visible cancer signs, respectively. The "Prior Invis" cases had no overt cancer signs on the prior mammograms. The percentages of cases classified into the "Missed", "Prior Vis", and "Prior Invis" categories were 25.5%, 21.7%, and 52.7%, respectively. The proportion of high-density cases showed no significant differences among the three categories (p-values>0.05). The breakdown of cases into "Missed", "Prior Vis", and "Prior Invis" categories did not differ between invasive (488) and in-situ (119) cases. In the invasive category, progesterone (p-value=0.015) and estrogen (p-value=0.007) receptor positivity and the median Ki-67 score (p-value=0.006) differed significantly among the categories, with the "Prior Invis" cases exhibiting the highest percentage of hormone receptor negativity. In the invasive cases, the percentage of cancers graded as 3 (i.e., more aggressive) was significantly higher in the "Prior Invis" category compared with both the "Missed" and "Prior Vis" categories (both p-values<0.05). The receptor status and breast cancer grade of the in-situ cases did not differ significantly among the three categories. Prior image categorization can predict the aggressiveness of breast cancer. Techniques to better interrogate prior images, as shown elsewhere, may yield important patient outcomes.
The initial impressions about the presence of abnormality (the gist signal) from some radiologists are as accurate as decisions made under normal presentation conditions, while the performance of others is only slightly better than chance level. This study investigates whether there is a subset of radiologists (i.e., "super-gisters") whose gist signal is more reliable and consistently more accurate than that of others. To measure the gist signal, images were presented for less than half a second. We collected the gist signals from thirty-nine radiologists, who assessed 160 mammograms twice with a wash-out period of one month. Readers were categorized as "super-gisters" and "others" by fitting a Gaussian mixture model to the average Area Under the Receiver Operating Characteristic curve (AUC) values of radiologists across the two rounds. The median intra-class correlation (ICC) for the "super-gisters" was 0.63 (IQR: 0.51-0.691), while the median ICC for the "others" was 0.51 (IQR: 0.42-0.59). The difference between the two groups was significant (p=0.015). The number of mammograms interpreted by the radiologist per week did not differ significantly between "super-gisters" and "others" (medians of 237 versus 200, p=0.336). A linear mixed model, which treated both case and reader as random variables, showed that only "super-gisters" can perceive the gist of the abnormal on negative prior mammograms from women who later developed breast cancer. Although the gist signal is noisy, a subset of readers has a superior capability in detecting the gist of the abnormal, and only the scores given by them are useful and reliable for predicting future breast cancer.
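A minimal sketch of the reader-grouping step, assuming one average AUC value per reader across the two rounds: a two-component Gaussian mixture is fitted and readers assigned to the higher-mean component are flagged as "super-gisters". The AUC values below are made up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One averaged AUC per reader (illustrative values, not study data).
mean_auc = np.array([0.52, 0.55, 0.58, 0.60, 0.61, 0.63,
                     0.71, 0.73, 0.74, 0.76, 0.78, 0.80]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(mean_auc)
assignments = gmm.predict(mean_auc)
super_component = np.argmax(gmm.means_.ravel())        # component with higher mean AUC
super_gisters = np.flatnonzero(assignments == super_component)
print("reader indices flagged as super-gisters:", super_gisters)
```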
This study investigated the possibility of building an end-to-end deep learning-based model for the prediction of future breast cancer from prior negative mammograms. We explored whether the probability of abnormal class membership given by the model was correlated with the gist of the abnormal as perceived by radiologists in negative prior mammograms. To build the model, an end-to-end network previously developed for breast cancer detection was fine-tuned for breast cancer prediction using a dataset containing 650 prior mammograms from women who were diagnosed with breast cancer at a subsequent screening round and images from 1,000 cancer-free women. On a set of 630 test images, the model achieved an AUC of 0.73. To extract gist responses, 17 experienced radiologists were recruited; they viewed each mammogram for 500 milliseconds and gave a score, on a scale of 0 to 100, indicating whether they would categorize the case as normal or abnormal. The image set contained 40 normal images and 40 current cancer images, along with 72 prior mammograms from women who would eventually develop breast cancer. We averaged the scores from the 17 readers to produce a single score per image. The network achieved an AUC of 0.75 for differentiating prior images from normal images. For the 72 prior mammograms, the output of the network was significantly correlated with the strength of the gist of the abnormal as perceived by experienced radiologists (Spearman's correlation=0.84, p<0.01). This finding suggests that the network learned a representation of the gist of the abnormal in prior mammograms as perceived by experienced radiologists.
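The correlation analysis itself is straightforward; the sketch below shows how the network's abnormality probability per prior mammogram could be compared against the mean gist score from the readers. The arrays are placeholders, not study data.

```python
import numpy as np
from scipy.stats import spearmanr

model_prob = np.array([0.12, 0.35, 0.40, 0.55, 0.61, 0.72, 0.80, 0.91])  # network output
mean_gist  = np.array([10.0, 22.0, 30.0, 48.0, 55.0, 60.0, 71.0, 88.0])  # averaged 0-100 gist

rho, p_value = spearmanr(model_prob, mean_gist)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```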
This study aimed to investigate the effect on reading performance of how long radiologists have been awake (“time awake”) and the number of hours they slept at night (“hours slept at night”) before a reading session. Data from 133 mammographic reading assessments were extracted from the BreastScreen Reader Assessment Strategy database. Analysis of covariance was performed to determine whether sensitivity, specificity, lesion sensitivity, ROC, and JAFROC were influenced by the time awake and the hours slept at night. The results showed that less experienced radiologists’ performance varied significantly according to the time awake: lesion sensitivity was significantly lower among radiologists who performed readings after being awake for less than 2 h (44.6%) than among those who had been awake for 8 to <10 h (71.03%; p = 0.013); likewise, the same metric was significantly lower among those who had been awake for 4 to <6 h (47.7%) than among those who had been awake for 8 to <10 h (71.03%; p = 0.002) and for 10 to <12 h (63.46%; p = 0.026). The ROC values of the less experienced radiologists also seemed to depend on the hours slept at night: values for those who had slept ≤6 h (0.72) were significantly lower than for those who had slept >6 h (0.77) (p = 0.029). The results indicate that less experienced radiologists’ performance may be affected by the time awake and hours slept the night before a reading session.
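A hedged sketch of the analysis of covariance described above: lesion sensitivity modeled against a time-awake category while adjusting for a covariate such as reader experience. The column names and toy data frame are assumptions about how the BREAST export might look, not the actual dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "lesion_sensitivity": [44.6, 47.7, 71.0, 63.5, 58.2, 66.1, 52.3, 69.9],
    "time_awake":         ["<2h", "4-6h", "8-10h", "10-12h",
                           "4-6h", "8-10h", "<2h", "10-12h"],
    "years_experience":   [3, 5, 12, 9, 4, 15, 2, 11],
})

# ANCOVA: effect of time awake on lesion sensitivity, adjusted for experience.
model = smf.ols("lesion_sensitivity ~ C(time_awake) + years_experience", data=df).fit()
print(anova_lm(model, typ=2))
```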
This study explored whether better performance under usual presentation conditions, more years of experience, and a higher volume of annual mammogram assessments make a radiologist better at perceiving the gist of the abnormal on a mammogram. Nineteen radiologists were recruited for two experiments. In the first (gist experiment), the radiologists' initial impressions were collected following a half-second image presentation, on a scale from 0 (confident normal) to 100 (confident abnormal). In the second, radiologists viewed a similar set of cases using the BreastScreen Reader Assessment Strategy platform and rated each case on a scale of 1-5. Using Spearman correlation, we explored whether the areas under the receiver operating characteristic curve (AUC) in the two experiments were correlated. Radiologists were also grouped based on variables describing their experience levels and workload, and their performance in both experiments was compared among the groups. The AUC values in the gist experiment were not significantly correlated with the AUC values in the normal reporting experiment (Spearman correlation=0.183, p-value=0.453). Radiologists' performance under normal reporting conditions was linked to the number of cases read per week (p=0.044), the number of hours per week currently spent reading mammograms (p=0.028), and the number of years spent reading mammograms (p=0.041). However, none of these variables reached a p-value<0.05 for the AUC of the gist experiment. The results suggest that further studies are needed to establish relationships between the gist response and radiologists' characteristics, since being a high-performing radiologist, a highly experienced radiologist, or a high-volume reader does not indicate superior capability in perceiving the gist of the abnormal.
This study explored the possibility of using the gist signal (radiologists' first impression of a case) to improve the performance of two recently developed deep learning-based breast cancer detection tools. We investigated whether combining the cancer class probability from the networks with the gist signal could yield higher performance in identifying malignant cases. In total, we recruited 53 radiologists, who provided an abnormality score on a scale from 0 to 100 for unilateral mammograms following a 500-millisecond presentation of each image. Twenty cancer cases, 40 benign cases, and 20 normal cases were included. Two state-of-the-art deep learning-based tools (M1 and M2) for breast cancer detection were adopted. The abnormality scores from the networks and the gist responses for each observer were fed into a support vector machine (SVM). The SVM was personalized for each radiologist and its performance was evaluated using leave-one-out cross-validation. We also considered an average reader, whose gist responses were the mean abnormality scores given by all 53 readers to each image. The mean and range of AUCs in the gist experiment were 0.643 and 0.492-0.794, respectively. The AUC values for M1 and M2 were 0.789 (0.632-0.892) and 0.814 (0.673-0.897), respectively. For the average reader, the AUCs for gist, gist+M1, and gist+M2 were 0.760 (0.617-0.862), 0.847 (0.754-0.928), and 0.897 (0.789-0.946), respectively. For 45 readers, the performance of at least one of the models improved after aggregating its output with the gist signal. The results showed that the gist signal has the potential to improve the performance of the adopted deep learning-based tools.
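A minimal sketch, assuming per-image gist scores and a deep model's cancer probability as the two input features, of the per-reader SVM fusion evaluated with leave-one-out cross-validation. All numbers and the synthetic data are illustrative; the SVM kernel and scaling are not specified in the abstract.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 40
labels = np.array([0, 1] * (n // 2))                  # 1 = malignant case
gist   = labels * 30 + rng.normal(50, 15, n)          # reader's 0-100 gist score
m_prob = labels * 0.3 + rng.uniform(0.2, 0.6, n)      # model's cancer score
X = np.column_stack([gist, m_prob])

svm = make_pipeline(StandardScaler(), SVC(probability=True))
fused = cross_val_predict(svm, X, labels, cv=LeaveOneOut(),
                          method="predict_proba")[:, 1]
print("gist alone AUC :", roc_auc_score(labels, gist))
print("fused (SVM) AUC:", roc_auc_score(labels, fused))
```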
Numerous factors contribute to radiologist image reading discrepancy and interpretive errors. However, a factor often overlooked is how interpretations might be impacted by the time of day when the image reading takes place—a factor that other disciplines have shown to be a determinant of competency. This study therefore seeks to investigate whether radiologists’ reading performances vary according to the time of the day at which the readings take place. We evaluated 197 mammographic reading assessments collected from the BreastScreen Reader Assessment Strategy (BREAST) database, which included reading timestamps and radiologists’ demographic data, and conducted an analysis of covariance to determine whether time of day influenced the radiologists’ specificity, lesion sensitivity, and jackknife alternative free-response receiver operating characteristic (JAFROC). After adjusting for radiologist experience and fellowship, we found a significant effect of the time of day of the readings on specificity but none on lesion sensitivity or JAFROC. Radiologist specificity was significantly lower in the late morning (10 am–12 pm) and late afternoon (4 pm–6 pm) than in the early morning (8 am–10 am) or early afternoon (2 pm–4 pm), indicating a higher rate of false-positive interpretations in the late morning and late afternoon. Thus, the time of day mammographic image readings take place may influence radiologists’ performances, specifically their ability to identify normal images correctly. These findings present significant implications for radiologic clinicians.
The Breast Imaging Reporting and Data System (BI-RADS) density score is a qualitative measure and thus subject to inter- and intra-radiologist variability. In this study, we investigated the possibility of fine-tuning a state-of-the-art deep neural network for (i) distinguishing fatty breasts (BI-RADS I and II) from dense ones (BI-RADS III and IV), (ii) classifying the low-risk group into BI-RADS I and II, and (iii) classifying the high-risk group into BI-RADS III and IV. To do so, 3,813 images acquired from nine mammography units and three manufacturers were used to train an Inception-V3 network architecture. The network was pre-trained on the ImageNet dataset and then trained on our dataset using transfer learning. Before feeding the images into the input layer of Inception-V3, the breast tissue was segmented from the background and the pectoral muscle was excluded from the image in the mediolateral oblique view. Images were then cropped using the breast bounding box and resized to make them compatible with the input layer of the network. The performance of the network was evaluated on a blinded test set of 150 mammograms acquired from 14 mammography units provided by six manufacturers. The reference density value for these images was obtained based on the consensus of three radiologists. The network achieved an accuracy of 92.0% in high- versus low-risk classification. For the second and third classification tasks, the overall accuracies were 85.9% and 86.1%, respectively. When the results from all three classification tasks were combined, the network achieved an accuracy of 83.33% and a Cohen's kappa of 0.775 (95% CI: 0.694-0.856) for four-point density categorization. These results suggest that a deep learning-based computerized tool can be used to provide BI-RADS density scores.
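A hedged sketch of the transfer-learning setup: an ImageNet-pretrained Inception-V3 with a new classification head for the two-class (dense versus fatty) task. The Keras API, input size, and head design are assumptions; the abstract only states that Inception-V3 with ImageNet pre-training was fine-tuned after breast segmentation, pectoral-muscle removal, cropping, and resizing.

```python
import tensorflow as tf

# ImageNet-pretrained backbone without its original classification head.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = True                                  # fine-tune rather than freeze

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # dense vs fatty
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Inputs are assumed to be breast-segmented, pectoral-removed crops resized to
# 299x299 and replicated to three channels before reaching this model.
```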
Our objective was to analyze the agreement between the organ dose estimated by different digital mammography units and the dose calculated from clinical data. Digital Imaging and Communications in Medicine (DICOM) header information was extracted from 52,405 anonymized mammograms. Data were filtered to include images with no breast implants, breast thicknesses of 20 to 110 mm, and complete exposure and quality assurance data. Mean glandular dose was calculated using the methods of Dance et al., Wu et al., and Boone et al. Bland-Altman analysis and regression were used to study the agreement and correlation between organ and calculated doses. Bland-Altman analysis showed statistically significant bias between organ and calculated doses. The bias differed across unit makes and models: Philips had the lowest bias, overestimating the Dance method by 0.03 mGy, while General Electric had the highest bias, overestimating the Wu method by 0.20 mGy; the Hologic organ dose underestimated the Boone method by 0.07 mGy, and the Fujifilm organ dose underestimated the Dance method by 0.09 mGy. Organ dose was found to disagree with our calculated dose, yet organ dose is potentially beneficial for rapid dose audits. Conclusions drawn from the organ dose alone carry a risk of over- or underestimating the calculated dose to the patient, and this error should be considered in any reported results.
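For context, the Dance et al. formulation referenced above and in the following abstract computes mean glandular dose as the product of the incident air kerma and tabulated conversion factors, MGD = K·g·c·s. The sketch below encodes that relation; the specific factor values are placeholders, since in practice g, c, and s are interpolated from the published tables using breast thickness, glandularity, and the X-ray spectrum (HVL, target/filter).

```python
def mean_glandular_dose(incident_air_kerma_mgy, g, c, s):
    """Dance-method MGD in mGy from incident air kerma and conversion factors."""
    return incident_air_kerma_mgy * g * c * s

# Illustrative values only; real factors come from the Dance et al. tables.
print(mean_glandular_dose(5.0, g=0.35, c=1.0, s=1.05))
```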
This study aims to analyze the agreement between the mean glandular dose estimated by the mammography unit (organ dose) and the mean glandular dose calculated using the method published by Dance et al. (calculated dose). Anonymized digital mammograms from 50 BreastScreen NSW centers were downloaded, and the exposure information required for the dose calculation was extracted from the DICOM header along with the organ dose estimated by the system. Data from annual quality assurance tests for the included centers were collected and used to calculate the mean glandular dose for each mammogram. Bland-Altman analysis and a two-tailed paired t-test were used to study the agreement between calculated and organ dose and the significance of any differences. A total of 27,869 dose points from 40 centers were included in the study; the mean calculated dose and mean organ dose (± standard deviation) were 1.47 (±0.66) and 1.38 (±0.56) mGy, respectively. Bland-Altman analysis showed a statistically significant bias of 0.09 mGy (t = 69.25; p<0.0001), with 95% limits of agreement between calculated and organ doses ranging from -0.34 to 0.52 mGy, indicating a small yet highly significant difference between the two means. The use of organ dose for dose audits therefore carries a risk of over- or underestimating the calculated dose; hence, further work is needed to identify the causes of the differences between organ and calculated doses and to generate a correction factor for the organ dose.
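The agreement analysis used in the two dose studies above can be summarized as a Bland-Altman bias with 95% limits of agreement plus a paired t-test between calculated and organ dose. The sketch below shows those calculations; the dose values are placeholders, not study data.

```python
import numpy as np
from scipy.stats import ttest_rel

calculated = np.array([1.41, 1.62, 1.30, 1.75, 1.52, 1.48, 1.66, 1.39])  # mGy
organ      = np.array([1.35, 1.50, 1.28, 1.60, 1.45, 1.40, 1.55, 1.33])  # mGy

diff = calculated - organ
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))   # 95% limits
t, p = ttest_rel(calculated, organ)

print(f"bias = {bias:.2f} mGy, limits of agreement = {loa[0]:.2f} to {loa[1]:.2f} mGy")
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```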