
CBIS-DDSM Mammography Training Data

Wednesday, June 6, 2018

I am continuing to work with the CBIS-DDSM datasets and recently decided to take a new direction with the training data. Previously I had been locally segmenting the raw scans into images of varying sizes and writing those images to tfrecords to use as training data. I started by classifying the images by pathology with categorical labels, and while I got decent results using this approach, the models performed terribly on images from different datasets and on full-size images. I suspected the model was using features of the images that were not related to the actual ROIs to make its predictions, such as the amount of contrast or presence of extremely high pixel values.

To address this, I started using the masks as labels and training the model to segment the images into normal and ROI regions. This had the added advantage of allowing me to exclude images from the DDSM dataset and use only CBIS-DDSM images, which eliminated the features I believed the previous models had been relying on, as the DDSM and CBIS-DDSM datasets had substantially different variances, minimums, maximums and means. The disadvantage of this approach was that the dataset doubled in size, because each label is now the same size as its image.

I started with a dataset of 320x320 images; however, models trained on this dataset often had trouble with high-contrast images and with images that had bright patches running off the edge, misclassifying the bright patches as positive. To attempt to address this, I started training the model on 320x320 images and then switched to another dataset of 640x640 images after training for 50 or so epochs.

The dataset of 640x640 images only had 13,000 training examples in it, about 1/3 the number of examples in the 320x320 dataset, but was still larger because each example and label has four times as many pixels as its 320x320 counterpart. I considered making another dataset with either more or larger images, but saw that this process could continue indefinitely, as I would have to keep creating new datasets of larger and larger size.

Instead I decided to create one new dataset which could be used indefinitely, for all purposes. To do this I loaded each image in the CBIS-DDSM dataset into Python. While the JPEGs are RGB, the images are grayscale, so I only kept one channel of each image. Some images have multiple masks, and rather than have multiple versions of each image with different masks, which could confuse the model, I combined all masks for each image into one mask and then added that as the second channel of each image. In order to be able to save the array as an image I added a third channel of all 0s. Each new image was then saved as a PNG.
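The packing step described above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the code actually used to build the dataset: the function name `pack_image_and_masks`, the use of 255 to mark ROI pixels, and raising an error on mismatched mask sizes are all assumptions made for the example.

```python
import numpy as np


def pack_image_and_masks(rgb_image, masks):
    """Pack a scan and its masks into one H x W x 3 uint8 array.

    Channel 0: the grayscale scan, channel 1: the union of all masks,
    channel 2: zeros (padding so the array can be saved as an RGB image).
    """
    # The RGB channels of a grayscale scan are identical, so keep the first.
    gray = rgb_image[:, :, 0]

    # Merge every mask into a single binary mask with a pixel-wise OR.
    combined = np.zeros(gray.shape, dtype=bool)
    for mask in masks:
        if mask.shape != gray.shape:
            raise ValueError("mask size does not match image size")
        combined |= mask > 0

    # 255 marks ROI pixels; this encoding is an assumption of the sketch.
    mask_channel = combined.astype(np.uint8) * 255
    padding = np.zeros_like(gray, dtype=np.uint8)
    return np.dstack([gray.astype(np.uint8), mask_channel, padding])
```

The returned array could then be written to disk with any image library that accepts an H x W x 3 uint8 array, for example Pillow's `Image.fromarray(packed).save(...)`.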

The resulting dataset is about 12GB, about four times the size of the largest tfrecords dataset, but the entirety of the CBIS-DDSM dataset (minus a few images which had masks of incorrect sizes and were discarded) is now represented. Now, in my model, I load each full image, take a random crop of it, and use that as training data. Since the mask is part of the image, I can use TensorFlow's random crop function to crop the full image and then separate the channels into the training example and its label.
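To make the crop-then-split idea concrete, here is a small NumPy stand-in for what TensorFlow's random crop does in the model. It is only a sketch: the function name `random_crop_and_split`, the square crop, and the assumption that any non-zero value in the second channel means ROI are illustrative choices, not details taken from the actual graph.

```python
import numpy as np


def random_crop_and_split(packed, crop_size, rng=None):
    """Take one random square crop and split it into (example, label)."""
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = packed.shape
    if crop_size > height or crop_size > width:
        raise ValueError("crop is larger than the image")

    # Pick the top-left corner so the crop stays inside the image.
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))

    # Cropping all channels at once keeps the scan and mask aligned.
    crop = packed[top:top + crop_size, left:left + crop_size, :]

    example = crop[:, :, 0]                       # grayscale scan
    label = (crop[:, :, 1] > 0).astype(np.uint8)  # binary ROI mask
    return example, label
```

Because the scan and the mask travel through the crop as one array, there is no need to keep two separate random offsets in sync, and `crop_size` can be changed freely between runs.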

This not only greatly increases the effective size of the training set, since each full image can yield many different crops, but because my model is fully convolutional I can also easily change the crop size without having to create a new dataset.

The major problem with this approach is that the mean of the labels is very low, around 0.015, meaning that only about 1.5% of the pixels have a positive label and the rest are negative. The previous dataset had a mean of 0.05. This will be addressed by raising the cross entropy weight from 20 to 75 so that the model doesn't just predict everything as negative. When creating the images I had trimmed as much background as possible from them to avoid having a large number of training images of pure black, but the random cropping still produces a large number of images with little to no actual content.
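For reference, the ratio of negative to positive pixels is about 19:1 at a label mean of 0.05 and about 66:1 at 0.015, which is in line with weights of 20 and 75. The sketch below shows how such a weight changes the loss. It assumes the "cross entropy weight" is a multiplier on the positive-pixel term, in the style of the `pos_weight` argument of `tf.nn.weighted_cross_entropy_with_logits`; the function name and the toy data are made up for illustration.

```python
import numpy as np


def weighted_cross_entropy(labels, logits, pos_weight):
    """Mean sigmoid cross entropy with positive pixels scaled by pos_weight."""
    labels = np.asarray(labels, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    # -log(sigmoid(x)) and -log(1 - sigmoid(x)), computed stably.
    loss_pos = np.logaddexp(0.0, -logits)
    loss_neg = np.logaddexp(0.0, logits)
    per_pixel = pos_weight * labels * loss_pos + (1.0 - labels) * loss_neg
    return float(per_pixel.mean())


# Toy label map with 15 positive pixels in 1000 (mean 0.015, as in the text).
labels = np.zeros(1000)
labels[:15] = 1.0

# A model that confidently predicts "negative" everywhere.
all_negative = np.full(1000, -5.0)

for weight in (1, 20, 75):
    print(weight, round(weighted_cross_entropy(labels, all_negative, weight), 3))
```

The larger the weight, the more the all-negative prediction is penalized for the few positive pixels it misses, which is what discourages the model from simply predicting everything as negative.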

At the moment I am uploading the data to S3, which should take another couple of days. Once this is done I will attempt to train on this new dataset and see if the empty images cause major problems.

Labels: coding, python, machine_learning, mammography

Update on CBIS-DDSM Training Images

Wednesday, June 6, 2018

Even though I only have 1/5 of the images uploaded so far, I decided to do some tests to see if this method would work. It does, but it took quite a bit of tweaking to get performance to reasonable levels.

At first I just plugged the new dataset into the old graph, and this worked but was incredibly slow, with the GPU sitting idle most of the time. I tried quite a few different methods to relieve the pre-processing bottleneck, but the solution was simpler than I had thought it would be.

The biggest factor was increasing the number of threads (the num_threads argument) in tf.train.batch from the default of 1. This one change made a huge difference, cutting the training time down to about 1/4 of what it had been.

I also experimented with moving some pre-processing operations around, including resizing the images either individually as they are loaded or after they have been batched. This had negligible effects, but resizing them individually was slightly faster than doing it as a batch. In general I found that the more pre-processing operations I moved to the queue (and the CPU), the better the performance.

This version still trains at about 1/2 the speed the tfrecords version did, which is a big difference, but the size of the training set has increased by orders of magnitude so I guess I can live with it. 

The code is available on my GitHub.

Labels: python, machine_learning, tensorflow, mammography

DDSM Mammography

Wednesday, May 23, 2018

For a course I was taking at EPFL I was working on classifying images from the DDSM dataset with ConvNets. I had some success, although not as much as I would have liked, and I posted an edited version of my report on Medium.

The source code used to create and train the models is available in this GitHub repo, and the code used to create the data and do EDA is available here.

Although the course is over I am still working on this project, attempting to fix some of the issues that came up during the first stage.

Labels: python, machine_learning, mammography, convnets

I was training a ConvNet and everything was working fine during training. But when I evaluated the model on the validation data I was getting NaN for the cross entropy. I thought it was the cross entropy attempting to take the log of 0 and added a small epsilon value of 1e-10 to the logits to address that. I thought that would fix the problem but it did not.

Further investigation indicated that the NaNs were being introduced somewhere early in the network, in one of the convolutional layers. I checked the validation and training data to make sure there wasn't some fundamental difference between the two, thinking that maybe one was being pre-processed differently than the other, but that was not the case.

In my graph I am using tf.metrics and gathering all of the update ops into one op to be executed during training with:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

Also gathered into this op were the updates to the batch norms. I had done this many times before with no problems at all, so I never thought this could be a factor. But when I removed the extra update op from the evaluation code, the problem went away. Including the ops generated to update my metrics individually caused no problems.

I am not sure what the issue actually was, but I assume it has something to do with the batch normalization, or maybe there is another op created somewhere in my graph that caused this issue.

Update: I had been restoring the weights from a pre-trained model, and I think the restored batch norms caused the problem. NOT restoring the batch norms when loading the weights seems to solve this problem completely. Otherwise the issue still occurred sporadically.
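One way to skip the batch norm variables when restoring is to filter the variable list by name before handing it to the saver. The snippet below is only a sketch of that filtering step in plain Python: the helper name, the keywords, and the example variable names are all hypothetical, and the real names depend on how the layers were created in the graph.

```python
def variables_to_restore(variable_names, skip_keywords=("batch_norm", "BatchNorm")):
    """Return the names that contain none of the skip keywords."""
    return [
        name for name in variable_names
        if not any(keyword in name for keyword in skip_keywords)
    ]


# Hypothetical variable names, for illustration only.
names = [
    "conv1/kernel:0",
    "conv1/batch_norm/gamma:0",
    "conv1/batch_norm/moving_mean:0",
    "fc1/bias:0",
]
print(variables_to_restore(names))
```

In TensorFlow 1.x the same filter could be applied to the variable objects through their `name` attribute, and the resulting list passed as `var_list` to `tf.train.Saver` so that only those variables are restored.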

Labels: python, machine_learning, tensorflow

I have been working on a project to detect abnormalities in mammograms. I have been training it on Google Cloud instances with Nvidia Tesla K80 GPUs, which allow a model to be trained in days rather than weeks or months. However, when I tried to do online data augmentation it became a huge bottleneck, because the augmentation was being done on the CPU.

I had been using tf.image.random_flip_left_right and tf.image.random_flip_up_down but since those operations were run on the CPU the training slowed down to a crawl as the GPU sat idle waiting for the queue to be filled.

I found this post on Medium, Data Augmentation on GPU in Tensorflow, which uses tf.contrib.image instead of tf.image. tf.contrib.image is written to run on the GPU, so using this code allows the data augmentation to be performed on the GPU instead of the CPU and thus eliminates the bottleneck.

This has been a life saver for me. Adding it to my graph allows me to train for longer without overfitting and thus get better results.

Labels: python, machine_learning, tensorflow
