
GAN Hacks

Saturday, September 21, 2019

I've now been trying to train my GANs for quite a while and still haven't been too successful, but I have learned some tricks. I found this excellent article a while ago and I didn't really understand it completely at first, but after having tried a lot of its tricks I understand them now. Here are my thoughts and some additional tricks I have used:

  1. Item 5 from the article - use convolutional layers with strides of 2 rather than pools: one of the biggest problems in training GANs is maintaining the gradients. Since the gradients for the generator come from the discriminator, vanishing or exploding gradients are a huge problem and need to be avoided at all costs. Max pooling passes a gradient back to only one input in each pooling window (the maximum) and zeroes the rest, so convolutional layers with a stride of 2 are a better way to downsample. Average pooling will also work, but I've found that stride 2 layers work better.
  2. With apologies to Frank Herbert, "the gradients must flow." I've had luck using dense convnets as the discriminator because of the improved gradient flow they provide.
  3. Item 6 - soft and noisy labels - this has helped a LOT. I haven't tried using random labels, but I have had luck using labels that are slightly off from 0 or 1, like 0.1 or 0.99. This keeps the discriminator from becoming too confident in its predictions and keeps the gradient to the generator from exploding. I've learned that when training GANs, exploding gradients are just as bad as vanishing gradients in that the generator learns nothing.
  4. The article also suggests occasionally flipping the labels, which I'm not sure exactly how to interpret. In practice, if the discriminator gets too strong I will occasionally flip the labels for a few training steps to confuse it a bit and then flip them back. This seems to help the generator catch up a bit.
  5. One other thing I have found is that using smaller batch sizes seems to work better. When I started using the V100 GPUs I immediately increased my batch size to the max the GPU could handle, but the generator did not learn well at all. Reducing the batch size helps a lot, possibly by introducing some additional regularization to the discriminator. 
  6. Dropout - the article mentions using dropout in the generator, which I haven't tried. I do use dropout in the discriminator, which I wasn't sure about since it will reduce the gradients, but it does help slow down the discriminator which seems to help training.
  7. Item 11 - I have tried to do this and wasted a lot of time. If your training has collapsed it is not likely you will be able to uncollapse it by training one network more than the other. I would suggest that rather than training one network more than the other you make sure that the networks are roughly equally matched from the start. Training the generator more, for example, tends to lead to mode collapse; training the discriminator more tends to lead to the gradients exploding or vanishing.
  8. Item 12 - I haven't tried this one yet but it is interesting. I've heard a lot about using auxiliary outputs to provide regularization and if I had labelled images I would definitely try this one. In fact, I may try to label my images somehow in order to do so.
  9. One thing that was not mentioned in the article, but which I have found very helpful, is using separate batches for real and fake images when training the discriminator. At first I thought this was a bizarre idea and wasn't sure how it would help, but it really does.
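
A minimal PyTorch sketch of three of the tricks above: downsampling with stride-2 convolutions instead of pooling, dropout in the discriminator, and soft labels with an optional flip. The layer sizes, the dropout rate, the label ranges and the names `TinyDiscriminator` and `make_labels` are illustrative assumptions of mine, not values from the article or from my actual networks.

```python
import torch
import torch.nn as nn


class TinyDiscriminator(nn.Module):
    """Small discriminator that downsamples with stride-2 convolutions, no pooling."""

    def __init__(self, in_channels=3, width=16, p_drop=0.3):
        super().__init__()
        self.features = nn.Sequential(
            # each stride-2 convolution halves the spatial size
            nn.Conv2d(in_channels, width, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Dropout2d(p_drop),  # dropout slows the discriminator down a little
            nn.Conv2d(width, width * 2, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Dropout2d(p_drop),
        )
        self.classifier = nn.Linear(width * 2, 1)

    def forward(self, x):
        h = self.features(x)
        h = h.mean(dim=(2, 3))  # average over the remaining spatial positions
        return self.classifier(h).squeeze(1)  # one raw logit per image


def make_labels(batch_size, real, flip=False, low=0.0, high=0.1):
    """Soft labels: real near 1, fake near 0. flip=True swaps the two (assumed ranges)."""
    noise = low + (high - low) * torch.rand(batch_size)
    target_is_real = real != flip
    return 1.0 - noise if target_is_real else noise


disc = TinyDiscriminator()
images = torch.randn(8, 3, 32, 32)  # random stand-in for a batch of real images
logits = disc(images)
real_labels = make_labels(8, real=True)
fake_labels = make_labels(8, real=False)
flipped_labels = make_labels(8, real=True, flip=True)
loss = nn.functional.binary_cross_entropy_with_logits(logits, real_labels)
```

`binary_cross_entropy_with_logits` accepts targets anywhere in [0, 1], so the soft labels can be passed in directly.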

Some additional tips on how to construct a GAN:

  1. Start small - when I started playing with GANs I immediately made two large, deep convnets and tried to train them and they learned nothing. I recommend you start with a very small network, train it enough to make sure it is learning something, then add a layer and repeat. I still don't know what the problem with my original networks was, or if I just wasn't patient enough, but it's a lot easier to find problems if you add one layer at a time (or one block at a time) than if you start off with a 100 layer network.
  2. Keep things simple - training a GAN involves making sure that two networks are learning at roughly the same pace. It's a delicate dance, and I would recommend not throwing too many bells and whistles into it. As in the previous tip, make sure everything is working properly before you add some newfangled loss function or dynamic loss weighting or anything like that.

Labels: machine_learning, gan

K80 vs V100

Monday, September 16, 2019

Discovering how much cheaper spot EC2 instances were than normal on-demand instances gave me the courage to try out a faster GPU. I had been using K80s which are painfully slow, but very cheap. The spot price for the V100 is about the same as the on-demand price of the K80s, so using those with spot instances won't be any cheaper, but it won't be more expensive either.

I didn't think the V100s were such great GPUs, so I wasn't expecting it to be worth the extra cost. How wrong I was. Training the network I am currently playing with on a K80 with a batch size of 48 took about 8-12 hours per epoch. Training it on a V100 with a batch size of 64 is looking like it's going to take about 2 hours. With the V100s priced at about 4x the K80s, that works out to anywhere from the same cost per epoch to a little bit cheaper, depending on exactly how long an epoch took on the K80.
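
The arithmetic behind that comparison, as a small sketch. Prices are in relative units (one K80-hour = 1.0), which is an assumption for illustration only. These are not actual AWS prices.

```python
def cost_per_epoch(hours_per_epoch, price_per_hour):
    """Cost of one epoch = wall-clock hours times hourly price."""
    return hours_per_epoch * price_per_hour


K80_PRICE = 1.0               # relative unit: one K80-hour
V100_PRICE = 4.0 * K80_PRICE  # "about 4x the K80s"

k80_best = cost_per_epoch(8, K80_PRICE)    # fast end of the 8-12 hour range
k80_worst = cost_per_epoch(12, K80_PRICE)  # slow end of the 8-12 hour range
v100 = cost_per_epoch(2, V100_PRICE)

print(k80_best, k80_worst, v100)  # 8.0 12.0 8.0
```

So a V100 epoch costs the same as the best-case K80 epoch and about a third less than the worst case.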

When you factor in the value of not having to wait an entire day to see the results of an epoch, this is a no-brainer as far as I'm concerned. Unfortunately, I'm sure my AWS bill is going to increase substantially. That's how they get you... Once you have a taste of HPC they know you'll be back for more...

Labels: machine_learning, ec2, aws, gpu

AWS EC2 Spot Instances

Thursday, September 12, 2019

My major complaint about using EC2 GPU instances was the cost: it gets very expensive to run a GPU instance for more than a few hours. Last week I was wondering why I wasn't using spot instances, so I set up a request and I've been running it for a few days now. It is about 1/4 the price of a normal instance, so it's not much more expensive than renting a CPU-only on-demand instance. I was hoping to get a better GPU than the K80, but I ended up settling for the K80 because it was more readily available. Next time I may request a better one and see what happens.

The downside of spot instances is that they will be terminated if the capacity is needed for an on-demand instance, and my instance was terminated the other night. But then I spun up a new one in the morning and that one has been running for a few days now. I can't believe I haven't used these before.

Labels: machine_learning, aws

It is difficult to play around with the structure for the GAN I am working on in Colab since it trains so slowly. I can usually get maybe 2 or 3 epochs in a day, which means that I need to wait a day before evaluating each change I make. I decided to rent a GPU in the cloud for a few days so I could train it a bit more quickly and figure out what works and what doesn't work before going back to Colab.

I already have a Google Cloud GPU instance I was using for my work with mammography, but it was running CUDA 9.0 which apparently is not supported by PyTorch out of the box. I tried to upgrade CUDA to 10, but I think I ended up just making things worse. Rather than spend a whole day trying to fix the GCS instance, and since I have some AWS credits, I decided to try to use an AWS Deep Learning AMI instance, which already has everything configured.

It was incredibly easy to get set up. It comes pre-configured with virtual environments for different deep learning frameworks and packages, so there is no need to install CUDA or drivers or anything like that. That is a huge advantage: back when I was setting up the GCS instance it took me a few days to get everything installed and working. One thing I quickly noticed was that the default disk size was not even close to big enough. After downloading a few data files I was already running out of disk space, but it was very easy to increase the disk size.

Then all I had to do was activate the pytorch environment, launch a notebook and everything was running smoothly. I did run into a few minor issues, none of which were difficult to resolve:

  • If I launch tmux from within a virtual environment it launches a session that does NOT have the environment activated. Then if I activate the environment from within tmux it doesn't have access to the proper modules. This was resolved by launching tmux from outside of the venv, and then activating the venv from inside tmux.
  • In my notebook it didn't seem to have access to pytorch, but this was because I hadn't selected the proper kernel from the Kernel -> Change kernel menu. I wasn't even aware that one could select the kernel like that.
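
Both issues come down to which Python interpreter is actually running. A quick check along these lines, run in a notebook cell or inside tmux, shows the interpreter path and whether a module can be found from it. The helper name `describe_environment` is hypothetical; it only uses the standard library.

```python
import importlib.util
import sys


def describe_environment(module_name="torch"):
    """Report which interpreter is running and whether it can see module_name."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
    }


info = describe_environment("torch")
print(info)
```

If `module_found` is `False`, or the interpreter path does not point into the pytorch environment, the wrong kernel or environment is active.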

I used to prefer GCS to AWS because it was more configurable and easier to use. While AWS does have a bit of a learning curve, they really have thought of and provided for just about every possible contingency. We use AWS at my work, and it really is very impressive. I still like the simplicity of GCS, but even simple things like AMIs make such a huge difference in set-up time that I think I'll be using AWS more often now.

Labels: machine_learning, aws, pytorch

I had been trying to train my autoencoder with a GAN component on and off for a couple of months and it just didn't seem to be working very well. I thought that maybe the autoencoder and the discriminator errors were somehow cancelling each other out or something. Just for the hell of it I decided to try to use the discriminator to optimize a reconstructed image to look real, just to see what the result would be. Instead of optimizing the weights, I created a Variable of the input and optimized that instead. To my surprise I ended up with weird splotches of primary colors against a white background. It actually made the image look less and less real rather than more. After seeing that I decided that there must be some major problem with my code, so I went through it in greater detail.
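
A rough sketch of that experiment: optimize the pixels of an image, not any weights, so that the discriminator scores it as real. `Variable` is deprecated in current PyTorch, so this uses a plain tensor with `requires_grad=True`. The function name, the stand-in linear discriminator, the learning rate and the step count are all placeholders of mine, and the sketch assumes the discriminator returns one logit per image.

```python
import torch
import torch.nn as nn


def optimize_image(discriminator, image, steps=50, lr=0.05):
    """Adjust the pixels of `image` so the discriminator rates it as more real."""
    x = image.clone().detach().requires_grad_(True)
    # only x is handed to the optimizer, so the discriminator weights never change
    optimizer = torch.optim.Adam([x], lr=lr)
    target = torch.ones(x.shape[0])  # label 1 = "real"
    for _ in range(steps):
        optimizer.zero_grad()
        logits = discriminator(x).view(-1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
        loss.backward()
        optimizer.step()
    return x.detach()


torch.manual_seed(0)
disc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))  # stand-in discriminator
disc.eval()
weights_before = disc[1].weight.detach().clone()

start = torch.rand(1, 3, 8, 8)  # stand-in for a reconstructed image
score_before = disc(start).item()
result = optimize_image(disc, start)
score_after = disc(result).item()
```

With a real, trained discriminator the interesting part is looking at `result` as an image, which is where the splotches showed up for me.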

I decided to train all three networks from scratch (the three being the encoder, the decoder and the discriminator) to see what would happen. I was surprised to see that the generator did not seem to be learning ANYTHING and neither did the discriminator. I found a tutorial on creating a GAN in PyTorch and I went through the training code to see how it differed from mine. 

I had written my code to optimize for speed. Training the autoencoder without the GAN already took about 4 hours per epoch on a (free) K80 on Colab and I didn't want to slow that down much more, so I tried to minimize the number of times data had to be passed through the networks. The tutorial did not do that. First it ran a batch of real data through the discriminator and computed the gradients, but did NOT update the weights yet. Then it used the generator to generate a batch of faked data, passed that through the discriminator, computed the gradients, added them to the gradients from the first batch and THEN applied the update. Then it ran the same batch of faked data through the discriminator again, and used that to update the generator. This was different from my code in several major ways:

  1. I was using a single batch containing half real images and half reconstructed images to train my discriminator.
  2. I was passing data through each network only once per batch.
  3. I wasn't detaching the reconstructed data before passing them through the discriminator.
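
The tutorial-style update can be sketched roughly as below. This is a plain GAN step under my own assumptions, not the tutorial's code and not my autoencoder setup (there is no reconstruction loss here); `gan_step` and the tiny linear networks are placeholders.

```python
import torch
import torch.nn as nn

bce = nn.functional.binary_cross_entropy_with_logits


def gan_step(generator, discriminator, opt_g, opt_d, real, noise):
    """One training step with separate real and fake discriminator batches."""
    batch = real.shape[0]
    ones, zeros = torch.ones(batch), torch.zeros(batch)

    # Discriminator: real batch first, gradients are stored but no update yet
    opt_d.zero_grad()
    loss_real = bce(discriminator(real).view(-1), ones)
    loss_real.backward()
    # Fake batch: detach so this loss does not send gradients into the generator
    fake = generator(noise)
    loss_fake = bce(discriminator(fake.detach()).view(-1), zeros)
    loss_fake.backward()  # adds to the gradients from the real batch
    opt_d.step()

    # Generator: same fake batch through the updated discriminator, not detached
    opt_g.zero_grad()
    loss_g = bce(discriminator(fake).view(-1), ones)
    loss_g.backward()
    opt_g.step()
    return (loss_real + loss_fake).item(), loss_g.item()


torch.manual_seed(0)
gen = nn.Linear(4, 16)   # stand-in generator: noise -> flat "image"
disc = nn.Linear(16, 1)  # stand-in discriminator: flat "image" -> logit
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
gen_before = gen.weight.detach().clone()
disc_before = disc.weight.detach().clone()

d_loss, g_loss = gan_step(gen, disc, opt_g, opt_d, torch.randn(8, 16), torch.randn(8, 4))
```

The generator update also leaves gradients on the discriminator, but they are cleared by `opt_d.zero_grad()` at the start of the next step.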

After updating my code to bring it more in line with the tutorial, both networks began to learn. I think the major change was detaching the reconstructed images before putting them through the discriminator. However, I noticed a few strange things regarding the discriminator batches:

  • If I used a single batch containing both real and reconstructed images to train the discriminator it learned very quickly: its loss approached 0 almost immediately, and the discriminator loss component of the generator overwhelmed the autoencoder loss, which sort of fluctuated but didn't decrease very much.
  • If I trained using two batches, each containing only images for a single label, its accuracy hovered around 50% and the autoencoder loss decreased rapidly.

I read in a couple of places that using separate batches was a trick to make GANs train better, but no one really had an explanation for why this worked. What I am currently doing is using separate batches most of the time, but every n batches I use a single mixed batch to encourage the discriminator to learn a bit more. I've tested values for n of 8, 16, 32 and 64. Most of those seemed to result in the worst of both worlds, and nothing really seemed to improve, but with n = 64 the autoencoder loss is again decreasing, although slowly, and the discriminator accuracy is hovering around 52% rather than the 49-50% it was at using all separate batches.
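
The schedule itself is just a counter check. A sketch, with the hypothetical helper `use_mixed_batch` and a small n so the pattern is visible:

```python
def use_mixed_batch(step, n=64):
    """True on every n-th discriminator step, counting steps from 1."""
    return step % n == 0


# With n=4, steps 4 and 8 use one mixed batch; the rest use separate batches.
schedule = ["mixed" if use_mixed_batch(s, n=4) else "separate" for s in range(1, 9)]
print(schedule)
```

In the training loop this decides whether the real and reconstructed images are concatenated into one discriminator batch or passed through separately.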

To me using separate batches doesn't intuitively make sense. I don't see how the network can really learn to differentiate between classes when it only sees one class at a time. Of course the gradients are then added, and the differences should cancel out, with what's left indicating how to differentiate the classes; but to me it seems much more efficient to learn from mixed batches. One would never consider training a network on, say, the CIFAR dataset with each batch consisting exclusively of a single class. Maybe that's the point, to slow down the discriminator's learning enough for the generator to keep up? Anyway, I will continue to experiment and see what works and what doesn't.

Labels: machine_learning, pytorch, autoencoders, gan
