You need to use another kind of network. Basically a so called Fast-R network will reuse an existing network (Alexnet for instance) to detect multiple objects into an image. You will also need another dataset with many images and their associated objects to be detected:ROI (bounding boxes for all the detected objects) and object classe. The doc has an example about that.
Actually, you may use a pretrained YOLO network that works well for many applications (and it is already trained, so no long and difficult training).