UFO Dataset Enrichment: OCR Extraction and Image Captioning
Why the study?
Enrich the UFO sightings dataset by:
- Extracting UFO sightings from PDF documents.
- Extracting UFO sightings from images.
What did I do?
- Merged two new datasets, one of PDFs and one of images, into the existing combined dataset:
  - PDF documents of UFO sightings from the United Kingdom.
    Source - http://www.theblackvault.com/documentarchive/united-kingdom-ufo-documents/
  - 3,500 images of UFO sightings reported by the public, scraped from ufostalker.com.
    Source - http://ufostalker.com/
- Extracted features from PDFs by building an OCR pipeline with the Tesseract toolkit to clean and featurize the extracted text.
- Extracted features from images by identifying objects with the Inceptionv4 model.
- Used Apache Tika to extract metadata features from images, such as creation date and geolocation.
- Generated captions for identified objects using a re-trained Inceptionv4 and the Neural Image Caption Generator.
- Re-trained the last layer of the Inceptionv4 model to generate better captions.
- Used multiple NER taggers on the caption, metadata, and description fields to featurize the dataset.
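The OCR step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function and regex choices are assumptions, and `pytesseract` (a common Python wrapper for Tesseract) stands in for whatever interface was actually used.

```python
import re

def ocr_pdf_page(image_path):
    """Run Tesseract OCR on one rendered PDF page (needs pytesseract + the tesseract binary)."""
    import pytesseract                      # third-party wrapper around the Tesseract CLI
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))

def clean_ocr_text(raw):
    """Normalise noisy OCR output before featurization."""
    text = raw.replace("\x0c", " ")                     # strip form-feed page breaks
    text = re.sub(r"[^A-Za-z0-9.,:/\- ]+", " ", text)   # drop OCR junk characters
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

def featurize(text):
    """Pull a few simple features (dates, times) out of the cleaned text."""
    dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text)
    times = re.findall(r"\b\d{1,2}:\d{2}\b", text)
    return {"dates": dates, "times": times, "length": len(text)}
```

For example, `featurize(clean_ocr_text("Sighting on 12/05/1997 at 21:30\x0c###"))` pulls out the date and time while discarding the OCR noise.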
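The Tika metadata step might look like the sketch below, assuming the `tika-python` client library (which needs a Java runtime). The metadata key names (`Creation-Date`, `geo:lat`, `geo:long`) follow Tika's usual EXIF mapping but should be verified against real output; the helper names are invented for illustration.

```python
def extract_image_metadata(path):
    """Extract metadata from an image via Apache Tika (requires tika-python and Java)."""
    from tika import parser                  # third-party; starts a local Tika server on first use
    return parser.from_file(path).get("metadata", {})

def pick_enrichment_fields(metadata):
    """Keep only the fields this study used: creation date and GPS coordinates."""
    wanted = {
        "Creation-Date": "created",   # assumed Tika key for EXIF creation date
        "geo:lat": "latitude",        # assumed Tika key for EXIF GPS latitude
        "geo:long": "longitude",      # assumed Tika key for EXIF GPS longitude
    }
    return {new: metadata[old] for old, new in wanted.items() if old in metadata}
```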
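When several NER taggers run over the same caption, metadata, and description fields, their outputs have to be combined into one feature set. A minimal sketch of that merge step, assuming each tagger returns (entity text, label) pairs; the function name and the example labels (spaCy-style `GPE`, Stanford-style `LOCATION`) are illustrative:

```python
def merge_ner_tags(*tagger_outputs):
    """Union entity labels from multiple NER taggers, keyed by entity text."""
    merged = {}
    for tags in tagger_outputs:              # one iterable of (entity, label) pairs per tagger
        for entity, label in tags:
            merged.setdefault(entity, set()).add(label)
    return merged
```

Keeping the union of labels lets downstream features record when taggers disagree on an entity's type.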
What did I learn?
- The importance of manual data cleaning when the data is very noisy, as it was in the PDFs.
- Inceptionv4 couldn't produce good results for images shot with mostly sky or open field in the background.
Technologies Used:
- Python
- Apache Tika
- Java 8
- Tesseract
- Docker
- TensorFlow