UFO Dataset Enrichment: OCR Extraction and Image Captioning
Why the study?
Enrich the UFO sightings dataset by:
- Extracting UFO sightings from PDF documents.
- Extracting UFO sightings from images.
What did I do?
- Merged two new datasets, one of PDFs and one of images, into the existing combined dataset:
  - PDF documents of UFO sightings from the United Kingdom.
    Source - http://www.theblackvault.com/documentarchive/united-kingdom-ufo-documents/
  - 3,500 images of UFO sightings reported by the public, scraped from ufostalker.com.
    Source - http://ufostalker.com/
- Extracted features from PDFs by building an OCR pipeline with the Tesseract toolkit to clean and featurize the extracted text.
- Extracted features from images by identifying objects with the Inceptionv4 model.
- Used Apache Tika to extract metadata features from images, such as creation date and geolocation.
- Generated captions for identified objects using a re-trained Inceptionv4 and the Neural Image Caption Generator.
- Re-trained the last layer of the Inceptionv4 model to generate better captions.
- Used multiple NER taggers on the caption, metadata, and description fields to featurize the dataset.
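The OCR step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function and regex choices are assumptions, and `pytesseract` (a common Python wrapper for Tesseract) stands in for whatever interface was actually used.

```python
import re

def ocr_pdf_page(image_path):
    """Run Tesseract OCR on one rendered PDF page (needs pytesseract + the tesseract binary)."""
    import pytesseract                      # third-party wrapper around the Tesseract CLI
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))

def clean_ocr_text(raw):
    """Normalise noisy OCR output before featurization."""
    text = raw.replace("\x0c", " ")                     # strip form-feed page breaks
    text = re.sub(r"[^A-Za-z0-9.,:/\- ]+", " ", text)   # drop OCR junk characters
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

def featurize(text):
    """Pull a few simple features (dates, times) out of the cleaned text."""
    dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text)
    times = re.findall(r"\b\d{1,2}:\d{2}\b", text)
    return {"dates": dates, "times": times, "length": len(text)}
```

For example, `featurize(clean_ocr_text("Sighting on 12/05/1997 at 21:30\x0c###"))` pulls out the date and time while discarding the OCR noise.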
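The Tika metadata step might look like the sketch below, assuming the `tika-python` client library (which needs a Java runtime). The metadata key names (`Creation-Date`, `geo:lat`, `geo:long`) follow Tika's usual EXIF mapping but should be verified against real output; the helper names are invented for illustration.

```python
def extract_image_metadata(path):
    """Extract metadata from an image via Apache Tika (requires tika-python and Java)."""
    from tika import parser                  # third-party; starts a local Tika server on first use
    return parser.from_file(path).get("metadata", {})

def pick_enrichment_fields(metadata):
    """Keep only the fields this study used: creation date and GPS coordinates."""
    wanted = {
        "Creation-Date": "created",   # assumed Tika key for EXIF creation date
        "geo:lat": "latitude",        # assumed Tika key for EXIF GPS latitude
        "geo:long": "longitude",      # assumed Tika key for EXIF GPS longitude
    }
    return {new: metadata[old] for old, new in wanted.items() if old in metadata}
```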
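When several NER taggers run over the same caption, metadata, and description fields, their outputs have to be combined into one feature set. A minimal sketch of that merge step, assuming each tagger returns (entity text, label) pairs; the function name and the example labels (spaCy-style `GPE`, Stanford-style `LOCATION`) are illustrative:

```python
def merge_ner_tags(*tagger_outputs):
    """Union entity labels from multiple NER taggers, keyed by entity text."""
    merged = {}
    for tags in tagger_outputs:              # one iterable of (entity, label) pairs per tagger
        for entity, label in tags:
            merged.setdefault(entity, set()).add(label)
    return merged
```

Keeping the union of labels lets downstream features record when taggers disagree on an entity's type.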
What did I learn?
- The importance of manual data cleaning when the data is very noisy, as it was in the PDFs.
- Inceptionv4 couldn't produce good results for images shot with mostly sky or open field in the background.
Technologies Used:
- Python
- Apache Tika
- Java 8
- Tesseract
- Docker
- TensorFlow