
SHELDON-AFWERX Discussion 2023-06-26

Revised BERTopic Models for Phase III FPDS Text

AFWERX
Phase III
BERTopic
Representation Methods
Embedding Models
Author

SHELDON

Published

June 26, 2023

Discussion of the methods tested to produce higher-quality BERTopic labels for subsequent Phase III awards to AFWERX entities.

Data Rights

Contract Number: N6833522C0500
Contractor Name: P.W. Communications, Inc.
Contractor Address: 11200 Rockville Pike Suite 130 Rockville, MD 20852
Expiration of Data Rights Period: January 16, 2029

The Government’s rights to use, modify, reproduce, release, perform, display, or disclose technical data or computer software marked with this legend are restricted during the period shown as provided in paragraph (b)(4) of the Rights in Noncommercial Technical Data and Computer Software Small Business Innovation Research (SBIR) Program clause contained in the above identified contract. No restrictions apply after the expiration date shown above. Any reproduction of technical data, computer software, or portions thereof marked with this legend must also reproduce the markings.

Updates

  • Implemented a method to collapse the outlier topic into real topics

  • Tested multiple embedding models

  • Implemented multiple topic representations
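
As a rough illustration of the first update, the sketch below reassigns every document in the outlier topic (labelled -1, following BERTopic's convention) to the topic whose centroid embedding is closest by cosine similarity. This is an assumption about one reasonable approach, not necessarily the method used here: BERTopic ships its own `reduce_outliers` method with several strategies, and this discussion does not say which one was applied. The function name `reassign_outliers` and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reassign_outliers(embeddings, topics, outlier=-1):
    """Move each outlier document to the topic with the nearest centroid.

    Hypothetical helper: embeddings is a list of vectors, topics is the
    parallel list of topic ids, and `outlier` marks unassigned documents.
    """
    # Group the embeddings of documents that already have a real topic.
    groups = defaultdict(list)
    for vec, topic in zip(embeddings, topics):
        if topic != outlier:
            groups[topic].append(vec)
    if not groups:
        return list(topics)  # nothing to collapse into

    # Centroid = component-wise mean of each topic's embeddings.
    centroids = {
        topic: [sum(column) / len(vectors) for column in zip(*vectors)]
        for topic, vectors in groups.items()
    }

    reassigned = []
    for vec, topic in zip(embeddings, topics):
        if topic == outlier:
            topic = max(centroids, key=lambda t: cosine(vec, centroids[t]))
        reassigned.append(topic)
    return reassigned


# Toy 2-d vectors, for illustration only (not project data).
toy_embeddings = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.7]]
toy_topics = [0, 0, 1, 1, -1, -1]
new_topics = reassign_outliers(toy_embeddings, toy_topics)
```

Forcing every outlier into a topic trades label purity for coverage, so a similarity threshold below which documents stay in the outlier topic may be worth adding.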

Distinct Text Level Topics

  • SciKit Basic
  • SciKit Basic with KeyPhrase Labels
  • KeyPhrase Basic Reduced Clusters
  • KeyPhrase Basic Smaller Clusters
  • Final KeyPhrase Smallest Cluster Size all-mpnet-base-v2 Model
  • Final KeyPhrase Smallest Cluster Size all-MiniLM-L6-v2 Model

Distinct Award Level

  • Small Clusters
  • Large Clusters

Keyword Extraction

  • SKLearn KeyBERT
  • KeyPhrase KeyBERT

YAKE

YAKE is another keyword extraction tool, implemented here because it is significantly faster and requires less compute than KeyBERT.

The YAKE and KeyBERT keywords clearly have analytic value on their own. They might also help guide slightly better BERTopic labels if we could seed the model with pre-defined labels built from these keywords, or from conceptual award types for FPDS Phase IIIs supplied by subject matter experts.

Collocations

The BERTopic/YAKE process showed that in certain cases one can infer, and in some cases know with reasonable certainty, the provenance of an FPDS Phase III award. Collocation analysis, combined with some of the odd BERTopic labels, led me to this finding.

Full FPDS Phase III Collocations: Bigrams, Trigrams and Quadgrams
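
For readers unfamiliar with collocation counts, here is a minimal sketch of how bigrams, trigrams and quadgrams can be tallied over award descriptions. It ranks by raw frequency only, which is a simplification: collocation tools often score candidates with an association measure, and this discussion does not say which scoring was used. The tokenizer, the function names and the sample strings are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(texts, n):
    """Count every run of n consecutive tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # zip over n staggered copies of the token list yields n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up sample descriptions, for illustration only.
sample_texts = [
    "prior SBIR phase ii award",
    "prior sbir Phase II contract",
]
bigrams = ngram_counts(sample_texts, 2)
trigrams = ngram_counts(sample_texts, 3)
quadgrams = ngram_counts(sample_texts, 4)
```

Calling `bigrams.most_common(20)` would then list the most frequent pairs for manual review.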

Methodology: First-Guess Targeted Collocations to Find Provenance

This was a best guess at phrases and regex patterns that might locate blocks of text explaining how a given FPDS Phase III award came to be. A more exhaustive corpus of search terms, plus discussion of patterns in topic codes and contract ID references, would significantly improve the quality of this method.

  • previous sbir
  • previous sttr
  • previous sb
  • prior sbir
  • prior sttr
  • topic af
  • topic fa
  • transition
  • af[0-9]
  • fa[0-9]
  • open topic
  • phase i
  • phase ii
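
The search terms above can be applied as case-insensitive regular expressions that return a window of surrounding text for each hit. The sketch below is one assumed way to do that, not the code used for this analysis. The function name, the window size and the sample sentence are hypothetical. The patterns are used exactly as listed, without word boundaries, so `phase i` also matches inside "Phase II" and "Phase III", and `af[0-9]` matches anywhere those characters occur.

```python
import re

# Search terms copied from the list above; two of them are regex patterns.
PROVENANCE_PATTERNS = [
    "previous sbir", "previous sttr", "previous sb",
    "prior sbir", "prior sttr",
    "topic af", "topic fa",
    "transition",
    "af[0-9]", "fa[0-9]",
    "open topic",
    "phase i", "phase ii",
]


def find_provenance_hits(text, patterns=PROVENANCE_PATTERNS, window=40):
    """Return (pattern, context) pairs for every match in the text.

    `window` is the number of characters of context kept on each side
    of a match (an arbitrary illustrative default).
    """
    hits = []
    for pattern in patterns:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            hits.append((pattern, text[start:end]))
    return hits


# Made-up description, for illustration only.
sample = "This Phase III derives from prior SBIR topic AF191-005."
sample_hits = find_provenance_hits(sample)
matched_patterns = {pattern for pattern, _ in sample_hits}
```

Wrapping terms in `\b` word boundaries would reduce overlapping hits such as `phase i` firing on every "Phase III", at the cost of missing topic codes embedded in longer strings.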
