I stumbled upon this paper while contemplating whether it’s possible (and valuable) to build a network representation of technology: e.g. which tech requires which lower-level tech, which tech depends on which natural phenomena, how tech shifts over time, what feedback loops are present, … It’s an ill-defined task that remains just a draft, a fun thought experiment; but let’s switch to the paper now.

Link: DeepDive: declarative knowledge base construction

The paper

The authors are part of Hazy Research, the group that recently created FlashAttention, an algorithm that speeds up attention and cuts its memory use, enabling longer context windows for transformers (and something I need to read more about). Check out their blog, it’s full of good stuff.

Declarative Knowledge is the set of true sentences that can be stated (by an individual, a collective, in a book, by humanity, on a topic, on everything, …).

Declarative knowledge is an awareness of facts that can be expressed using declarative sentences. It is also called theoretical knowledge, descriptive knowledge, propositional knowledge, and knowledge-that. It is not restricted to one specific use or purpose and can be stored in books or on computers.

Declarative knowledge is often contrasted with other types of knowledge. A common classification in epistemology distinguishes it from practical knowledge and knowledge by acquaintance. All of them can be expressed with the verb “to know” but their differences are reflected in the grammatical structures used to articulate them. – Declarative knowledge - Wikipedia

On computers, knowledge bases are typically structured repositories of information that can be queried. Knowledge is encoded as entities and relations between them. Populating these structured databases from unstructured data is called Knowledge Base Construction (KBC). It is done by humans (experts) or by automated means.
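To make that concrete, here’s a toy sketch (my own, with made-up entities and relations, not from the paper) of a KB as a set of triples plus a trivial query function:

```python
# A minimal sketch of a KB as (subject, relation, object) triples;
# all entities and relations below are made-up examples.
kb = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("BRCA1", "associated_with", "breast cancer"),
}

def query(subject=None, relation=None, obj=None):
    """Return all triples matching a (possibly partial) pattern."""
    return [
        (s, r, o)
        for (s, r, o) in kb
        if (subject is None or s == subject)
        and (relation is None or r == relation)
        and (obj is None or o == obj)
    ]

print(query(subject="aspirin"))   # everything the KB knows about aspirin
print(query(relation="treats"))   # all treatment facts
```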

KBs may seem to have been surpassed by LLMs nowadays. The methods in the paper belong to an older era of machine learning, I guess, judging by the involvement of an engineer-in-the-loop who is tasked with explicitly defining the domain-specific relations to be extracted from the large, unstructured input dataset.
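As an illustration (a hypothetical schema I made up; DeepDive actually declares relations in its own declarative language, DDlog, not in Python), the engineer’s work starts with something like:

```python
from typing import NamedTuple

# Hypothetical target relation an engineer might declare for a
# pharmacology KB. This only shows the idea: the KBC system's job
# is to populate a table of such tuples from the unstructured corpus.
class GeneDrugInteraction(NamedTuple):
    gene: str       # e.g. "CYP2D6"
    drug: str       # e.g. "codeine"
    sentence: str   # the text span the fact was extracted from
```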

The paper was published in 2017 and the project is now discontinued. However, I think it’s worth looking over the ideas.

KB quality is evaluated by precision and recall. When you ask a question, the answers the KB returns are useful if they measure well on these criteria:

  • Precision: of the answers returned, the fraction that is correct; false answers are not included;
  • Recall: of the correct answers that exist, the fraction that is returned; good answers are not lost in the system.
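These are the standard definitions; a tiny numeric example (made-up numbers) to pin them down:

```python
def precision(tp, fp):
    # Of the answers the KB returned, how many are correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the correct answers that exist, how many did the KB return?
    return tp / (tp + fn)

# Hypothetical example: a query returns 8 facts, 6 correct (tp=6, fp=2),
# and 4 relevant facts were missed (fn=4).
print(precision(6, 2))  # 0.75
print(recall(6, 4))     # 0.6
```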

Some applications of KBs:

  • Medical Genetics: statistically infer relations between genes, variants, and phenotypes (what the organism looks like)
  • Pharmacology: gene-drug-disease interactions
  • Paleontology: fossils, tracing evolution
  • Human Trafficking: aiding investigation

DeepDive is a KBC framework: you can use it to build KBs. The KB produced by DeepDive is a static DB populated with entities and relations extracted by the construction process.

The KB Construction process has three phases (a toy sketch of the first phase follows the list):

  1. Candidate Mapping and Feature Extraction: map the raw input to candidate facts and describe each candidate with features
  2. Learning and Inference: statistically estimate how likely each candidate fact is to be true
  3. Error Analysis: inspect the mistakes and refine the program
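Here’s a toy sketch of phase 1 (my own illustration with a made-up gene-drug relation; in DeepDive these steps are written declaratively and run at scale):

```python
import re

# Toy sketch of candidate mapping + feature extraction for a
# hypothetical gene-drug relation. Entity dictionaries, feature
# names, and the example sentence are all made up.
GENES = {"CYP2D6", "BRCA1"}
DRUGS = {"codeine", "tamoxifen"}

def candidates(sentence):
    """Candidate mapping: pair every gene mention with every drug mention."""
    words = re.findall(r"\w+", sentence)
    genes = [w for w in words if w in GENES]
    drugs = [w for w in words if w in DRUGS]
    return [(g, d) for g in genes for d in drugs]

def features(sentence, pair):
    """Feature extraction: simple signals the learning phase can weigh."""
    gene, drug = pair
    return {
        "contains_metabolize": "metabolize" in sentence.lower(),
        "gene_before_drug": sentence.find(gene) < sentence.find(drug),
    }

s = "CYP2D6 is known to metabolize codeine."
for pair in candidates(s):
    print(pair, features(s, pair))
```

Phase 2 is where DeepDive differs from a toy: it runs statistical inference over a factor graph to assign each candidate fact a marginal probability, and phase 3 closes the loop with the engineer examining the errors.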

The key ideas

  • it’s a framework to construct Knowledge Bases, not a KB system
  • it achieves quality equivalent to or above that of human-constructed KBs
  • human-maintained KBs become impractical in fast-growing fields such as medicine and genetics
  • … (wip)

Refs

See “Epistemology” for more on Knowledge and Declarative Knowledge.

Glossary