Introduction

In this article, we delve into the paradigm of Chain of Thought reasoning in Large Language Models. The aim is to highlight the importance of this idea and to summarize the main research in the area. The blog should give a reader in the field of AI enough context to understand the basic concepts and to think about potential research ideas that address the limitations of current models.

Chain of Thought Reasoning

Chain of thought (CoT) refers to mimicking the human thought process in large language models by endowing them with the ability to generate a chain of thought: a coherent series of intermediate reasoning steps that lead to the final answer.

It is hypothesized that CoT prompting helps LLMs to tackle complex arithmetic, commonsense and symbolic reasoning tasks. The following demonstration highlights this improvement.

[Figure: chain-of-thought prompting demonstration]
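Since the figure is not reproduced here, the gist of the comparison is sketched below. The wording follows the canonical example from the original CoT paper (Wei et al., 2022) and is illustrative rather than an exact reproduction.

Standard prompting exemplar -

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: The answer is 11.

Chain-of-thought prompting exemplar -

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

With standard prompting, the few-shot exemplar shows only the final answer, so the model tends to jump straight to an answer on a new multi-step question. With CoT prompting, the exemplar includes the intermediate reasoning, the model imitates that style, and it reaches the correct answer far more often on arithmetic and other multi-step problems.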

However, this paradigm of reasoning has some limitations with current models.

  • Small models do not improve with CoT prompting; performance gains appear only in LLMs with more than 100B parameters.
  • The performance improvements are larger for larger models. In other words, the benefits of CoT scale with model size.
  • Sometimes the models arrive at the correct answer through flawed reasoning. When the answers are wrong, the errors have been classified as follows:
    • Calculation error - LLMs are probabilistic models that predict which token occurs next. So when an LLM evaluates \(3 \times 25 \times 8 =\), it does not actually compute the answer but probabilistically guesses the next token. This highlights a fundamental limitation of current LLM architectures (a hypothetical trace illustrating this follows the list).
    • Symbol mapping error - When too many variables are involved, LLMs sometimes mix up the variables and arrive at the wrong answer. Again, the problem stems from the same architectural limitation highlighted in the previous point.
    • Beyond these major error types, the models also exhibit semantic understanding problems, missing steps, and incoherent chains of thought.
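To make the calculation-error failure mode concrete, here is a hypothetical chain-of-thought trace; the question, numbers, and model output are invented for illustration and do not come from the paper.

Q: A building has 8 floors, each floor has 3 shelves, and each shelf holds 25 books. How many books are in the building?

Model: Each floor holds 3 * 25 = 75 books. Across 8 floors, that is 75 * 8 = 560 books. The answer is 560.

The chain of thought is structurally sound and the first multiplication is correct, but the final step should be 75 * 8 = 600; the model produced a plausible-looking token instead of actually computing the product.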

Large Language Models Are Human-Level Prompt Engineers

The motivation of this paper is as follows -

  • Human effort in prompt engineering - Crafting effective prompts for LLMs is time-consuming and requires significant human expertise.

  • Optimization challenge - Prompts greatly influence LLM performance, but users often lack insight into how to optimize them for specific tasks.

  • Scalability - As LLMs grow in size and capabilities, manually designing prompts becomes less feasible for a wide range of applications.

  • Automating prompt design - There is a growing need to automate the prompt engineering process to enhance LLM usability and performance.

  • Real-world impact - Applications in diverse domains (e.g., AI chatbots, automated content generation) can benefit from optimized and automated prompts.

This work proposes the Automatic Prompt Engineer (APE) - a system that automates prompt generation and selection for Large Language Models. The task is treated as a program synthesis problem: the APE is given input-output pairs (natural language questions and answers) and must generate the instruction that would produce these pairs.

In essence, the APE is trying to learn the prompts a human would have written. The framework is as follows (a minimal code sketch of the full loop is given after the list) -

  1. Instruction Generation. An LLM is used as an inference model where the “instruction candidates” are generated based on a small set of input-output demonstrations.

    Example: The input to APE is of the form -

    Input 1 - Forward generation technique

    ”””

    I gave a friend an instruction and five inputs. The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs:

    Input: [ ] Output: [ ]

    Input: [ ] Output: [ ]

    The instruction was <COMPLETE>

    ”””

    Input 2 - Reverse generation technique

    ”””

    I instructed my friend to <INSERT>

    The friend read the instruction and wrote an output for every one of the inputs. Here are the input-output pairs:

    Input: [ ] Output: [ ]

    Input: [ ] Output: [ ]

    ”””

  2. Scoring Instructions. Evaluate each instruction by computing a score that reflects how well it guides the target LLM on the task. This is simply the confidence score derived from the log likelihoods of token generation; the authors use a moving-average score computed over a window of tokens.

    They also consider execution accuracy - whether the model produces the correct output under the instruction (a 0-1 loss). However, this cannot be used for all kinds of instructions.

    The top $k$-percentile prompts are selected and the rest are discarded.

  3. LLM as Resampling Model. They apply an iterative Monte Carlo search method to resample more prompts: the LLM generates semantically similar instruction variants to improve the top-performing candidates.

    Once the new prompts are generated, the moving-average scores are computed for each of them, and the better-scoring prompts are selected again.
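Putting the three steps together, here is a minimal Python sketch of the whole loop. It assumes two hypothetical helpers, llm_complete (samples a completion from an LLM given a prompt) and llm_logprob (returns the log-probability the target LLM assigns to an output given an instruction and an input); the meta-prompts, candidate counts, and number of rounds are illustrative rather than the paper's exact settings.

```python
# Minimal sketch of the APE loop described above. `llm_complete` and
# `llm_logprob` are hypothetical helpers, not real library calls.

def propose_instructions(demos, n_candidates, llm_complete):
    """Step 1: forward generation of instruction candidates from demonstrations."""
    demo_text = "\n".join(f"Input: {x} Output: {y}" for x, y in demos)
    meta_prompt = (
        "I gave a friend an instruction and some inputs. The friend read the "
        "instruction and wrote an output for every one of the inputs. "
        f"Here are the input-output pairs:\n{demo_text}\nThe instruction was"
    )
    return [llm_complete(meta_prompt) for _ in range(n_candidates)]


def score_instruction(instruction, eval_set, llm_logprob):
    """Step 2: average log-likelihood of the gold outputs under the instruction."""
    scores = [llm_logprob(instruction, x, y) for x, y in eval_set]
    return sum(scores) / len(scores)


def resample(instruction, n_variants, llm_complete):
    """Step 3: ask the LLM for semantically similar variants of a good instruction."""
    meta_prompt = (
        "Generate a variation of the following instruction while keeping the "
        f"semantic meaning.\nInput: {instruction}\nOutput:"
    )
    return [llm_complete(meta_prompt) for _ in range(n_variants)]


def ape(demos, eval_set, llm_complete, llm_logprob,
        n_candidates=50, top_fraction=0.2, n_rounds=3):
    """Generate, score, and iteratively resample instructions; return the best one."""
    candidates = propose_instructions(demos, n_candidates, llm_complete)
    for _ in range(n_rounds):
        ranked = sorted(candidates,
                        key=lambda ins: score_instruction(ins, eval_set, llm_logprob),
                        reverse=True)
        # Keep the top k-percentile of candidates and discard the rest.
        keep = ranked[: max(1, int(top_fraction * len(ranked)))]
        # Monte Carlo-style resampling around the surviving candidates.
        candidates = keep + [v for ins in keep for v in resample(ins, 2, llm_complete)]
    return max(candidates, key=lambda ins: score_instruction(ins, eval_set, llm_logprob))
```

The top_fraction selection mirrors the top $k$-percentile filtering described in step 2, while the resampling step plays the role of the iterative Monte Carlo search in step 3.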

Can APE be used to guide LLMs?

Although this is a very simple example, the work shows the potential of taking such a framework forward to more complex applications.

Another interesting approach is not to generate prompts from scratch, but to help humans design better prompts - essentially, augmenting the generation with human-provided context to produce better prompts. On the flip side, RLHF could be used to improve the APE itself.