Six Principles for the Effective Use of Artificial Intelligence Large Language Models – The CPA Journal

In Brief
Large language models (LLMs) such as ChatGPT, Bing Chat, and Bard can improve the efficiency of common language generation tasks performed by CPAs, but using LLMs entails certain risks. To help CPAs reap the benefits while minimizing the risks of this emerging technology, the authors suggest six principles to guide CPAs as they incorporate LLMs into their work.
Recently, the topic of artificial intelligence (AI) has captured the imagination of the public and professionals alike. Image generators like Dall-E 2 and Stable Diffusion have challenged the definition of art. The increased use of self-driving cars has challenged the role of human drivers. Large language models (LLMs), including ChatGPT (by OpenAI), Bing Chat (Microsoft), and Bard (Google), have emulated human conversation and facilitated professional and creative writing. LLMs have been quickly adopted by the public. For example, ChatGPT surpassed 100 million users within two months of its release in November 2022, smashing the adoption record previously held by Google, which took one full year to draw the same number of users (“ChatGPT Statistics for 2023: Comprehensive Facts and Data,” DemandSage, 2023, https://www.demandsage.com/chatgpt-statistics/). Nearly 30% of professionals acknowledge having used ChatGPT at work, according to a survey by Fishbowl, a Glassdoor product (“Almost 30% of Professionals Say They’ve Tried ChatGPT at Work,” Bloomberg News, 2023, https://tinyurl.com/3jfempyh). In March 2023, ChatGPT and Bard were the best-known and most widely used public LLMs, with many considering ChatGPT to be more useful (“How ChatGPT and Bard Performed as My Executive Assistants,” New York Times, Brian Chen, March 29, 2023; https://tinyurl.com/4dt98mrm). As of August 2023, the most prominent LLMs are ChatGPT (by OpenAI), Bard (Google), Bing Chat (Microsoft; built using ChatGPT), LLAMA (Meta), Bloom (Big-Science), and Dolly (Databricks) (“Popular Open-Source Large—Language Models,” Analytics Insight, 2023, https://tinyurl.com/5fb3h9e3).
An LLM is best thought of as an extremely fluent extension of predictive text. Just as a smartphone or e-mail client can suggest one or two words based on the context of the sentence being written, an LLM utilizes its training on a vast body of text (its “corpus”) to generate text in response to a user’s prompt. Instead of merely predicting the next two or three upcoming words to complete a phrase or sentence, LLMs can reply to a user’s prompt with several paragraphs of coherent (and often accurate) text across many domains. As of August 2023, LLMs can accept input files and create text (including computer code) from those inputs. One of the primary features of LLMs that distinguish them from other AI models is that LLMs can generate a response to a prompt on an unknown (to them) topic. The response may be highly inaccurate, but the response is based on LLMs’ contextual knowledge of (what they deem are) similar topics.
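The predictive-text analogy can be made concrete with a deliberately tiny sketch. The Python below is a toy word-pair ("bigram") predictor, not a description of how ChatGPT or any production LLM is built; the sample corpus and the function names are hypothetical and chosen only for illustration. It counts which word most often follows each word in its training text and then extends a starting word one predicted word at a time.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def generate(model, start_word, max_words=8):
    """Extend start_word by repeatedly appending the most frequent follower."""
    output = [start_word]
    for _ in range(max_words - 1):
        candidates = model.get(output[-1])
        if not candidates:
            break  # the model has never seen this word, so it cannot continue
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)


# Hypothetical three-sentence "corpus"; real LLMs train on vastly more text.
corpus = (
    "the auditor reviews the ledger . "
    "the auditor reviews the invoice . "
    "the auditor signs the report ."
)
model = train_bigram_model(corpus)
print(generate(model, "auditor", max_words=4))
```

The toy produces text that is fluent given what it has seen, yet it has no notion of whether the result is true or sensible. That is the same limitation, at a vastly smaller scale, that motivates the principles discussed in this article.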
LLMs are adept at interpreting, synthesizing, and expressing results in natural language. These capabilities have been applied in a variety of fields, including writing product recommendations for retailers, reading transcripts and summarizing earnings calls, and preparing marketing copy. CPAs—especially at firms with limited resources—may be curious to experiment with LLMs to improve the efficiency of common language generation tasks. For example, LLMs can be used to form a rough draft summarizing tax issues to a taxpayer client, prepare a memo to staff, draft a courteous e-mail to an audit client, describe how to conduct a particular process, conduct research into technical topics, or quickly gain an understanding of the risks and opportunities facing a particular industry.
Using LLMs effectively requires an appreciation of the risks of this technology. One of the major risks is data privacy. CPAs have a responsibility to protect client data, and must exercise extreme caution when using LLMs with client data. LLMs have many features (for example, as of August 2023, ChatGPT has the ability to accept multiple input files of data; “ChatGPT—Release Notes,” OpenAI, 2023, https://tinyurl.com/25a27fhd) that make it easy and enticing to input client data into the LLM, but uploading data entails significant data security risks. Other risks associated with LLMs include their limited ability to provide useful responses in specialized bodies of knowledge, their struggle to adequately process facts that adhere to a timeline, the difficulty of sorting fact from fiction in their responses, the need to critically assess the rationale for their answers, their struggle to cite their sources, and the difficulty of determining what degree of confidence should be placed in their answers.
The accounting profession is currently navigating the tension between the advantages offered by LLMs and their risks. For example, in February 2023, PwC Australia encouraged their staff to “experiment with ChatGPT personally, and to think about the role generative AI could play in our business” (“PwC Warns Staff against Using ChatGPT for Client Work,” Australian Financial Review, 2023, https://tinyurl.com/43de5crw), but cautioned:
It is important that we navigate the risks and limitations of using [LLMs]… They are stochastic in nature (meaning you could get a different answer every time you ask the same question), they can present inaccuracies as though they are facts, and they are prone to user error. They will require review and oversight and cross-validation of results before they can be relied upon for tasks that demand precision.
Six Principles for the Effective Use of LLMs

Research on the performance of LLMs in the accounting domain is rapidly emerging. In January 2023, a team of 327 researchers compared the performance of ChatGPT 3.5 to university students. They found that, although students outperformed ChatGPT 3.5 by an average score of 76.7% versus 56.5%, ChatGPT 3.5 performed better than students on a not-inconsiderable 15.8% of assessments (David A. Wood, et al., “The ChatGPT Artificial Intelligence Chatbot: How Well Does It Answer Accounting Assessment Questions?”, Issues in Accounting Education, November 2023). However, four months later, Eulerich, Sanatizadeh, Vakilzadeh, and Wood found that the next generation of ChatGPT (ChatGPT 4.0) was able to “easily pass” the CPA, CMA, CIA, and EA (enrolled agent) exams with an average score of 85.1% (“Can Artificial Intelligence Pass Accounting Certification Exams? ChatGPT: CPA, CMA, CIA, and EA?”). A related line of literature applies LLMs to hypothetical or actual accounting practice. Street and Wilck applied ChatGPT 3.5 to the forensic accounting domain and concluded that forensic accountants need to apply their own accounting expertise to critically evaluate LLM responses for shortcomings (Daniel Street and Joseph Wilck, “‘Let’s Have a Chat’: Principles for the Effective Application of ChatGPT and Large Language Models in the Practice of Forensic Accounting,” Journal of Forensic and Investigative Accounting, July–December 2023, http://dx.doi.org/10.2139/ssrn.4351817). Emett et al. documented the application of ChatGPT 3.5 in a large energy company’s internal audit function and estimated efficiency gains of 50% to 80% (Scott A. Emett, Marc Eulerich, Egemen Lipinski, Nicolo Prien, and David A. Wood, “Leveraging ChatGPT for Enhancing the Internal Audit Process–A Real-World Example from a Large Multinational Company,” 2023, working paper, http://dx.doi.org/10.2139/ssrn.4514238). Notably, each of these articles identified risks and challenges facing accountants seeking to benefit from LLMs.
Grounded in this recent research, this article provides six principles to guide CPAs as they incorporate LLMs into their workflow. Applying these principles will enable CPAs to minimize the risk of this emerging technology while still reaping the available benefits of increased efficiency and effectiveness. After describing each of the six principles, this article provides an illustration that demonstrates how applying these principles improves the performance of LLMs on a tax accounting task.
LLMs are trained on a broad corpus of text and have a wide range of domain knowledge. This can make them very effective tools, but asking the right questions in the accounting domain takes practice. Prompts (questions) for LLMs are most effective if they are specific and reference the context of the prompt. For example, the prompt “How should a building be valued on the balance sheet under U.S. GAAP?” will be more effective than “How should a building be valued on a balance sheet?”. CPAs should also either avoid or define ambiguous terms within the prompt. “How do I account for the purchase of a bond as an investment?” is more likely to be effective than “How do I record a bond transaction?”, because “bond” could either refer to the transaction in which a bond is issued or purchased. Users can also provide an example in the prompt and request that the LLM use a specific framework (e.g., “Prepare an expected credit loss schedule for my receivables using the following table format”).
In summary, we recommend that CPAs provide LLMs with prompts that are specific, provide the context for the question, explain or define ambiguous terms, and provide an example (if applicable).
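One way to make this recommendation habitual is to assemble prompts from named parts, so that the framework, the definitions, and the example format are never left out by accident. The following Python sketch is one possible arrangement under that assumption; the function `build_prompt` and its parameter names are hypothetical and are not part of any LLM vendor's tooling.

```python
def build_prompt(task, framework=None, definitions=None, example_format=None):
    """Assemble a prompt that states context, defines terms, and shows a format."""
    parts = []
    if framework:
        # State the accounting basis or tax regime up front.
        parts.append(f"Context: answer under {framework}.")
    for term, meaning in (definitions or {}).items():
        # Pin down any term that could be read more than one way.
        parts.append(f'In this question, "{term}" means {meaning}.')
    parts.append(task)
    if example_format:
        parts.append(f"Use the following format:\n{example_format}")
    return "\n".join(parts)


prompt = build_prompt(
    task="How do I account for a bond transaction?",
    framework="U.S. GAAP",
    definitions={"bond transaction": "the purchase of a bond as an investment"},
)
print(prompt)
```

The resulting text can be pasted into whichever LLM the CPA uses; the structure simply enforces the habits described above.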
The breadth of information in an LLM’s corpus can lead to problems when different accounting bases or tax standards can apply to a given prompt. For instance, unless instructed clearly, an LLM could provide information from either U.S. GAAP or IFRS; thus, it is critical that CPAs exercise skepticism and use their expertise to ensure that the proper concepts have been applied in the output of the LLM. Only with sufficient expertise will an accountant recognize when IFRS standards are being applied but U.S. GAAP standards are needed. (Different LLMs have different strengths and weaknesses. As of August 2023, ChatGPT does not reliably cite its sources; in fact, when asked to cite sources, it sometimes makes up sources that don’t exist. On the other hand, Bing Chat reliably cites its sources by providing web links.) In other words, LLMs are subject to significant rates of “Type 2 error”: LLMs may provide responses that appear relevant and accurate on their face but prove inaccurate when reviewed more carefully. Therefore, we recommend that CPAs maintain a high degree of concept skepticism, critically assessing whether the concept that the LLM has applied in a given response is indeed the correct concept expected to be applied (with respect to accounting basis, framework, or standard).
LLMs are trained on a corpus of data that spans time. Although this provides them with an impressive breadth of knowledge, LLMs struggle to differentiate between information over time. An LLM might provide the current accounting and tax standard information, or it might provide outdated and superseded information. CPAs should also understand and carefully consider the extent to which the LLM they employ has access to current information. The corpus supporting ChatGPT 3.5, for example, only consists of information through the end of 2021; ChatGPT 3.5, therefore, will not be able to provide tax or accounting information that has changed since that time. Its response to a request for the IRS mileage rates or the deductibility of medical expenses may or may not contain correct information for the proper tax year. Similarly, it will not be able to describe the impact of any current economic or business events occurring after that date. Other LLMs including Bing Chat and Bard have access to Internet information sources, and can more reliably refer to current information. We recommend that CPAs critically assess whether the information that the LLM has provided indeed applies to the proper time period.
In the authors’ experience, LLMs provide more accurate responses to specific tasks and questions rather than large or complex tasks. For example, ChatGPT 3.5 provides more accurate responses when prompted to generate the assets, liabilities, or stockholders’ equity portions of a balance sheet separately rather than preparing an entire balance sheet at once. We recommend that CPAs break down large tasks into smaller portions and ask LLMs to perform those smaller tasks in sequence. This also allows for a user to verify the correctness of each smaller task before beginning the next step in the process.
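The workflow of splitting a task and checking each piece before moving on can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ask` stands in for however the CPA submits a prompt to an LLM, `review` stands in for the CPA's own check of each draft, and both (like `run_in_steps` and `placeholder_ask`) are hypothetical names rather than real APIs.

```python
def run_in_steps(subtasks, ask, review):
    """Submit subtasks one at a time; stop at the first draft the reviewer rejects.

    Returns (accepted, rejected_subtask), where rejected_subtask is None
    if every draft passed review.
    """
    accepted = []
    for subtask in subtasks:
        draft = ask(subtask)
        if not review(subtask, draft):
            return accepted, subtask  # fix this step before continuing
        accepted.append((subtask, draft))
    return accepted, None


def placeholder_ask(prompt):
    """Stand-in for a real LLM call, so the sketch runs on its own."""
    return f"DRAFT for: {prompt}"


subtasks = [
    "Prepare the assets section of the balance sheet.",
    "Prepare the liabilities section of the balance sheet.",
    "Prepare the stockholders' equity section of the balance sheet.",
]
accepted, rejected = run_in_steps(
    subtasks, ask=placeholder_ask, review=lambda subtask, draft: True
)
print(len(accepted), rejected)
```

Stopping at the first rejected draft mirrors the point made above: an error caught in the assets section is corrected before it can flow into later sections.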
In addition to the data provided in its corpus, an LLM can receive and respond to user-submitted text and data. Although this feature facilitates the broad functionality of LLMs, inputting private, sensitive, or proprietary data into an LLM raises serious concerns about data storage and privacy. For example, LLMs could incorporate user-submitted data into their knowledge and then synthesize and provide this information to other users who have no right to this data. Until and unless LLMs are implemented with reliable privacy safeguards (including an “on-premise server” or a “private instance” for a particular company or firm), the authors recommend that CPAs do not input private, sensitive, or proprietary information into LLMs.
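A CPA who drafts prompts from working notes may want a mechanical check that strips obvious identifiers before any text leaves the firm. The Python sketch below is a minimal illustration of that idea, assuming U.S.-style Social Security numbers and a known list of client names; the patterns and the `redact` function are hypothetical, will miss many kinds of sensitive data, and do not make it safe to submit client information to an LLM.

```python
import re

# Illustrative patterns only; they will not catch every identifier.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text, client_names=()):
    """Replace SSNs, e-mail addresses, and listed client names with placeholders."""
    text = SSN_PATTERN.sub("[SSN]", text)
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    for index, name in enumerate(client_names, start=1):
        text = re.sub(re.escape(name), f"[CLIENT {index}]", text, flags=re.IGNORECASE)
    return text


note = "Regina Jenkins (SSN 123-45-6789, regina@example.com) has AGI of $28,000."
print(redact(note, client_names=["Regina Jenkins"]))
```

Even after redaction, details such as income, family composition, and location can identify a client in combination, so a filter of this kind supplements the recommendation above rather than replacing it.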
LLMs are designed to have natural conversations with users. ChatGPT, Bing Chat, and Bard all attempt to remember the context of a conversation and, once effectively prompted, stay within the scope of the topic. If the LLM misinterprets a prompt or additional information is needed, users can ask follow-up or clarifying questions and the LLM will consider them within the context of the earlier questions within the conversation. For example, if you ask an LLM about how a building should be valued under U.S. GAAP and then in a subsequent question ask, “Which depreciation methods are acceptable?”, it is likely to tailor its responses to depreciation methods common for buildings. Therefore, we recommend that users ask follow-up questions and provide clarifications and directions within a given conversation. But exercise discretion with feedback: LLMs can learn from user feedback and incorporate it into their base of knowledge.
As of August 2023, ChatGPT, Bing Chat, and Google Bard all have the ability to save previous chats (known as “chat history”). However, ChatGPT does not actively remember the context of previous chats when generating responses in a new chat (OpenAI Developer Forum, April 2023, https://community.openai.com/). Similar documentation for Bing Chat is unavailable, but it is reasonable to expect that Bing Chat also does not remember the context of previous chats when responding to a new chat; by contrast, Google Bard was designed to remember previous conversations. Bard’s capability is limited by the amount of computing memory and available capacity; thus, its recollection may be unreliable (Bard FAQ Page, August 2023, https://bard.google.com/faq). Because many LLMs cannot access context developed in other conversations, the authors recommend that CPAs create a new conversation when switching to a new task or topic.
Although impressively fluent with text, LLMs are quite poor at any but the clearest quantitative tasks. It is important for CPAs to remember that the fundamental task of an LLM is to suggest a sequence of text that best responds to the user’s prompt. LLMs are not designed to produce quantitative responses, nor are they skilled at mathematical procedures, although they will provide numeric text if that text is predicted to be the best in a given response. For example, when the authors tasked ChatGPT with creating an income statement based on several business transactions, a portion of its response asserted that $2.00 + $5.50 + $7.50 = $17.50. Similarly, when preparing a partial balance sheet, it asserted that fixed assets “have a total value of $1,775.00 but also have been depreciated by $1,055.00. The net value of the fixed assets after depreciation is $4,720.00.” CPAs are much better off utilizing spreadsheet software, business calculators, or nearly any other alternative software rather than relying on an LLM to conduct mathematical procedures. CPAs can ask LLMs to suggest, identify, or describe the mathematical process to be used for a given scenario, but we recommend that CPAs do not use LLMs to actually perform mathematical calculations.
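The two figures quoted above are easy to check with ordinary software. As a minimal sketch, the Python below recomputes them with the standard-library `decimal` module, which avoids binary floating-point rounding on currency amounts; the variable and function names are hypothetical.

```python
from decimal import Decimal


def total(amounts):
    """Sum currency amounts given as strings, using exact decimal arithmetic."""
    return sum((Decimal(amount) for amount in amounts), Decimal("0"))


# The LLM asserted that these three items sum to $17.50.
line_items = ["2.00", "5.50", "7.50"]
print(total(line_items))  # 15.00

# The LLM asserted a net fixed asset value of $4,720.00.
fixed_asset_cost = Decimal("1775.00")
accumulated_depreciation = Decimal("1055.00")
net_fixed_assets = fixed_asset_cost - accumulated_depreciation
print(net_fixed_assets)  # 720.00
```

The correct sum is $15.00 rather than $17.50, and the correct net value is $720.00 rather than $4,720.00. A spreadsheet formula would serve equally well; the point is that the calculation is performed by a tool designed for arithmetic.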
Like many other LLMs, ChatGPT was trained on a large corpus that included both fact and fiction across many knowledge areas. Approximately 3% of ChatGPT’s corpus consists of “facts” from Wikipedia and an additional 38% has been vetted by humans beyond the original author. [ChatGPT’s corpus consisted of a general “crawl” of the Internet (60%), websites with a certain level of upvotes (22%), books (16%), and Wikipedia (3%) (“Language Models are Few-Shot Learners,” Tom B. Brown, Benjamin Mann, Nick Ryder, et al., 2020, https://arxiv.org/abs/2005.14165). Although the general “crawl” provided the largest amount of data included in the corpus, Wikipedia data was given the highest weight, followed by websites with the most up-votes. ChatGPT’s crawl of the Internet allowed it to learn how to use conversational language across a wide breadth of content.] In addition to factual knowledge, ChatGPT has been designed to engage in human-like conversation and is capable of creating new answers; it can create new computer code, write a poem, or speak in a given person’s style. Its training encouraged it to communicate facts as well as to communicate with human-like creativity. Unfortunately, ChatGPT does not articulate to the user whether it is conveying facts or displaying creativity; hence, it may lie. Other LLMs have different training and reward structures, and are capable of providing references for facts.
The “knowledge” of an LLM is probabilistic. The authors have encountered situations where ChatGPT’s responses have disagreed with each other, even in the same conversation. For example, the authors have observed ChatGPT state that an event has occurred in one year, but when asked again, it states that the event actually occurred in a different year. Certain firms have prohibited the use of ChatGPT due to concerns over factual accuracy (“PwC Warns Staff against Using ChatGPT for Client Work,” Australian Financial Review, 2023, https://tinyurl.com/43de5crw). We recommend that CPAs seek “knowledge” or “facts” from sources other than an LLM. CPAs should evaluate and cross-check any factual claims provided by an LLM using a reliable alternative source.
The more frequently an item appears in an LLM’s corpus, the more knowledgeable the LLM will be about that topic. More prominent concepts, events, and firms appear more frequently in the corpus, and allow the LLM to respond more capably to prompts on these subjects. For example, an LLM is much more capable of addressing questions about federal tax law than about municipal tax regulations. Therefore, we suggest that CPAs rely more on the responses provided by LLMs for topics that are more prominent.
LLMs can make the work of a CPA more efficient and potentially more effective, but they cannot replace a CPA’s expertise. The authors have successfully used LLMs to enhance our work by having them:
Although the authors recommend that accountants use LLMs to enhance their work on a variety of tasks, we do not predict the “death of accounting” at the hands of LLMs anytime soon, for at least four reasons:
To illustrate how CPAs can apply these six principles in practice, consider the following hypothetical situation: The year is 2021. You sent an update to your client list about eligibility for the California Young Child Tax Credit. One of your clients falls within the phase-out range for this credit (Regina Jenkins, a W-2 employee with AGI of $28,000 and single mother of two qualifying children, a 3-year-old daughter and a 7-year-old son). Regina has heard a lot about the Federal Advance Child Tax Credit Payments in 2021, so she sent you an e-mail asking whether she’ll be receiving monthly checks for this California credit, and if so, whether and how she should opt out.
Prompt 1:
I sent all of my clients an email about an upcoming tax credit [1] that they may qualify for. [2] Regina Jenkins [3], my client, has an adjusted gross income of $28,000. Regina told me that she is eligible for the tax credit. Does that mean she’ll receive a check immediately? [4] How much would the check be? [5] Should she opt out of the payment? [6,7]
Principles Violated by Prompt 1:
Performance of selected LLMs on Prompt 1. The authors input “Prompt 1” into ChatGPT, Bing Chat, and Google Bard in August 2023.
ChatGPT: ChatGPT responded about tax credits broadly, rather than seeking clarification about which tax credit information was sought. Although each of the questions in the prompt was addressed in the response, the answers were not sufficiently precise. A portion of the response provided by ChatGPT follows: “The eligibility for a tax credit doesn’t necessarily mean that the individual will receive a check immediately. Tax credits can affect a person’s tax liability, which could result in a refund or reduced tax owed when they file their tax return for the relevant tax year.” (ChatGPT 3.5, OpenAI, https://chat.openai.com/). At the end of its response, ChatGPT acknowledged its limitations and caveated its response with a recommendation that Regina should “reach out to a tax professional who can guide her based on her individual circumstances and the specific tax credit in question.”
Bing Chat: Rather than providing information about a given tax credit or about tax credits broadly, Bing Chat requested more specific information from the user. A portion of its response follows: “It is not clear which tax credit you are referring to. There are many tax credits available, and the amount and eligibility requirements vary depending on the specific credit. Could you please provide more information about the tax credit you mentioned in your e-mail to your clients? That way, I can better assist you with your question.” (Bing Chat, Microsoft, https://www.bing.com/search/)
Google Bard: Google Bard assumed we sought information about the Federal Earned Income Tax Credit (EITC). Based on this assumption, it attempted to calculate Regina’s EITC, then stated that Regina would not receive an advance payment. A portion of its response follows: “The tax credit you are referring to is the Earned Income Tax Credit (EITC). Regina is eligible for the EITC because her adjusted gross income (AGI) is below the threshold for her filing status. The maximum EITC for a taxpayer with an AGI of $28,000 is $480. Regina will not receive a check immediately. The EITC is a refundable tax credit, which means that it can reduce Regina’s tax liability to zero and she will receive any remaining credit as a refund. Regina should not opt out of the payment. The EITC is a valuable tax credit that can help Regina reduce her tax burden and get a larger refund.” (Bard, Google, https://bard.google.com/).
By contrast, Prompt 2 adheres to the above principles by relying on the CPA’s judgment and expertise at quantitative tasks while leveraging LLMs’ fluid writing abilities. The underlined portions of the following prompt demonstrate how it adheres to the six principles, which are specified in brackets. (Neither the underline formatting nor the bracketed contents were provided to the LLMs.)
Prompt 2:
I have a client [Principle 3] who is eligible for the 2021 California Young Child Tax Credit [Principle 1]. They are subject to a phaseout [Principles 5, 6] because their income is between $25k and $30k. Because of the phase-out, they are eligible for a total annual refundable credit of $434 [Principle 4] on their year-end tax return rather than the maximum possible credit of $1,083. Please draft an e-mail to send to my client to explain this situation. [Principles 1, 7]
Performance of selected LLMs on Prompt 2: The authors input “Prompt 2” into ChatGPT, Bing Chat, and Google Bard in August 2023. ChatGPT, Bing Chat, and Google Bard each generated an e-mail template in a matter of seconds. The responses from each of the tools were noticeably more accurate than those provided in response to Prompt 1.
ChatGPT: ChatGPT provided a well-written, professional e-mail that cogently communicated the impact of the phase-out on Regina’s credit amount. A portion of that response follows: “Given your income level, I want to clarify the specifics of how the phaseout affects your eligibility and the resulting credit amount. The maximum possible credit for the California Young Child Tax Credit is $1,083. However, due to your income being within the aforementioned phase-out range, your eligibility is adjusted accordingly. Based on your income and the phaseout rules, you are eligible for a total annual refundable credit of $434. This means that when you file your year-end tax return, you can claim this credit, which will contribute to reducing your tax liability or increasing your refund. While this amount is less than the maximum credit, it still provides a valuable benefit to your financial situation.” (ChatGPT 3.5, OpenAI, https://chat.openai.com/).
Bing Chat: Bing produced a similarly cogent, considerate, and professional e-mail. A portion of that response follows: “Based on the information you provided, it appears that you are subject to a phaseout because your income is between $25,000 and $30,000. As a result, you are eligible for a total annual refundable credit of $434 on your year-end tax return, rather than the maximum possible credit of $1,083. I understand that this may be disappointing news, but please know that this credit is still a valuable benefit that can help reduce your tax liability. If you have any questions or concerns, please do not hesitate to contact me.” (Bing Chat, Microsoft, https://www.bing.com/search/).
Google Bard: Bard’s response was very similar to that provided by the other two LLMs, but it provided incorrect information about the age requirement for a qualifying child in California. A portion of its response follows: “I’m writing to you today to discuss the California Young Child Tax Credit (YCTC) that you may be eligible for. The YCTC provides a refundable credit of up to $1,083 per eligible tax return for families with a qualifying child under the age of 6. [sic] I understand that your income is between $25,000 and $30,000, which means that you are subject to the YCTC phaseout. This means that your credit will be reduced as your income increases. In your case, your credit will be reduced to a total annual refundable credit of $434. I know that this may be disappointing, but I want to assure you that you are still eligible for a significant credit. The YCTC can help to offset the cost of raising a young child, and it can also help to make your tax return more affordable.” (Bard, Google, https://bard.google.com/).
The inaccuracy in Bard’s response (the age limit for a “qualifying child” according to the California YCTC in 2021 was 19 if not a full-time student and 24 if a full-time student) emphasizes the need for Principles 2 and 6. It is critical that CPAs verify the accuracy of each subtask completed by an LLM and apply their own domain expertise. This weakness aside, each of the LLMs produced an effective first draft of the client email in a matter of seconds, illustrating significant efficiency.
Since beginning our research on ChatGPT, the authors have been asked to provide a summary of the usefulness and impact of LLMs in the domain of accounting. Some academics and professionals that we spoke with wonder if LLMs like ChatGPT are a panacea—after all, there have been widespread reports about its remarkable capabilities—while others dismiss LLMs as unreliable, unhelpful, and overblown. The authors believe that neither extreme perspective is accurate; rather, we believe that CPAs who appropriately incorporate LLMs into their work can improve both their efficiency and effectiveness. Appropriately incorporating LLMs into accounting requires a careful, strategic approach. Therefore, CPAs are encouraged to utilize the principles provided in this article to appropriately leverage LLMs in their work.

The CPA Journal is a publication of the New York State Society of CPAs, and is internationally recognized as an outstanding, technical-refereed publication for accounting practitioners, educators, and other financial professionals all over the globe. Edited by CPAs for CPAs, it aims to provide accounting and other financial professionals with the information and analysis they need to succeed in today’s business environment.
The CPA Journal
14 Wall St. 19th Floor
New York, NY 10005
CPAJ-Editors@nysscpa.org
