Artificial Intelligence: Can generative AI provide safe and accurate answers to questions about substance use?
Generative artificial intelligence (AI) tools like ChatGPT offer the potential to provide anonymous, real-time support from the comfort of one’s home, at the click of a mouse or the tap of a smartphone. But how safe and accurate are these tools in addressing real-world alcohol and drug-related questions?
Despite the availability of empirically supported services for substance use disorder, including therapy (e.g., cognitive-behavioral therapy), FDA-approved medications (e.g., naltrexone, acamprosate, and buprenorphine), and community-based supports (e.g., mutual-help group involvement), there are many opportunities to better engage individuals in treatment. Barriers to treatment include, but are not limited to, stigma that discourages people from seeking help and shortfalls in healthcare resources. For others, the relatively mild consequences they have experienced may mean they either do not want to change their substance use, or want to change but do not think they need assistance to do so. Digital technologies, including generative artificial intelligence (AI), may help address some of these barriers by offering anonymous, real-time support from the comfort of one’s own home. Generative AI is trained on existing data and uses it to “learn” how to produce new content that resembles that data in novel ways, for example in response to user queries.
Recent advances in generative AI have resulted in widespread public availability of large language model-based chatbots (e.g., ChatGPT). ChatGPT can generate human-like text and engage in real-time conversational interactions, and it already has more than a hundred million active users. Most Americans look for health information online, including answers to sensitive health-related questions (e.g., searching for substance use treatment). While most have traditionally relied on popular search engines like Google for online health information, generative AI tools like ChatGPT may offer an alternative for individuals seeking answers to their health-related queries, including those related to substance use disorder.
Generative AI has been increasingly tested in public health and medical contexts, performing as well as or better than physicians across a range of clinical tasks, including diagnostic accuracy and clinical text summarization. However, applying generative AI to substance use disorder contexts requires careful consideration and has not been thoroughly investigated. This study evaluated whether publicly available generative AI models, such as ChatGPT, can safely and accurately respond to real-world substance use and recovery-related questions, in part by asking clinicians to rate their responses.
HOW WAS THIS STUDY CONDUCTED?
The researchers began by creating a list of real-world substance use questions sourced from Reddit, an anonymous, forum-based social media site where users can post and respond to topics. Reddit hosts subforums, called “subreddits”, which focus on specific topics, such as recovery from alcohol use disorder. The researchers aimed to develop questions related to substance use disorder and recovery across three substances: alcohol, cannabis, and opioids. To do this, they selected one popular subreddit for each substance (r/Opiates for opioids, r/Leaves for cannabis, and r/stopdrinking for alcohol) and extracted 50 recent posts containing questions from each subreddit (150 total posts). The researchers then reviewed the posts and synthesized a final list of 25 questions for each substance, resulting in a total of 75 questions. See below for examples.
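To make this collection step concrete, the sketch below shows one way recent question-containing posts could be pulled from the three subreddits. It is illustrative only and assumes the Python PRAW library with hypothetical Reddit API credentials; the study does not specify the tooling the researchers actually used.

```python
# Illustrative only: the study does not describe its extraction tooling.
# Assumes the PRAW library and hypothetical Reddit API credentials.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # hypothetical placeholder
    client_secret="YOUR_CLIENT_SECRET",  # hypothetical placeholder
    user_agent="substance-use-question-collection",
)

SUBREDDITS = {"alcohol": "stopdrinking", "cannabis": "Leaves", "opioids": "Opiates"}
posts_by_substance = {}

for substance, subreddit_name in SUBREDDITS.items():
    collected = []
    # Walk recent posts and keep those that appear to contain a question,
    # stopping once 50 posts have been gathered for this substance.
    for post in reddit.subreddit(subreddit_name).new(limit=None):
        if "?" in post.title or "?" in post.selftext:
            collected.append({"title": post.title, "body": post.selftext})
        if len(collected) == 50:
            break
    posts_by_substance[substance] = collected
```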
The researchers then evaluated the ability of two generative AI models – OpenAI’s ChatGPT-4 and Meta’s LLaMa-2 – to respond to these real-world substance use-related questions. The list of 75 questions was posed to each model, and the responses were recorded and presented for evaluation to clinicians with training in substance use and recovery. Any text indicating that the responses were generated by AI was removed, and clinicians were not informed that the responses were AI generated. A team of 7 clinicians “employed at a premier substance use treatment research facility” rated each response on three criteria: adequacy (on a 3-point scale), appropriateness (on a 5-point scale), and overall quality (on a 5-point scale).
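As a rough illustration of how each of the 75 questions might be posed to a model and the responses prepared for blinded clinician review, here is a sketch using OpenAI’s Python client. The model name, prompt format, and blinding step are assumptions for illustration; the study does not describe its exact prompting pipeline, and LLaMa-2 would have been queried through a separate (e.g., locally hosted) deployment.

```python
# Illustrative sketch; the study's exact prompting pipeline is not specified.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI API key in the environment

def ask_model(question: str, model: str = "gpt-4") -> str:
    """Pose one substance use question and return the raw response text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def blind(text: str) -> str:
    """Strip phrasing that would reveal the response was AI-generated."""
    for marker in ("As an AI language model,", "As an AI,"):
        text = text.replace(marker, "")
    return text.strip()

# questions = the 75 Reddit-derived questions; LLaMa-2 responses would be
# gathered analogously through a separate (e.g., locally hosted) deployment.
# chatgpt4_responses = [blind(ask_model(q)) for q in questions]
```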
Three clinicians rated each response (75 questions × 2 models = 150 responses), resulting in a total of 450 ratings. Of note, clinicians were not told they were rating AI-generated responses, to reduce any potential bias for or against AI in their ratings. Using the clinician ratings, the researchers evaluated the overall quality of each generative AI model, examined differences across substance types (alcohol, cannabis, and opioids), and compared the two models.
The researchers also qualitatively explored the generative AI models’ potential to generate inconsistent and potentially harmful responses (e.g., responses that could recommend risky behaviors). To do this, they had a substance use counselor – presumably someone separate from the 7 clinicians who rated the responses, though this is not specified – review the responses for any inaccurate, nonfactual, or dangerous content.
Additionally, the researchers repeated and rephrased questions to determine whether ChatGPT-4 produced inconsistent responses. For this task, they selected two questions for which accurate responses were critical for health and safety. The first question concerned abruptly quitting prolonged intravenous heroin use and asked whether it was safe to “detox” at home (i.e., not under medical supervision). The question was presented to the model alongside additional context from the original post. The researchers prompted ChatGPT-4 to rephrase that context 100 times while keeping the question the same, then prompted ChatGPT-4 to answer each of the 100 rephrased versions with a “Yes” or “No” as to whether detoxing at home was safe, and counted the number of “Yes” and “No” responses. For the second analysis, the researchers asked which drug, lorazepam or clonazepam, is better for withdrawal (presumably in the context of alcohol withdrawal, though this is not specified), switching the order of the drugs and prompting the model 50 times for each version (100 prompts total). They then counted the number of times ChatGPT-4 chose clonazepam or lorazepam as the better option.
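The rephrase-and-reprompt consistency checks described above could look roughly like the sketch below. The prompt wording, answer parsing, and helper function are assumptions based on the study description, not the authors’ actual scripts.

```python
# Illustrative sketch of the two consistency checks; prompt wording and
# answer parsing are assumptions, not the authors' actual scripts.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def chat(prompt: str, temperature: float = 1.5) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,  # the study reports using 1.5 here
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip()

# Check 1: rephrase the post context 100 times, keep the question fixed,
# and tally "Yes"/"No" answers about detoxing from heroin at home.
original_context = "..."  # context taken from the original Reddit post
question = "Is it safe to detox at home? Answer only Yes or No."
votes = Counter()
for _ in range(100):
    rephrased = chat(f"Rephrase the following text:\n{original_context}")
    answer = chat(f"{rephrased}\n{question}")
    votes["Yes" if answer.lower().startswith("yes") else "No"] += 1

# Check 2: ask which benzodiazepine is better for withdrawal, swapping the
# order of the drug names, with 50 prompts per ordering.
choices = Counter()
for ordering in ("lorazepam or clonazepam", "clonazepam or lorazepam"):
    for _ in range(50):
        answer = chat(f"Which drug is better for withdrawal: {ordering}?").lower()
        if "lorazepam" in answer and "clonazepam" not in answer:
            choices["lorazepam"] += 1
        elif "clonazepam" in answer and "lorazepam" not in answer:
            choices["clonazepam"] += 1
        else:
            choices["no clear answer"] += 1

print(votes, choices)
```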
WHAT DID THIS STUDY FIND?
Generative AI responses were generally rated as high quality by clinicians
Clinicians generally rated both generative AI models’ responses to substance use questions highly. Across all three substances (alcohol, cannabis, and opioids), ChatGPT-4 received an average rating of 2.75 (out of 3) for adequacy, 4.38 (out of 5) for appropriateness, and 3.92 (out of 5) for overall quality. LLaMa-2 received average ratings of 2.88 for adequacy, 4.45 for appropriateness, and 4.18 for overall quality. LLaMa-2’s overall quality rating, averaged across all three substances, was statistically significantly higher than ChatGPT-4’s. Broken down by substance, the two models received similar ratings for alcohol, but LLaMa-2 received significantly higher overall quality ratings for responses to cannabis- and opioid-related questions.
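For readers curious about what a statistical comparison of the two models’ ratings involves, the sketch below runs an independent-samples test on placeholder rating arrays. The data are synthetic and the choice of test (Mann-Whitney U) is an assumption for illustration; the study’s actual ratings and analytic approach are not reproduced here.

```python
# Illustrative only: synthetic placeholder ratings, not the study's data,
# and the Mann-Whitney U test is an assumed choice for comparing ordinal
# rating scales, not necessarily the authors' analysis.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Hypothetical 1-5 overall quality ratings: 75 responses x 3 raters per model.
gpt4_quality = rng.integers(3, 6, size=225)
llama2_quality = rng.integers(3, 6, size=225)

stat, p_value = mannwhitneyu(llama2_quality, gpt4_quality, alternative="greater")
print(f"LLaMa-2 mean: {llama2_quality.mean():.2f}, "
      f"ChatGPT-4 mean: {gpt4_quality.mean():.2f}, p = {p_value:.3f}")
```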
There were cases when generative AI models provided potentially harmful advice or misinformation
The qualitative fact-checking of the generative AI responses by a substance use counselor uncovered instances of nonfactual and potentially harmful suggestions. In some instances, the models provided resources that did not exist (e.g., helplines) or referenced non-existent peer-reviewed scientific publications. When questions asked about detoxing from fentanyl at home, the model did not explicitly advise the questioner against doing so. There were also cases in which suicidal thoughts were mentioned in questions but the models did not suggest that the individual seek professional help.
There was some inconsistency in generative AI responses
For the 100 rephrased questions asking whether it is safe to detox at home when abruptly quitting heroin, ChatGPT-4 responded “No” more than three out of four times (77%) and “Yes” the remaining 23% of the time. For the 100 prompts asking whether lorazepam or clonazepam was better for withdrawal, ChatGPT-4 chose lorazepam 32% of the time, clonazepam 17% of the time, and declined to answer more than half the time (51%).
WHAT ARE THE IMPLICATIONS OF THE STUDY FINDINGS?
Findings from this study highlight both the promise and potential pitfalls of using generative AI such as ChatGPT-4 and LLaMa-2 for information regarding substance use and recovery. While clinicians tended to rate the quality of these generative AI models’ responses to substance use and recovery questions as high, there were also instances in which responses were inaccurate or potentially harmful.
In addition to giving the generative AI models’ responses high ratings, clinicians offered open-ended feedback describing the responses as “warm,” “empathetic,” “validating,” and “personalized.” These findings align with prior studies in which licensed healthcare providers rated ChatGPT as providing high-quality, empathetic responses to patient questions posted to an online forum. This underscores the potential for generative AI tools to serve as a scalable resource for individuals seeking health-related information and support online, including for substance use disorder.
On the other hand, findings from this study add to existing concerns about the safe and ethical use of generative AI in health-related contexts, particularly those related to mental health and substance use. The study identified instances where the generative AI models provided inconsistent, inaccurate, and potentially harmful suggestions, such as offering non-existent resources or failing to clearly warn against detoxing from long-term heroin use at home. It is important to note that in the rephrasing and reprompting analysis the researchers used a “temperature” setting of 1.5 – a modifiable setting where higher values produce more variable responses – rather than ChatGPT-4’s default of 1.0, which likely contributed to the inconsistency (discussed further below). Nevertheless, these findings are consistent with broader concerns in the field of AI and medicine regarding the tendency of generative AI models to sometimes “hallucinate” (i.e., produce false or misleading information). Human clinicians do not always give perfectly safe and accurate advice either, but given the potential risks of providing inaccurate or misleading information to individuals affected by substance use disorder, there is an urgent need to establish frameworks for the safe and ethical adoption of AI in substance use and recovery settings.
It is important to note that some individuals affected by substance use disorder – and those working with them clinically or in community settings – are already using generative AI tools. One in three American Psychiatric Association-affiliated psychiatrists (33%) reported using generative AI like ChatGPT to “assist with answering clinical questions,” and three in four psychiatrists (75%) somewhat agreed or agreed that “the majority of their patients will consult [generative AI] tools before first seeing a doctor.” To enhance the quality and safety of generative AI tools already widely used by the public, trusted taxpayer-funded organizations like the National Institute on Drug Abuse and the National Institute on Alcohol Abuse and Alcoholism could work alongside industry partners to fine-tune generative AI models on high-quality substance use and recovery-related training data with expert oversight. These fine-tuned models could be validated in clinical settings to ensure their safety prior to release into the real world. Additionally, robust measures will be needed for ongoing monitoring of generative AI models’ accuracy and safety and for identifying biases that may arise from the training data. While generative AI tools hold immense potential for scaling access to high-quality substance use and recovery information and support, careful consideration of how best to develop and deploy these tools in the real world will be of the utmost importance to maximize benefits and mitigate harms.
The study evaluated only two generative AI models (ChatGPT-4 and LLaMa-2) and was limited to a small set of questions sourced from online forums. It also did not account for follow-up interactions in model prompting. Thus, the results should not be generalized to the overall performance of generative AI models in substance use and recovery contexts.
This study also did not systematically explore the effects of adjusting the “temperature” parameter, which influences the variability of AI-generated responses, with higher temperatures producing more variable output. In the rephrasing and reprompting analysis, the researchers used a temperature setting of 1.5, which resulted in inconsistent responses, whereas the default temperature setting for ChatGPT-4 is 1.0. The researchers opted for a higher temperature setting to generate more diverse answers, but this may have introduced bias, as lower temperature settings typically produce more consistent outputs. Future studies could investigate whether adjusting temperature settings in substance use disorder contexts can reduce inaccuracies and improve response consistency.
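For readers unfamiliar with the temperature parameter, the sketch below shows where it is set in a typical chat-completion call and how repeating the same prompt at a lower temperature would be expected to yield more uniform answers. Apart from the study’s reported value of 1.5, the values and prompt shown are illustrative assumptions.

```python
# Illustrative sketch of the temperature parameter; apart from the study's
# reported 1.5, the values and prompt below are assumptions.
from openai import OpenAI

client = OpenAI()

def answer(prompt: str, temperature: float) -> str:
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,  # 0.0-2.0 in the OpenAI API; default is 1.0
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

prompt = "Is it safe to detox from heroin at home? Answer only Yes or No."
# Repeating the same prompt at a high temperature (e.g., the study's 1.5)
# should produce more varied answers than at a low temperature (e.g., 0.2).
high_temp_answers = {answer(prompt, temperature=1.5) for _ in range(10)}
low_temp_answers = {answer(prompt, temperature=0.2) for _ in range(10)}
print(len(high_temp_answers), len(low_temp_answers))
```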
BOTTOM LINE
Two generative AI models, ChatGPT-4 and LLaMa-2, were generally rated by clinicians as providing high-quality responses to real-world substance use and recovery questions. These tools may hold promise for increasing access to high-quality information and empathetic support for individuals in or seeking recovery from substance use disorder. However, the models also at times produced inconsistent, inaccurate, and potentially harmful responses, such as recommending non-existent resources (e.g., helplines) and failing to direct individuals expressing suicidal thoughts to professional help. To maximize the potential public health benefits of generative AI tools while minimizing risks, stringent oversight will be essential, including specialized training, clinical validation before deployment in real-world settings, and ongoing monitoring of accuracy and safety.
For individuals and families seeking recovery: Generative AI tools, such as ChatGPT-4, may offer easy access to high-quality information on substance use disorder and recovery. However, these tools sometimes provide inaccurate or potentially harmful information. If seeking information or suggestions from generative AI tools, it is important to treat them as a starting point and to always discuss any suggestions they make with a healthcare professional. Generative AI tools should not replace professional medical advice, especially in sensitive contexts like substance use disorder.
For treatment professionals and treatment systems: Treatment professionals should be mindful that many individuals seeking support for substance use disorder and recovery are turning to online tools like generative AI. It is important to discuss the benefits and risks of these tools with patients, including their potential to provide both high-quality information and potentially harmful misinformation. Moving forward, it will also be important to encourage patients to consult with a professional both before and after using these publicly accessible tools, as they are not intended to replace professional medical advice.
For scientists: Further research is needed to fine-tune generative AI models, such as ChatGPT-4, for particular tasks within the substance use disorder and recovery field. By combining the capabilities of these models with substance use disorder and recovery domain expertise, rigorous testing and validation, and continuous human oversight, generative AI tools could enhance access to recovery support. It will also be essential for ongoing studies to monitor and address biases in models’ training data, especially given the history of stigma surrounding substance use disorder, which can prevent or delay individuals from seeking help.
For policy makers: Policy makers can support research on generative AI in substance use and recovery settings by allocating funding to projects focused on developing and clinically validating specialized models designed to support individuals in or seeking recovery. Additionally, regulations governing the use of generative AI in healthcare settings, particularly in sensitive areas like substance use disorder and recovery, are urgently needed. Ensuring that generative AI tools undergo rigorous clinical validation before widespread deployment and ongoing monitoring of accuracy and safety can help to ensure they improve patient outcomes and do not cause harm.