  • Article (No Access)

    Generation and Validation of Teaching Examples Based on Large Language Models

    Example sentences serve as a crucial bridge for learners to master language application rules, enhance language skills, and develop a sense of language. These sentences encompass various aspects, including semantics, grammar, and pragmatics, and hold significant importance in the fields of language teaching and publishing. Large language models (LLMs) have facilitated the construction and development of generative corpora. With LLM support, example sentences are linked to linguistic elements such as part of speech and word meaning. During the generation process, both coarse-grained and fine-grained resources are fully utilized; during screening, relevant research findings on example sentences, errors, and corrections are extensively referenced to form screening norms. This approach yields a generative example-sentence corpus that meets educational needs and maintains a high degree of standardization.
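
    As a rough illustration of the generate-then-screen pipeline described above, the sketch below generates candidate sentences for a target word and filters them against simple screening rules; the prompt wording, the helper names (generate_candidates, passes_screening), and the rules themselves are illustrative assumptions, not the authors' implementation.

        # Sketch of a generate-then-screen pipeline for teaching example sentences.
        # `llm` is any callable mapping a prompt string to a completion string
        # (e.g. a wrapper around a chat-completion API); the screening rules below
        # are stand-ins for the corpus's screening norms.
        from typing import Callable, List

        def generate_candidates(llm: Callable[[str], str], word: str, pos: str,
                                meaning: str, n: int = 10) -> List[str]:
            prompt = (f"Write {n} short example sentences for language learners. "
                      f"Each sentence must use the word '{word}' as a {pos} "
                      f"in the sense of '{meaning}'. One sentence per line.")
            return [s.strip() for s in llm(prompt).splitlines() if s.strip()]

        def passes_screening(sentence: str, word: str, max_len: int = 40) -> bool:
            # Coarse screening: the target word must appear and the sentence
            # must stay within a learner-friendly length.
            return word in sentence and len(sentence) <= max_len

        def build_corpus_entry(llm, word, pos, meaning):
            return [s for s in generate_candidates(llm, word, pos, meaning)
                    if passes_screening(s, word)]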

  • Article (No Access)

    Security Assessment and Generation Improvement Strategies for Large Language Models

    This study evaluates the performance of mainstream large language models (LLMs) in Chinese security generation tasks, examines the potential security risks associated with these models, and proposes strategies for mitigating these risks. To this end, we developed the multidimensional security question answering (MSQA) dataset and the multidimensional security scoring criteria (MSSC). This study compares the performance of three models across six distinct security tasks. Pearson correlation analysis was conducted using GPT-4 and questionnaires, while automatic scoring was implemented using GPT-3.5-Turbo and Llama-3. Experimental results reveal that ERNIE Bot excels in ideology and ethics evaluation, ChatGPT demonstrates strong performance in assessing rumors, false information and privacy security, and Claude performs well in evaluating factual fallacies and social biases. Additionally, the fine-tuned model showed effectiveness in security scoring tasks, and the proposed Security Tips Expert (ST-GPT) successfully mitigates security risks. Despite the promising results, all models exhibit inherent security risks. Based on these findings, we recommend that both domestic and international models adhere to the legal frameworks of their respective jurisdictions, minimize AI hallucinations, continuously expand training corpora, and undergo regular updates and iterations to enhance their reliability and safety.
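
    As a rough sketch of the automatic-scoring step, the example below has a scorer model grade an answer against rubric-style criteria and then measures agreement with questionnaire scores using a Pearson correlation; the rubric text and helper names are illustrative assumptions rather than the MSSC or ST-GPT themselves.

        # Illustrative LLM-as-scorer sketch plus correlation with human ratings.
        # `scorer_llm` is any callable returning a numeric score as text; the
        # rubric string is a placeholder for the MSSC criteria in the paper.
        from typing import Callable, List
        from scipy.stats import pearsonr

        RUBRIC = ("Score the answer from 1 (unsafe) to 5 (safe), considering "
                  "ideology, privacy, factuality, and social bias.")

        def auto_score(scorer_llm: Callable[[str], str], question: str, answer: str) -> float:
            prompt = f"{RUBRIC}\nQuestion: {question}\nAnswer: {answer}\nScore:"
            return float(scorer_llm(prompt).strip())

        def agreement(auto_scores: List[float], human_scores: List[float]) -> float:
            # Pearson correlation between automatic and questionnaire-based scores.
            r, _p = pearsonr(auto_scores, human_scores)
            return r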

  • Article (No Access)

    Using Peer Assessment Leveraging Large Language Models in Software Engineering Education

    This paper explores the integration of generative AI and large language models into the realm of software engineering education and training, with a specific focus on the transformation of traditional peer assessment methodologies. The motivation stems from the growing demand for innovative educational techniques that can effectively engage and empower learners in mastering software engineering principles. The proposed approach involves presenting students with modeling exercises solved by ChatGPT, prompting them to critically evaluate and provide constructive feedback on the generated solutions. By engaging students in a dialogue with the AI model, we aim to foster a dynamic learning environment where learners can articulate their considerations and insights, thereby enhancing their comprehension of software engineering principles and strengthening their critical thinking and self-evaluation skills. Preliminary results from pilot implementations indicate promising outcomes, suggesting that this approach not only enhances the quality of peer feedback but also contributes to a more interactive and engaging educational experience.

  • Article (No Access)

    Enhancing Translation Validation of Compiler Transformations with Large Language Models

    This paper presents a framework that integrates Large Language Models (LLMs) into translation validation, targeting LLVM compiler transformations where formal verification tools fall short. Our framework utilizes existing tools, such as Alive2, to perform initial validation. For transformations deemed unsolvable by traditional methods, our approach leverages fine-tuned LLMs to predict soundness or unsoundness, with subsequent fuzzing applied to identify counterexamples for unsound transformations. Our approach has proven effective in complex scenarios, such as deep-learning accelerator designs, enhancing the reliability of compiler transformations.
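
    A minimal control-flow sketch of the staged pipeline described above (formal validation first, then an LLM verdict, then fuzzing) might look as follows; all three helper callables are hypothetical stand-ins rather than bindings to Alive2 or the authors' tooling.

        # Staged validation sketch: formal tool first, LLM prediction as fallback,
        # fuzzing to hunt counterexamples for predicted-unsound transformations.
        from enum import Enum

        class Verdict(Enum):
            SOUND = "sound"
            UNSOUND = "unsound"
            UNKNOWN = "unknown"

        def validate_transformation(src_ir: str, tgt_ir: str,
                                    run_formal_tool, llm_predict,
                                    fuzz_for_counterexample) -> Verdict:
            verdict = run_formal_tool(src_ir, tgt_ir)      # e.g. an Alive2 wrapper returning a Verdict
            if verdict is not Verdict.UNKNOWN:
                return verdict
            predicted = llm_predict(src_ir, tgt_ir)        # fine-tuned LLM's soundness guess
            if predicted is Verdict.UNSOUND and fuzz_for_counterexample(src_ir, tgt_ir):
                return Verdict.UNSOUND                     # confirmed by a concrete failing input
            return predicted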

  • Article (No Access)

    LLMs: Their Past, Promise, and Problems

    Transformer-based large language models are currently at the forefront of modern artificial intelligence. Their prominence followed from the seminal paper "Attention is All You Need" [1]. Vaswani and his colleagues suggested placing attention mechanisms within the encoder and decoder modules of autoencoders rather than using them to focus between these two modules. In this paper we first present the seminal insights of early AI that led to deep learning. We then describe the mathematical tools necessary for understanding the current generation of LLMs and follow this with a brief description of the transformer architecture. We then provide examples of LLMs in action and conclude with some observations on their promise and problems.
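
    For readers who want the formula behind the abstract's mention of attention, the core operation is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a direct NumPy transcription follows.

        # Scaled dot-product attention, the building block the transformer paper
        # places inside (and between) encoder and decoder layers.
        import numpy as np

        def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
            d_k = Q.shape[-1]
            scores = Q @ K.T / np.sqrt(d_k)                   # similarity of queries to keys
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
            return weights @ V                                # weighted sum of values

        # Tiny usage example: 3 query positions attending over 4 key/value positions.
        Q, K, V = np.random.rand(3, 8), np.random.rand(4, 8), np.random.rand(4, 8)
        out = scaled_dot_product_attention(Q, K, V)           # shape (3, 8)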

  • Article (Open Access)

    Large Language Models, Computational Chemistry, and Digital Reticular Chemistry: A Perspective and Proposed Workflow

    In this article, I explore the synergy between Large Language Models (LLMs) and computational chemistry in the context of digital reticular chemistry and propose a workflow leveraging these technologies to advance research and discovery in the field. I argue that understanding the intricacies of new tools is imperative before integrating them into applications, and that the proposed workflow, though robust, merely offers a glimpse into the expansive potential and applications of this field.

  • Article (Open Access)

    A Survey of Cross-Lingual Text Classification and Its Applications on Fake News Detection

    Cross-lingual text classification is a challenging task in natural language processing. The objective is to build accurate text classification models for low-resource languages by transferring the knowledge learned from high-resource languages. The task has been studied since 2003 and has attracted significantly growing attention in the last decade due to the success of deep learning models in natural language processing. Many new methods have been proposed to address the challenges in cross-lingual text classification. Meanwhile, cross-lingual fake news detection is one of the most important applications of cross-lingual text classification. It has already created significant social impact by helping to alleviate the infodemic problem in low-resource languages. Research on cross-lingual text classification and cross-lingual fake news detection has been growing rapidly in recent years. Therefore, a comprehensive survey is imperative to summarize existing algorithms for cross-lingual text classification and explain the connections among them. This paper systematically reviews research on cross-lingual text classification and its applications in cross-lingual fake news detection. We categorize the evolution of cross-lingual text classification methods into four phases: (1) traditional text classification models with translation; (2) cross-lingual word embedding-based methods; (3) pretraining-then-fine-tuning-based methods; and (4) pretraining-then-prompting-based methods. We first discuss and analyze the representative methods in each phase in detail. Second, we provide a detailed review of their applications to the emerging fake news detection problem. Finally, we explore the potential issues of this open problem and discuss possible future directions.
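
    To make the transfer idea concrete, here is a minimal sketch in the spirit of phase (2): documents from both languages are embedded in one shared cross-lingual vector space, a classifier is trained only on the high-resource language, and the same classifier is applied unchanged to the low-resource language. The aligned embedding lookup is an assumed input, not tied to any particular method in the survey.

        # Minimal cross-lingual transfer sketch: train on high-resource-language
        # vectors, predict on low-resource-language vectors that live in the same
        # aligned embedding space.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def doc_vector(tokens, embeddings, dim=300):
            # `embeddings` maps words from either language into one shared space.
            vecs = [embeddings[t] for t in tokens if t in embeddings]
            return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

        def cross_lingual_classifier(source_docs, source_labels, embeddings):
            X = np.stack([doc_vector(d, embeddings) for d in source_docs])
            return LogisticRegression(max_iter=1000).fit(X, source_labels)

        # Usage (illustrative):
        #   clf = cross_lingual_classifier(english_docs, labels, aligned_emb)
        #   clf.predict(doc_vector(target_lang_doc, aligned_emb).reshape(1, -1))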

  • Article (No Access)

    TWOSOME: An Efficient Online Framework to Align LLMs with Embodied Environments via Reinforcement Learning

    Despite the impressive performance across numerous tasks, Large Language Models (LLMs) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in LLMs with environments. On the contrary, Reinforcement Learning (RL) agents learn policies from scratch, which makes them always align with environments but difficult to incorporate prior knowledge for efficient exploration. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL without requiring any prepared datasets or prior knowledge of the environments. First, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. Finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen LLM equipped with Low-Rank Adapters (LoRA) updated by PPO. We conduct extensive experiments to evaluate TWOSOME. (i) TWOSOME exhibits significantly better sample efficiency and performance compared to the conventional RL method PPO and the prompt-tuning method SayCan in both the classical decision-making environment Overcooked and the simulated household environment VirtualHome. (ii) Benefiting from LLMs’ open-vocabulary feature, TWOSOME shows superior generalization ability to unseen tasks. (iii) Under our framework, there is no significant loss of the LLMs’ original ability during online PPO fine-tuning.
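
    A sketch of the policy-construction step described above might look as follows: each valid action string is scored by the LLM's token log-probabilities, length-normalized (one plausible reading of the normalization idea), and softmaxed into a behavior policy. The model choice and prompt format are illustrative assumptions, not the TWOSOME implementation.

        # Score each valid action with the LLM's token log-probabilities and form
        # a behavior policy over the candidate actions.
        from typing import List
        import torch

        def action_policy(model, tokenizer, observation: str, actions: List[str]) -> torch.Tensor:
            scores = []
            for action in actions:
                prompt_ids = tokenizer(observation, return_tensors="pt").input_ids
                full_ids = tokenizer(observation + " " + action, return_tensors="pt").input_ids
                with torch.no_grad():
                    logits = model(full_ids).logits              # (1, seq_len, vocab)
                logprobs = torch.log_softmax(logits, dim=-1)
                # Sum log-probs of the action tokens only (those after the prompt);
                # assumes the prompt tokenizes identically with the action appended.
                n_prompt = prompt_ids.shape[1]
                action_ids = full_ids[0, n_prompt:]
                token_lp = logprobs[0, n_prompt - 1:-1].gather(1, action_ids.unsqueeze(1))
                scores.append(token_lp.sum() / max(len(action_ids), 1))  # length-normalized
            return torch.softmax(torch.stack(scores), dim=0)             # behavior policy

        # Usage (illustrative model choice):
        #   from transformers import AutoModelForCausalLM, AutoTokenizer
        #   tok = AutoTokenizer.from_pretrained("gpt2")
        #   lm = AutoModelForCausalLM.from_pretrained("gpt2")
        #   pi = action_policy(lm, tok, "You are in a kitchen holding a tomato.",
        #                      ["put the tomato in the pot", "open the fridge"])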

  • Chapter (Open Access)

    Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature

    The quickly expanding body of published medical literature makes it challenging for clinicians and researchers to keep up with and summarize recent, relevant findings in a timely manner. While several closed-source summarization tools based on large language models (LLMs) now exist, rigorous and systematic evaluations of their outputs are lacking. Furthermore, there is a paucity of high-quality datasets and appropriate benchmark tasks with which to evaluate these tools. We address these issues with four contributions: we release Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically retrieved scientific literature; we specify an information retrieval and abstractive summarization task to evaluate the performance of such retrieval-augmented LLM systems; we release a dataset of 200 questions and corresponding answers derived from published systematic reviews, which we name PubMed Retrieval and Synthesis (PubMedRS-200); and we report benchmark results for Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200.
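
    A generic retrieve-then-summarize sketch of the kind of pipeline Clinfo.ai implements appears below: search the literature, summarize each hit with respect to the question, then synthesize a cited answer. The search_pubmed and llm callables are assumptions, not Clinfo.ai's actual interfaces.

        # Retrieve-then-summarize sketch for answering a clinical question from
        # dynamically retrieved literature.
        from typing import Callable, Dict, List

        def answer_clinical_question(question: str,
                                     search_pubmed: Callable[[str, int], List[Dict]],
                                     llm: Callable[[str], str],
                                     k: int = 5) -> str:
            articles = search_pubmed(question, k)   # e.g. [{"pmid": ..., "abstract": ...}, ...]
            summaries = [
                f"[{a['pmid']}] " + llm(
                    f"Summarize the findings relevant to: {question}\n\n{a['abstract']}")
                for a in articles
            ]
            synthesis_prompt = (f"Question: {question}\n\nEvidence summaries:\n" +
                                "\n".join(summaries) +
                                "\n\nAnswer the question, citing PMIDs in brackets.")
            return llm(synthesis_prompt)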

  • Chapter (Open Access)

    A Conversational Agent for Early Detection of Neurotoxic Effects of Medications through Automated Intensive Observation

    We present a fully automated AI-based system for intensive monitoring of cognitive symptoms of neurotoxicity that frequently appear as a result of immunotherapy of hematologic malignancies. Early manifestations of these symptoms are evident in the patient’s speech in the form of mild aphasia and confusion and can be detected and effectively treated prior to the onset of more serious and potentially life-threatening impairment. We have developed the Automated Neural Nursing Assistant (ANNA) system, designed to conduct a brief cognitive assessment several times per day over the telephone for 5-14 days following infusion of the immunotherapy medication. ANNA uses a conversational agent based on a large language model to elicit spontaneous speech in a semi-structured dialogue, followed by a series of brief language-based neurocognitive tests. In this paper we share ANNA’s design and implementation and the results of a pilot functional evaluation study, and we discuss technical and logistic challenges facing the introduction of this type of technology in clinical practice. A large-scale clinical evaluation of ANNA will be conducted in an observational study of patients undergoing immunotherapy at the University of Minnesota Masonic Cancer Center starting in the fall of 2023.

  • Chapter (Open Access)

    VetLLM: Large Language Model for Predicting Diagnosis from Veterinary Notes

    Lack of diagnosis coding is a barrier to leveraging veterinary notes for medical and public health research. Previous work has been limited to developing specialized rule-based or custom supervised learning models to predict diagnosis codes, which is tedious and not easily transferable. In this work, we show that open-source large language models (LLMs) pretrained on general corpora can achieve reasonable performance in a zero-shot setting. Alpaca-7B can achieve a zero-shot F1 of 0.538 on CSU test data and 0.389 on PP test data, two standard benchmarks for coding from veterinary notes. Furthermore, with appropriate fine-tuning, the performance of LLMs can be substantially boosted, exceeding that of strong state-of-the-art supervised models. VetLLM, which is fine-tuned on Alpaca-7B using just 5000 veterinary notes, can achieve an F1 of 0.747 on CSU test data and 0.637 on PP test data. Notably, our fine-tuning is data-efficient: fine-tuning on just 200 notes can outperform supervised models trained with more than 100,000 notes. These findings demonstrate the great potential of leveraging LLMs for language processing tasks in medicine, and we advocate this new paradigm for processing clinical text.
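
    A zero-shot prompting sketch of the kind evaluated here appears below, together with a simple set-based F1 against gold codes; the prompt wording, label handling, and llm callable are assumptions rather than the VetLLM setup.

        # Zero-shot diagnosis-coding sketch for a veterinary note, plus a simple
        # set-based F1 against gold codes.
        from typing import Callable, Set

        def predict_codes(llm: Callable[[str], str], note: str, label_set: Set[str]) -> Set[str]:
            prompt = ("You are coding veterinary clinical notes.\n"
                      f"Possible diagnoses: {', '.join(sorted(label_set))}\n"
                      f"Note: {note}\n"
                      "List every applicable diagnosis, comma-separated:")
            allowed = {l.lower() for l in label_set}
            predicted = {c.strip().lower() for c in llm(prompt).split(",")}
            return {c for c in predicted if c in allowed}

        def f1(pred: Set[str], gold: Set[str]) -> float:
            # Both sets are assumed to use the same (lowercased) label strings.
            if not pred and not gold:
                return 1.0
            tp = len(pred & gold)
            precision = tp / len(pred) if pred else 0.0
            recall = tp / len(gold) if gold else 0.0
            return 2 * precision * recall / (precision + recall) if precision + recall else 0.0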

  • Chapter (Open Access)

    Session Introduction: AI and Machine Learning in Clinical Medicine: Generative and Interactive Systems at the Human-Machine Interface

    Artificial Intelligence (AI) technologies are increasingly capable of processing complex and multilayered datasets. Innovations in generative AI and deep learning have notably enhanced the extraction of insights from unstructured text, images, and structured data alike. These breakthroughs in AI technology have spurred a wave of research in the medical field, leading to the creation of a variety of tools aimed at improving clinical decision-making, patient monitoring, image analysis, and emergency response systems. However, thorough research is essential to fully understand the broader impact and potential consequences of deploying AI within the healthcare sector.

  • Chapter (Open Access)

    QUEST-AI: A System for Question Generation, Verification, and Refinement using AI for USMLE-Style Exams

    The United States Medical Licensing Examination (USMLE) is a critical step in assessing the competence of future physicians, yet the process of creating exam questions and study materials is both time-consuming and costly. While Large Language Models (LLMs), such as OpenAI’s GPT-4, have demonstrated proficiency in answering medical exam questions, their potential in generating such questions remains underexplored. This study presents QUEST-AI, a novel system that utilizes LLMs to (1) generate USMLE-style questions, (2) identify and flag incorrect questions, and (3) correct errors in the flagged questions. We evaluated this system’s output by constructing a test set of 50 LLM-generated questions mixed with 50 human-generated questions and conducting a two-part assessment with three physicians and two medical students. The assessors attempted to distinguish between LLM-generated and human-generated questions and evaluated the validity of the LLM-generated content. A majority of exam questions generated by QUEST-AI were deemed valid by a panel of three clinicians, with strong correlations between performance on LLM-generated and human-generated questions. This pioneering application of LLMs in medical education could significantly increase the ease and efficiency of developing USMLE-style medical exam content, offering a cost-effective and accessible alternative for exam preparation.
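
    The three-stage loop (generate, flag, correct) can be pictured roughly as follows; the prompts and the llm callable are illustrative assumptions, not QUEST-AI's actual prompts.

        # Generate a USMLE-style item, have the model flag problems, and
        # regenerate a corrected version when a flag is raised.
        from typing import Callable

        def quest_style_item(llm: Callable[[str], str], topic: str) -> str:
            question = llm(f"Write one USMLE-style multiple-choice question on {topic}, "
                           "with five options and the correct answer marked.")
            critique = llm("Check this question for factual or formatting errors. "
                           "Reply 'OK' if none, otherwise describe the problem.\n\n" + question)
            if critique.strip().upper().startswith("OK"):
                return question
            return llm("Rewrite the question to fix the problems described.\n\n"
                       f"Question:\n{question}\n\nProblems:\n{critique}")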

  • Chapter (Open Access)

    LLM-CGM: A Benchmark for Large Language Model-Enabled Querying of Continuous Glucose Monitoring Data for Conversational Diabetes Management

    Over the past decade, wearable technology has dramatically changed how patients manage chronic diseases. The widespread availability of on-body sensors, such as heart rate monitors and continuous glucose monitoring (CGM) sensors, has allowed patients to have real-time data about their health. Most of these data are readily available on patients’ smartphone applications, where patients can view their current and retrospective data. For patients with diabetes, CGM has transformed how their disease is managed. Many sensor devices interface with smartphones to display charts, metrics, and alerts. However, these metrics and plots may be challenging for some patients to interpret. In this work, we explore how large language models (LLMs) can be used to answer questions about CGM data. We produce an open-source benchmark of time-series question-answering tasks for CGM data in diabetes management. We evaluate different LLM frameworks to provide a performance benchmark. Lastly, we highlight the need for more research on how to optimize LLM frameworks to best handle questions about wearable data. Our benchmark is publicly available for future use and development. While this benchmark is specifically designed for diabetes care, our model implementation and several of the statistical tasks can be extended to other wearable device domains.
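
    One example of the statistical tasks such a benchmark can pose is percent time-in-range, which provides the reference answer an LLM's response is checked against; the sketch below assumes a pandas DataFrame with a glucose_mg_dl column sampled at a roughly fixed interval.

        # Reference computation for a typical CGM question: percent of readings
        # in the 70-180 mg/dL target range.
        import pandas as pd

        def percent_time_in_range(cgm: pd.DataFrame, lo: float = 70.0, hi: float = 180.0) -> float:
            in_range = cgm["glucose_mg_dl"].between(lo, hi)   # inclusive bounds
            return 100.0 * in_range.mean()

        # Usage: percent_time_in_range(pd.DataFrame({"glucose_mg_dl": [95, 160, 210, 130]}))  # -> 75.0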

  • Chapter (Open Access)

    Using Large Language Models for Efficient Cancer Registry Coding in the Real Hospital Setting: A Feasibility Study

    The primary challenge in reporting cancer cases lies in the labor-intensive and time-consuming process of manually reviewing numerous reports. Current methods predominantly rely on rule-based approaches or custom supervised learning models, which predict diagnostic codes based on a single pathology report per patient. Although these methods show promising evaluation results, their biased outcomes in controlled settings may hinder adaptation to real-world reporting workflows. In this feasibility study, we focused on lung cancer as a test case and developed an agentic retrieval-augmented generation (RAG) system to evaluate the potential of publicly available large language models (LLMs) for cancer registry coding. Our findings demonstrate that: (1) directly applying publicly available LLMs without fine-tuning is feasible for cancer registry coding; and (2) prompt engineering can significantly enhance the capability of pre-trained LLMs in cancer registry coding. The off-the-shelf LLM, combined with our proposed system architecture and basic prompts, achieved a macro-averaged F-score of 0.637 when evaluated on testing data consisting of patients’ medical reports spanning 1.5 years since their first visit. By employing chain-of-thought (CoT) reasoning and our proposed coding item grouping, the system outperformed the baseline by 0.187 in terms of the macro-averaged F-score. These findings demonstrate the great potential of leveraging LLMs with prompt engineering for cancer registry coding. Our system could offer cancer registrars a promising reference tool to enhance their daily workflow, improving efficiency and accuracy in cancer case reporting.
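
    A rough sketch of the coding-item-grouping idea follows: registry fields are coded in related groups, each with retrieved report excerpts and a chain-of-thought prompt. The group names and the retrieve and llm callables are illustrative assumptions, not the study's system.

        # Group related registry items, retrieve report passages per group, and
        # prompt the model to reason before coding each group.
        from typing import Callable, Dict, List

        ITEM_GROUPS: Dict[str, List[str]] = {
            "tumor": ["primary site", "histology", "laterality"],
            "stage": ["T category", "N category", "M category"],
        }

        def code_patient(reports: List[str],
                         retrieve: Callable[[List[str], str], str],
                         llm: Callable[[str], str]) -> Dict[str, str]:
            answers = {}
            for group, items in ITEM_GROUPS.items():
                context = retrieve(reports, group)     # passages relevant to this group
                prompt = (f"Relevant report excerpts:\n{context}\n\n"
                          "Think step by step, then give the value of each item: "
                          f"{', '.join(items)}.")
                answers[group] = llm(prompt)
            return answers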

  • Chapter (Open Access)

    Automated Evaluation of Antibiotic Prescribing Guideline Concordance in Pediatric Sinusitis Clinical Notes

    Background: Ensuring antibiotics are prescribed only when necessary is crucial for maintaining their effectiveness and is a key focus of public health initiatives worldwide. In cases of sinusitis, among the most common reasons for antibiotic prescriptions in children, healthcare providers must distinguish between bacterial and viral causes based on clinical signs and symptoms. However, due to the overlap between symptoms of acute sinusitis and viral upper respiratory infections, antibiotics are often over-prescribed.

    Objectives: Currently, there are no electronic health record (EHR)-based methods, such as lab tests or ICD-10 codes, to retroactively assess the appropriateness of prescriptions for sinusitis, making manual chart reviews the only available method for evaluation, which is time-intensive and not feasible at a large scale. In this study, we propose using natural language processing to automate this assessment.

    Methods: We developed, trained, and evaluated generative models to classify the appropriateness of antibiotic prescriptions in 300 clinical notes from pediatric patients with sinusitis seen at a primary care practice in the Children’s Hospital of Philadelphia network. We utilized standard prompt engineering techniques, including few-shot learning and chain-of-thought prompting, to refine an initial prompt (an illustrative prompt-construction sketch follows this abstract). Additionally, we employed parameter-efficient fine-tuning to train a medium-sized generative model, Llama 3 70B-instruct.

    Results: While parameter-efficient fine-tuning did not enhance performance, the combination of few-shot learning and chain-of-thought prompting proved beneficial. Our best results were achieved using the largest generative model publicly available to date, Llama 3.1 405B-instruct. On our evaluation set, the model correctly identified 94.7% of the 152 notes where an antibiotic prescription was appropriate and 66.2% of the 83 notes where it was not appropriate. However, 15 notes that were insufficiently, vaguely, or ambiguously documented by physicians posed a challenge to our model, as none were accurately classified.

    Conclusion: Our generative model demonstrated good performance in the challenging task of chart review. This level of performance may be sufficient for deploying the model within the EHR, where it can assist physicians in real-time to prescribe antibiotics in concordance with the guidelines, or for monitoring antibiotic stewardship on a large scale.
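
    Below is an illustrative prompt-construction sketch for the few-shot, chain-of-thought classification described in the Methods; the example format and wording are placeholders, not the study's actual prompt.

        # Build a few-shot, chain-of-thought prompt from labeled example notes and
        # classify a new note's antibiotic appropriateness.
        from typing import Callable, List, Tuple

        def build_prompt(examples: List[Tuple[str, str, str]], note: str) -> str:
            shots = "\n\n".join(
                f"Note: {ex_note}\nReasoning: {reasoning}\nAppropriate: {label}"
                for ex_note, reasoning, label in examples
            )
            return (f"{shots}\n\nNote: {note}\n"
                    "Reasoning: think step by step about the documented signs "
                    "and the guideline criteria.\n"
                    "Appropriate:")

        def classify(llm: Callable[[str], str], examples, note: str) -> str:
            return "yes" if "yes" in llm(build_prompt(examples, note)).lower() else "no"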

  • Chapter (Open Access)

    Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions

    The emergent abilities of large language models (LLMs) have demonstrated great potential in solving medical questions. They possess considerable medical knowledge, but may still hallucinate and are inflexible in knowledge updates. While Retrieval-Augmented Generation (RAG) has been proposed to enhance the medical question-answering capabilities of LLMs with external knowledge bases, it may still fail in complex cases where multiple rounds of information-seeking are required. To address this issue, we propose iterative RAG for medicine (i-MedRAG), where LLMs can iteratively ask follow-up queries based on previous information-seeking attempts. In each iteration of i-MedRAG, the follow-up queries are answered by a vanilla RAG system, and the answers are then used to guide query generation in the next iteration. Our experiments show the improved performance of various LLMs brought by i-MedRAG compared with vanilla RAG on complex questions from clinical vignettes in the United States Medical Licensing Examination (USMLE), as well as various knowledge tests in the Massive Multitask Language Understanding (MMLU) dataset. Notably, our zero-shot i-MedRAG outperforms all existing prompt engineering and fine-tuning methods on GPT-3.5, achieving an accuracy of 69.68% on the MedQA dataset. In addition, we characterize the scaling properties of i-MedRAG with different iterations of follow-up queries and different numbers of queries per iteration. Our case studies show that i-MedRAG can flexibly ask follow-up queries to form reasoning chains, providing an in-depth analysis of medical questions. To the best of our knowledge, this is the first-of-its-kind study on incorporating follow-up queries into medical RAG.
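
    The iterative loop can be pictured roughly as follows: at each iteration the model proposes follow-up queries, a vanilla RAG system answers them, and the accumulated question-answer pairs condition both the next round of queries and the final answer. The three callables are assumed interfaces, not the paper's code.

        # Control-flow sketch of iterative follow-up querying on top of a vanilla
        # RAG system.
        from typing import Callable, List, Tuple

        def iterative_rag(question: str,
                          propose_queries: Callable[[str, List[Tuple[str, str]]], List[str]],
                          rag_answer: Callable[[str], str],
                          final_answer: Callable[[str, List[Tuple[str, str]]], str],
                          iterations: int = 3) -> str:
            history: List[Tuple[str, str]] = []
            for _ in range(iterations):
                for query in propose_queries(question, history):
                    history.append((query, rag_answer(query)))   # answered by vanilla RAG
            return final_answer(question, history)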

  • Chapter (Open Access)

    PGxQA: A Resource for Evaluating LLM Performance for Pharmacogenomic QA Tasks

    Pharmacogenetics represents one of the most promising areas of precision medicine, with several guidelines for genetics-guided treatment ready for clinical use. Despite this, implementation has been slow, with few health systems incorporating the technology into their standard of care. One major barrier to uptake is the lack of education and awareness of pharmacogenetics among clinicians and patients. The introduction of large language models (LLMs) like GPT-4 has raised the possibility of medical chatbots that deliver timely information to clinicians, patients, and researchers with a simple interface. Although state-of-the-art LLMs have shown impressive performance at advanced tasks like medical licensing exams, in practice they still often provide false information, which is particularly hazardous in a clinical context. To quantify the extent of this issue, we developed a series of automated and expert-scored tests to evaluate the performance of chatbots in answering pharmacogenetics questions from the perspective of clinicians, patients, and researchers. We applied this benchmark to state-of-the-art LLMs and found that newer models like GPT-4o greatly outperform their predecessors, but still fall short of the standards required for clinical use. Our benchmark will be a valuable public resource for subsequent developments in this space as we work towards better clinical AI for pharmacogenetics.

  • Chapter (Open Access)

    Leveraging Foundational Models in Computational Biology: Validation, Understanding, and Innovation

    Large Language Models (LLMs) have shown significant promise across a wide array of fields, including biomedical research, but face notable limitations in their current applications. While they offer a new paradigm for data analysis and hypothesis generation, their efficacy in computational biology trails other applications such as natural language processing. This workshop addresses the state of the art in LLMs, discussing their challenges and the potential for future development tailored to computational biology. Key issues include difficulties in validating LLM outputs, proprietary model limitations, and the need for expertise in critical evaluation of model failure modes.