Publications
Publications by category, in reverse chronological order.
2025
- Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol. Roham Koohestani, Philippe Bekker, and Maliheh Izadi. 2025
Benchmarks are essential for consistent evaluation and reproducibility. The integration of Artificial Intelligence into Software Engineering (AI4SE) has given rise to numerous benchmarks for tasks such as code generation and bug fixing. However, this surge presents challenges: (1) scattered benchmark knowledge across tasks, (2) difficulty in selecting relevant benchmarks, (3) the absence of a uniform standard for benchmark development, and (4) limitations of existing benchmarks. In this paper, we review 173 studies and identify 204 AI4SE benchmarks. We classify these benchmarks, analyze their limitations, and expose gaps in practices. Based on our review, we created BenchScout, a semantic search tool to find relevant benchmarks, using automated clustering of the contexts from associated studies. We conducted a user study with 22 participants to evaluate BenchScout’s usability, effectiveness, and intuitiveness which resulted in average scores of 4.5, 4.0, and 4.1 out of 5. To advance benchmarking standards, we propose BenchFrame, a unified method to enhance benchmark quality. As a case study, we applied BenchFrame to the HumanEval benchmark and addressed its main limitations. This led to HumanEvalNext, featuring (1) corrected errors, (2) improved language conversion, (3) expanded test coverage, and (4) increased difficulty. We then evaluated ten state-of-the-art code language models on HumanEval, HumanEvalPlus, and HumanEvalNext. On HumanEvalNext, models showed a pass@1 score reduction of 31.22% and 19.94% compared to HumanEval and HumanEvalPlus, respectively.
- Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks. Ali Al-Kaswan, Sebastian Deatc, Begum Koc, Arie Deursen, and 1 more author. In 2025 ACM International Conference on the Foundations of Software Engineering (FSE), 2025
Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLMs) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs, both open-source and closed-source, and both general-purpose and code-specific. Furthermore, we investigate the impact of models’ size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base models due to their design choices. On the other hand, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.
- A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics. Jonathan Katzy, Razvan Mihai Popescu, Arie Deursen, and Maliheh Izadi. In The International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE), 2025
Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multi-lingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. They highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
- Long Code Arena: a set of benchmarks for long-context code models. Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, and 7 more authors. arXiv preprint arXiv:2406.11612, 2025
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
- The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models. Jonathan Katzy, Razvan Mihai Popescu, Arie Deursen, and Maliheh Izadi. In 2025 The 2nd ACM International Conference on AI Foundation Models and Software Engineering (FORGE), 2025
The recent rise in the popularity of large language models has spurred the development of extensive code datasets needed to train them. This has left limited code available for collection and use in the downstream investigation of specific behaviors, or evaluation of large language models without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
- Rethinking IDE Customization for Enhanced HAX: A Hyperdimensional Perspective. Roham Koohestani and Maliheh Izadi. In 2025 The 2nd Workshop on Integrated Development Environments (IDE), 2025
As Integrated Development Environments (IDEs) increasingly integrate Artificial Intelligence, Software Engineering faces both benefits like productivity gains and challenges like mismatched user preferences. We propose Hyper-Dimensional (HD) vector spaces to model Human-Computer Interaction, focusing on user actions, stylistic preferences, and project context. These contributions aim to inspire further research on applying HD computing in IDE design.
- How Much Do Code Language Models Remember? An Investigation on Data Extraction Attacks before and after Fine-tuning. Fabio Salerno, Ali Al-Kaswan, and Maliheh Izadi. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR), 2025
Code language models, while widely popular, are often trained on unsanitized source code gathered from across the Internet. Previous work revealed that pre-trained models can remember the content of their training data and regurgitate it through data extraction attacks. Due to the large size of current models, only a few entities have the resources for pre-training such models. However, fine-tuning requires fewer resources and is increasingly used by both small and large entities for its effectiveness on specialized data. Such small curated data for fine-tuning might contain sensitive information or proprietary assets. In this study, we attack both pre-trained and fine-tuned code language models to investigate the extent of data extractability. We first develop a custom benchmark to assess the vulnerability of both pre-training and fine-tuning samples to extraction attacks. Our findings reveal that 54.9% of extractable pre-training data could be retrieved from StarCoder2-15B, whereas this number decreased to 23.5% after fine-tuning. This indicates that fine-tuning reduces the extractability of pre-training data. However, compared to larger models, fine-tuning smaller models increases their vulnerability to data extraction attacks on fine-tuning data. Given the potential sensitivity of fine-tuning data, this can lead to more severe consequences. Lastly, we manually analyzed 2000 extractable samples before and after fine-tuning. We found that data carriers and licensing information are the data categories most likely to be memorized by pre-trained and fine-tuned models, while the latter is the most likely to be forgotten after fine-tuning.
- Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues. Daniele Cipollone, Changjie Wang, Mariano Scazzariello, Simone Ferlin, and 3 more authors. arXiv preprint arXiv:2501.05258, 2025
In today’s digital landscape, the importance of timely and accurate vulnerability detection has significantly increased. This paper presents a novel approach that leverages transformer-based models and machine learning techniques to automate the identification of software vulnerabilities by analyzing GitHub issues. We introduce a new dataset specifically designed for classifying GitHub issues relevant to vulnerability detection. We then examine various classification techniques to determine their effectiveness. The results demonstrate the potential of this approach for real-world application in early vulnerability detection, which could substantially reduce the window of exploitation for software vulnerabilities. This research makes a key contribution to the field by providing a scalable and computationally efficient framework for automated detection, enabling the prevention of compromised software usage before official notifications. This work has the potential to enhance the security of open-source software ecosystems.
- Leveraging large language models for enhancing the understandability of generated unit tests. Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025
Automated unit test generators, particularly search-based software testing tools like EvoSuite, are capable of generating tests with high coverage. Although these generators alleviate the burden of writing unit tests, they often pose challenges for software engineers in terms of understanding the generated tests. To address this, we introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases. We achieve this enhancement through contextualizing test data, improving identifier naming, and adding descriptive comments. Through a controlled experiment with 32 participants from both academia and industry, we investigate how the understandability of unit tests affects a software engineer’s ability to perform bug-fixing tasks. We selected bug-fixing to simulate a real-world scenario that emphasizes the importance of understandable test cases. We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases. From the post-test questionnaire, we gathered that participants found that enhanced test names, test data, and variable names improved their bug-fixing process.
- Human-AI Experience in Integrated Development Environments: A Systematic Literature Review. Agnia Sergeyuk, Ilya Zakharov, Ekaterina Koshchenko, and Maliheh Izadi. 2025
The integration of Artificial Intelligence (AI) into Integrated Development Environments (IDEs) is reshaping software development, fundamentally altering how developers interact with their tools. This shift marks the emergence of Human-AI Experience in Integrated Development Environment (in-IDE HAX), a field that explores the evolving dynamics of Human-Computer Interaction in AI-assisted coding environments. Despite rapid adoption, research on in-IDE HAX remains fragmented, which highlights the need for a unified overview of current practices, challenges, and opportunities. To provide a structured overview of existing research, we conduct a systematic literature review of 89 studies, summarizing current findings and outlining areas for further investigation. Our findings reveal that AI-assisted coding enhances developer productivity but also introduces challenges, such as verification overhead, automation bias, and over-reliance, particularly among novice developers. Furthermore, concerns about code correctness, security, and maintainability highlight the urgent need for explainability, verification mechanisms, and adaptive user control. Although recent advances have driven the field forward, significant research gaps remain, including a lack of longitudinal studies, personalization strategies, and AI governance frameworks. This review provides a foundation for advancing in-IDE HAX research and offers guidance for responsibly integrating AI into software development.
- HyperSeq: A Hyper-Adaptive Representation for Predictive Sequencing of States. Roham Koohestani and Maliheh Izadi. In 2025 ACM International Conference on the Foundations of Software Engineering (FSE), 2025
In the rapidly evolving world of software development, the surge in developers’ reliance on AI-driven tools has transformed Integrated Development Environments into powerhouses of advanced features. This transformation, while boosting developers’ productivity to unprecedented levels, comes with a catch: increased hardware demands for software development. Moreover, the significant economic and environmental toll of using these sophisticated models necessitates mechanisms that reduce unnecessary computational burdens. We propose HyperSeq - Hyper-Adaptive Representation for Predictive Sequencing of States - a novel, resource-efficient approach designed to model developers’ cognitive states. HyperSeq facilitates precise action sequencing and enables real-time learning of user behavior. Our preliminary results show how HyperSeq excels in forecasting action sequences and achieves remarkable prediction accuracies that go beyond 70%. Notably, the model’s online-learning capability allows it to substantially enhance its predictive accuracy in a majority of cases and increases its capability in forecasting next user actions with sufficient iterations for adaptation. Ultimately, our objective is to harness these predictions to refine and elevate the user experience dynamically within the IDE.
2024
- The Design Space of in-IDE Human-AI Experience. Agnia Sergeyuk, Ekaterina Koshchenko, Ilya Zakharov, Timofey Bryksin, and 1 more author. arXiv preprint arXiv:2410.08676, 2024
Nowadays, integration of AI-driven tools within Integrated Development Environments (IDEs) is reshaping the software development lifecycle. Existing research highlights that users expect these tools to be efficient, context-aware, accurate, user-friendly, customizable, and secure. However, a major gap remains in understanding developers’ needs and challenges, particularly when interacting with AI systems in IDEs and from the perspectives of different user groups. In this work, we address this gap through structured interviews with 35 developers from three different groups: Adopters, Churners, and Non-Users of AI in IDEs to create a comprehensive Design Space of in-IDE Human-AI Experience. Our results highlight key areas of Technology Improvement, Interaction, and Alignment in in-IDE AI systems, as well as Simplifying Skill Building and Programming Tasks. Our key findings stress the need for AI systems that are more personalized, proactive, and reliable. We also emphasize the importance of context-aware and privacy-focused solutions and better integration with existing workflows. Furthermore, our findings show that while Adopters appreciate advanced features and non-interruptive integration, Churners emphasize the need for improved reliability and privacy. Non-Users, in contrast, focus on skill development and ethical concerns as barriers to adoption. Lastly, we provide recommendations for industry practitioners aiming to enhance AI integration within developer workflows.
- An exploratory investigation into code license infringements in large language model training datasets. Jonathan Katzy, Razvan Popescu, Arie Van Deursen, and Maliheh Izadi. 2024
Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of …
- Beyond Acceptance Rates: The Impact of JetBrains AI Assistant and FLCC. Remco Schrijver, Pouria Derakhshanfar, Annibale Panichella, Arie Deursen, and 1 more author. 2024
LLM (Large Language Model) powered AI (Artificial Intelligence) assistants are a popular tool used by programmers, but what impact do they have? In this thesis we investigate two such tools designed by JetBrains: AI Assistant and FLCC (Full Line Code Completion). We collected over 40 million actions, including editing, code executions, and typing, in the form of metric data spread out over 26 thousand users. With this data, we look at how user behavior changes when assisted by AI Assistant or FLCC. Users spent more time in their IDEs (Integrated Development Environments) and typed more when assisted. In most cases, we see a decline in (manual) testing, or at best an equivalent level. And how do multi-programming language benchmarks reflect acceptance rates by users for these respective languages? There seems to be no real correlation between these benchmark results and what users accept in their generations, but the available benchmarks are also limited.
- Generative AI in Software Engineering Must be Human-centered: The Copenhagen Manifesto. Daniel Russo, Sebastian Baltes, Niels Berkel, Paris Avgeriou, and 31 more authors. Journal of Systems and Software, 2024
- Creativity, Generative AI, and Software Development: A Research Agenda. Victoria Jackson, Bogdan Vasilescu, Daniel Russo, Paul Ralph, and 6 more authors. 2024
Creativity has always been considered a major differentiator to separate the good from the great, and we believe the importance of creativity for software development will only increase as GenAI becomes embedded in developer tool-chains and working practices. This paper uses the McLuhan tetrad alongside scenarios of how GenAI may disrupt software development more broadly, to identify potential impacts GenAI may have on creativity within software development. The impacts are discussed along with a future research agenda comprising six connected themes that consider how individual capabilities, team capabilities, the product, unintended consequences, society, and human aspects can be affected.
- A transformer-based approach for smart invocation of automatic code completion. Aral Moor, Arie Deursen, and Maliheh Izadi. 2024
We received the ACM Distinguished Paper Award for this work.
Transformer-based language models are highly effective for code completion, with much research dedicated to enhancing the content of these completions. Despite their effectiveness, these models come with high operational costs and can be intrusive, especially when they suggest too often and interrupt developers who are concentrating on their work. Current research largely overlooks how these models interact with developers in practice and neglects to address when a developer should receive completion suggestions. To tackle this issue, we developed a machine learning model that can accurately predict when to invoke a code completion tool given the code context and available telemetry data. To do so, we collect a dataset of 200k developer interactions with our cross-IDE code completion plugin and train several invocation filtering models. Our results indicate that our small-scale transformer model …
- In-IDE human-AI experience in the era of large language models: a literature review. Agnia Sergeyuk, Sergey Titov, and Maliheh Izadi. In 2024 The 1st Workshop on Integrated Development Environments (IDE), 2024
Integrated Development Environments (IDEs) have become central to modern software development, especially with the integration of Artificial Intelligence (AI) to enhance programming efficiency and decision-making. The study of in-IDE Human-AI Experience is critical in understanding how these AI tools are transforming the software development process, impacting programmer productivity, and influencing code quality. We conducted a literature review to study the current state of in-IDE Human-AI Experience research, bridging a gap in understanding the nuanced interactions between programmers and AI assistants within IDEs. By analyzing 36 selected papers, our study illustrates three primary research branches: Design, Impact, and Quality of Interaction. The trends, challenges, and opportunities identified in this paper emphasize the evolving landscape of software development and inform future …
- Investigating the performance of language models for completing code in functional programming languages: a Haskell case study. Tim Van Dam, Frank Van Heijden, Philippe De Bekker, Berend Nieuwschepen, and 2 more authors. In 2024 The 1st ACM International Conference on AI Foundation Models and Software Engineering (FORGE), 2024
Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of …
- Language models for code completion: A practical evaluation. Maliheh Izadi, Jonathan Katzy, Tim Van Dam, Marc Otten, and 2 more authors. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), 2024
Transformer-based language models for automatic code completion have shown great promise so far, yet the evaluation of these models rarely uses real data. This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We first developed an open-source IDE extension, Code4Me, for the online evaluation of the models. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. These models were then evaluated using six standard metrics across twelve programming languages. Next, we conducted a qualitative study of 1690 real-world completion requests to identify the reasons behind the poor model performance. A comparative analysis of the models’ performance in online and offline settings was also performed, using benchmark synthetic datasets and two …
- Traces of memorisation in large language models for code. Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), 2024
Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models, and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks, like their …
- Maven Unzipped: Exploring the Impact of Library Packaging on the Ecosystem. Mehdi Keshani, Gideon Bot, Priyam Rungta, Maliheh Izadi, and 2 more authors. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2024
Maven is a popular dependency management tool and ecosystem used by millions of developers. However, the overwhelming amount of available open-source software and the lack of proper ecosystem governance pose risks to the security and effectiveness of the ecosystem. This necessitates a comprehensive understanding of the ecosystem to guide future decision-making and promote effective practices. Despite numerous studies on aspects of Maven, such as vulnerabilities, breaking changes, and bloated dependencies, a knowledge gap concerning its overall state and health still exists. This gap impedes the adoption of effective practices, potentially impacting the productivity and efficiency of projects and the ecosystem as a whole. This paper explores the fundamental aspects of the Maven ecosystem. We investigate the packaging practices of Maven libraries with a focus on the content of the libraries, their …
- Message from NLBSE 2024 Program Chairs. Maliheh Izadi, Andrea Di Sorbo, and Sebastiano Panichella. 2024
Natural Language Processing (NLP) refers to the automated elaboration of human language, including both algorithms that take human-produced text as input and algorithms that produce natural-looking text as outputs. NLP is widely used to optimize many aspects of the software development process. Since natural language artifacts are used and reused during the software development life-cycle, the availability of natural language-based approaches and tools has led to improvements in the software process and product efficiency. Indeed, NLP approaches (including LLMs) have proven useful for retrieving key information from a wide range of structured or unstructured sources. Besides, they show promise for the automated generation of fine-grained source code documentation to ease program comprehension and maintenance activities. Literature has shown that many software engineering (SE)-related tasks …
- Message from the Industry Track Co-Chairs. Maliheh Izadi and Laura Moreno. 2024
The MSR 2024 Industry Track is the venue to present and learn about the opportunities, challenges, and cutting-edge technology related to using data from software repositories in practice. For a long time, academic researchers in software engineering have been looking to learn and collaborate with practitioners. Our goal for the Industry Track is to be the space for a productive dialogue between software engineering researchers and practitioners, particularly those building tools for other software professionals. In this third edition of the Industry Track, we were looking to minimize the known barriers to industry participation in software engineering research conferences. We invited practitioners to submit an abstract, maximum one page long (plus up to one page of references), outlining a talk or a poster presentation.
- The potential of an adaptive computerized dynamic assessment tutor in diagnosing and assessing learners’ listening comprehension. Mehri Izadi, Maliheh Izadi, and Farrokhlagha Heidari. Education and Information Technologies, 2024
In today’s environment of growing class sizes due to the prevalence of online and e-learning systems, providing one-to-one instruction and feedback has become a challenging task for teachers. However, the dialectical integration of instruction and assessment into a seamless and dynamic activity can provide a continuous flow of assessment information for teachers to boost and individualize learning. In this regard, adaptive learning technology is one way to facilitate teacher-supported learning and personalize curriculum and learning experiences. This study aimed to investigate the potential of an adaptive Computerized Dynamic Assessment (C-DA) tool applicable as a language diagnostician and assistant. The study tried to get insight into 75 Iranian EFL learners’ listening development by focusing on the learning potential exhibited through learners’ assessment and the degree of internalization of mediation. To …
- The Impact of Generative AI on Creativity in Software Development: A Research Agenda. Victoria Jackson, Bogdan Vasilescu, Daniel Russo, Paul Ralph, and 6 more authors. 2024
As GenAI becomes embedded in developer toolchains and practices, and routine code is increasingly generated, human creativity will be increasingly important for generating competitive advantage. This paper uses the McLuhan tetrad alongside scenarios of how GenAI may disrupt software development more broadly, to identify potential impacts GenAI may have on creativity within software development. The impacts are discussed along with a future research agenda comprising five connected themes that consider how individual capabilities, team capabilities, the product, unintended consequences, and society can be affected.
2023
- On the impact of language selection for training and evaluating programming language models. Jonathan Katzy, Maliheh Izadi, and Arie Van Deursen. In 2023 IEEE 23rd International Working Conference on Source Code Analysis and Manipulation (SCAM), 2023
The recent advancements in Transformer-based Language Models have demonstrated significant potential in enhancing the multilingual capabilities of these models. The remarkable progress made in this domain not only applies to natural language tasks but also extends to the domain of programming languages. Despite the ability of these models to learn from multiple languages, evaluations typically focus on particular combinations of the same languages. In this study, we evaluate the similarity of programming languages by analyzing their representations using a CodeBERT-based model. Our experiments reveal that token representations in languages such as C++, Python, and Java exhibit proximity to one another, whereas the same tokens in languages such as Mathematica and R display significant dissimilarity. Our findings suggest that this phenomenon can potentially result in performance challenges …
- Enriching source code with contextual data for code completion models: An empirical study. Tim Dam, Maliheh Izadi, and Arie Deursen. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), 2023
Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks including automatic code completion which is a staple in a developer’s toolkit. While many have striven to improve the code-understanding abilities of such models, the opposite – making the code easier to understand – has not been properly investigated. In this study, we aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion. We consider type annotations and comments as two common forms of additional contextual information that often help developers understand code better. For the experiments, we study code completion in two granularity levels; token and line completion and take three recent and large-scale language models for source code: UniXcoder …
- Extending source code pre-trained language models to summarise decompiled binaries. Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, and 2 more authors. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 2023
Binary reverse engineering is used to understand and analyse programs for which the source code is unavailable. Decompilers can help, transforming opaque binaries into a more readable source code-like representation. Still, reverse engineering is difficult and costly, involving considerable effort in labelling code with helpful summaries. While the automated summarisation of decompiled code can help reverse engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components: the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function …
- Semantically-enhanced topic recommendation systems for software projects. Maliheh Izadi, Mahtab Nejati, and Abbas Heydarnoori. Empirical Software Engineering, 2023
Software-related platforms such as GitHub and Stack Overflow have enabled their users to collaboratively label software entities with a form of metadata called topics. Tagging software repositories with relevant topics can be exploited for facilitating various downstream tasks. For instance, a correct and complete set of topics assigned to a repository can increase its visibility. Consequently, this improves the outcome of tasks such as browsing, searching, navigation, and organization of repositories. Unfortunately, assigned topics are usually highly noisy, and some repositories do not have well-assigned topics. Thus, there have been efforts on recommending topics for software projects; however, the semantic relationships among these topics have not been exploited so far. In this work, we propose two recommender models for tagging software projects that incorporate the semantic relationship among topics. Our …
- STACC: Code comment classification using SentenceTransformers. Ali Al-Kaswan, Maliheh Izadi, and Arie Van Deursen. In 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), 2023
Code comments are a key resource for information about software artefacts. Depending on the use case, only some types of comments are useful. Thus, automatic approaches to classify these comments have been proposed. In this work, we address this need by proposing STACC, a set of SentenceTransformers-based binary classifiers. These lightweight classifiers are trained and tested on the NLBSE Code Comment Classification tool competition dataset, and surpass the baseline by a significant margin, achieving an average F1 score of 0.74 against the baseline of 0.31, which is an improvement of 139%. A replication package, as well as the models themselves, are publicly available.
- Targeted Attack on GPT-Neo for the SATML Language Model Data Extraction Challenge [PRESENTATION]. Ali Al-Kaswan, Maliheh Izadi, and Arie Deursen. In 1st IEEE Conference on Secure and Trustworthy Machine Learning, 2023
Previous work has shown that Large Language Models are susceptible to so-called data extraction attacks. This allows an attacker to extract a sample that was contained in the training data, which has massive privacy implications. The construction of data extraction attacks is challenging, current attacks are quite inefficient, and there exists a significant gap in the extraction capabilities of untargeted attacks and memorization. Thus, targeted attacks are proposed, which identify whether a given sample from the training data is extractable from a model. In this work, we apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge. We apply a two-step approach. In the first step, we maximise the recall of the model and are able to extract the suffix for 69% of the samples. In the second step, we use a classifier-based Membership Inference Attack on the generations. Our AutoSklearn classifier achieves a precision of 0.841. The full approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
- The (ab)use of open source code to train large language models. Ali Al-Kaswan and Maliheh Izadi. In 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE), 2023
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
2022
- CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences. Maliheh Izadi, Roberta Gismondi, and Georgios Gousios. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), 2022
Code completion is an essential feature of IDEs, yet current auto-completers are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer’s code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train …
- Predicting the objective and priority of issue reports in software repositories. Maliheh Izadi, Kiana Akbari, and Abbas Heydarnoori. Empirical Software Engineering, 2022
Software repositories such as GitHub host a large number of software entities. Developers collaboratively discuss, implement, use, and share these entities. Proper documentation plays an important role in successful software management and maintenance. Users exploit Issue Tracking Systems, a facility of software repositories, to keep track of issue reports, to manage the workload and processes, and finally, to document the highlight of their team’s effort. An issue report is a rich source of collaboratively-curated software knowledge, and can contain a reported problem, a request for new features, or merely a question about the software product. As the number of these issues increases, it becomes harder to manage them manually. GitHub provides labels for tagging issues, as a means of issue management. However, about half of the issues in GitHub’s top 1000 repositories do not have any labels. In this work, we aim at …
- An empirical study on data leakage and generalizability of link prediction models for issues and commits. Maliheh Izadi, Pooya Rostami Mazrae, Tom Mens, and Arie Deursen. arXiv preprint arXiv:2211.00381, 2022
To enhance documentation and maintenance practices, developers conventionally establish links between related software artifacts manually. Empirical research has revealed that developers frequently overlook this practice, resulting in significant information loss. To address this issue, automatic link recovery techniques have been proposed. However, these approaches primarily focused on improving prediction accuracy on randomly-split datasets, with limited attention given to the impact of data leakage and the generalizability of the predictive models. LinkFormer seeks to address these limitations. Our approach not only preserves and improves the accuracy of existing predictions but also enhances their alignment with real-world settings and their generalizability. First, to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on both textual and metadata information of issues and commits. Next, to gauge the effect of time on model performance, we employ two splitting policies during both the training and testing phases: randomly- and temporally-split datasets. Finally, in pursuit of a generic model that can demonstrate high performance across a range of projects, we undertake additional fine-tuning of LinkFormer within two distinct transfer-learning settings. Our findings support that to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data when training models. Furthermore, the results demonstrate that LinkFormer outperforms existing methodologies by a significant margin, achieving a 48% improvement in F1-measure within a …
- CatIss: An Intelligent Tool for Categorizing Issues Reports using Transformers. Maliheh Izadi. In 2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), 2022
Users use Issue Tracking Systems to keep track and manage issue reports in their repositories. An issue is a rich source of software information that contains different reports including a problem, a request for new features, or merely a question about the software product. As the number of these issues increases, it becomes harder to manage them manually. Thus, automatic approaches are proposed to help facilitate the management of issue reports. This paper describes CatIss, an automatic Categorizer of Issue reports which is built upon the Transformer-based pre-trained RoBERTa model. CatIss classifies issue reports into three main categories of Bug report, Enhancement/feature request, and Question. First, the datasets provided for the NLBSE tool competition are cleaned and preprocessed. Then, the pre-trained RoBERTa model is fine-tuned on the preprocessed dataset. Evaluating CatIss on about 80 …
- On the Evaluation of NLP-based Models for Software Engineering. Maliheh Izadi and Matin Nili Ahmadabadi. In 2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), 2022
NLP-based models have been increasingly incorporated to address SE problems. These models are either employed in the SE domain with little to no change, or they are greatly tailored to source code and its unique characteristics. Many of these approaches are considered to be outperforming or complementing existing solutions. However, an important question arises here: Are these models evaluated fairly and consistently in the SE community? To answer this question, we reviewed how NLP-based models for SE problems are being evaluated by researchers. The findings indicate that currently there is no consistent and widely-accepted protocol for the evaluation of these models. While different aspects of the same task are being assessed in different studies, metrics are defined based on custom choices, rather than a system, and finally, answers are collected and interpreted case by case. Consequently, there …
2021
- Automated recovery of issue-commit links leveraging both textual and non-textual data. Pooya Rostami Mazrae, Maliheh Izadi, and Abbas Heydarnoori. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2021
An issue report documents the discussions around required changes in issue-tracking systems, while a commit contains the change itself in the version control systems. Recovering links between issues and commits can facilitate many software evolution tasks such as bug localization, defect prediction, software quality measurement, and software documentation. A previous study on over half a million issues from GitHub reports only about 42.2% of issues are manually linked by developers to their pertinent commits. Automating the linking of commit-issue pairs can contribute to the improvement of the said tasks. Thus far, current state-of-the-art approaches for automated commit-issue linking suffer from low precision, leading to unreliable results, sometimes to the point of imposing human supervision on the predicted links. The low performance gets even more severe when there is a lack of textual information in either …
- Topic recommendation for software repositories using multi-label classification algorithms. Maliheh Izadi, Abbas Heydarnoori, and Georgios Gousios. Empirical Software Engineering, 2021
Many platforms exploit collaborative tagging to provide their users with faster and more accurate results while searching or navigating. Tags can communicate different concepts such as the main features, technologies, functionality, and the goal of a software repository. Recently, GitHub has enabled users to annotate repositories with topic tags. It has also provided a set of featured topics, and their possible aliases, carefully curated with the help of the community. This creates the opportunity to use this initial seed of topics to automatically annotate all remaining repositories, by training models that recommend high-quality topic tags to developers. In this work, we study the application of multi-label classification techniques to predict software repositories’ topics. First, we map the large-space of user-defined topics to those featured by GitHub. The core idea is to derive more information from projects’ available …
2020
- Generating summaries for methods of event-driven programs: An Android case study. Alireza Aghamohammadi, Maliheh Izadi, and Abbas Heydarnoori. Journal of Systems and Software, 2020
The lack of proper documentation makes program comprehension a cumbersome process for developers. Source code summarization is one of the existing solutions to this problem. Many approaches have been proposed to summarize source code in recent years. A prevalent weakness of these solutions is that they do not pay much attention to interactions among elements of software. An element is simply a callable code snippet such as a method or even a clickable button. As a result, these approaches cannot be applied to event-driven programs, such as Android applications, because they have specific features such as numerous interactions between their elements. To tackle this problem, we propose a novel approach based on deep neural networks and dynamic call graphs to generate summaries for methods of event-driven programs. First, we collect a set of comment/code pairs from Github and train a deep …
- Improving quality of a post’s set of answers in Stack Overflow. Mohammadreza Tavakoli, Maliheh Izadi, and Abbas Heydarnoori. In 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2020
Community Question Answering platforms such as Stack Overflow help a wide range of users solve their challenges online. As the popularity of these communities has grown over the years, both the number of members and posts have escalated. Also, due to the diverse backgrounds, skills, expertise, and viewpoints of users, each question may obtain more than one answer. Therefore, the focus has changed toward producing posts that have a set of answers more valuable for the community as a whole, not just one accepted answer aimed at satisfying only the question-asker. As in every large community, a considerable number of low-quality posts on Stack Overflow require improvement. We call these posts "deficient", and define them as posts with questions that either have no answer yet or can be improved by other ones. In this paper, we propose an approach to automate the identification process of such posts …
2018
- Evaluating collaborative filtering recommender algorithms: a survey. Mahdi Jalili, Sajad Ahmadian, Maliheh Izadi, Parham Moradi, and 1 more author. IEEE Access, 2018
Due to the explosion of available information on the Internet, the need for effective means of accessing and processing them has become vital for everyone. Recommender systems have been developed to help users to find what they may be interested in and business owners to sell their products more efficiently. They have found much attention in both academia and industry. A recommender algorithm takes into account user–item interactions, i.e., rating (or purchase) history of users on items, and their contextual information, if available. It then provides a list of potential items for each target user, such that the user is likely to positively rate (or purchase) them. In this paper, we review evaluation metrics used to assess performance of recommendation algorithms. We also survey a number of classical and modern recommendation algorithms and compare their performance in terms of different evaluation metrics on …
2017
- The Intonation Patterns of English and Persian Sentences: A Contrastive Study. Mehri Izadi, Malihe Izadi, and Behnam Azarsa. Research Journal of Education, 2017
Differences in intonation patterns are among the factors affecting the learning of L2 pronunciation. The contrastive analysis of English-Persian intonation patterns has shown that both languages are similar in sentence-final intonation while they differ in incomplete sentences. Accordingly, this paper describes English-Persian intonation patterns to examine the differences and similarities between the two languages and improve the effectiveness of L2 learning.
2015
- Recommender systems for social networks analysis and mining: precision versus diversity. Amin Javari, Malihe Izadi, and Mahdi Jalili. 2015
Recommender systems have become increasingly important in online communities for providing personalized services and products to users. Traditionally, the performance of recommender algorithms has been evaluated based on accuracy, and the focus of the research was on providing accurate recommendation lists. However, recently diversity and novelty of recommendation lists have been introduced as key issues in designing recommender systems. In general, novelty/diversity and accuracy do not go hand in hand. Therefore, designing models that address the novelty/diversity-accuracy dilemma is one of the challenging problems in the context of practical recommender systems. In this paper, we first introduce the diversity-accuracy challenge in recommender systems, and then present two recommendation algorithms which approach the problem from two perspectives. The first model is a filtering algorithm to …
2014
- Unifying inconsistent evaluation metrics in recommender systems. Maliheh Izadi, Amin Javari, and Mahdi Jalili. Proceedings of the RecSys Conference, REDD Workshop, 2014
Recommender systems are among the most popular tools used by online communities these days. Traditionally, recommender techniques were evaluated using accuracy-based metrics such as precision; however, gradually the need for other qualities, including more novel and diverse items, emerged. Consequently, researchers started to evaluate their findings with different and often inconsistent metrics, making it nearly impossible to compare the existing approaches properly. It is clear that we need a more unified approach to assess the results of new techniques, and to the best of our knowledge, this problem has not been addressed in previous studies. In this paper, we proposed a novel and extensible framework for the evaluation of recommender systems using maximum bounds of possible measures in different datasets. Finally, we provided the results of applying this framework on a set of different recommender algorithms.