84 results found
Using Intelligent Screening Service Platform (ISSP) to improve the screening process of clinical trial subjects during COVID-19 pandemic: an experimental study
Li B., Guo R., Zhou H., Liu Y., Zhang X., Zhang Q.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
Abstract Background: During the COVID-19 pandemic, clinical trial recruitment could not be carried out due to travel restrictions, transmission risks and other factors, resulting in the stagnation of a large number of ongoing or upcoming clinical trials. Objective: An intelligent screening app was developed using artificial intelligence technology to rapidly pre-screen potential patients for phase I solid tumor drug clinical trials. Methods: A total of 429 screening process records were collected from 27 phase I solid tumor drug clinical trials at the First Affiliated Hospital of Bengbu Medical College from April 2018 to May 2021. Features of the experimental data were analyzed, and collinearity (principal component analysis) and strong correlation (χ2 test) among features were eliminated. XGBoost, Random Forest, and Naive Bayes were used to rank features by weight importance. Finally, pre-screening models were constructed using classification machine learning algorithms, and the optimal model was selected. Results: Among the 429 screening records, 33 were generated by repeated subject participation in different clinical trials; of the remaining 396 screening records, 246 (62.12%) were screened successfully. The gold standard for screening success was the final judgment made by the principal investigator (PI) based on the clinical trial protocol. A Venn diagram was used to identify the intersection of important features across the machine learning algorithms. After intersecting the top 15 characteristic variables of the different feature screening models, 9 common variables were obtained: age, sex, distance from residence to the central institution, tumor histology, tumor stage, tumorectomy, the interval from diagnosis or surgery to screening, chemotherapy, and ECOG (Eastern Cooperative Oncology Group) score.
To select the optimal subset, the 9 important feature variables were expanded to 12- and 15-feature subsets, and the performance of the different feature subsets was validated under different machine learning models. The results showed that optimal performance, accuracy and practicability were achieved using XGBoost with the 12-feature subset. The final model accurately predicted screening success in both internal (AUC = 0.895) and external (AUC = 0.796) validation, and has been transformed into a convenient tool to facilitate its application in clinical settings. Subjects whose predicted probability met or exceeded the model's threshold were more likely to be successfully screened. Conclusion: Based on the optimal model, we created an online prediction calculator and visualization app, ISSP (Intelligent Screening Service Platform), which can rapidly screen patients for phase I solid tumor drug clinical trials. ISSP effectively overcomes constraints of space and time: on the mobile terminal, it matches clinical trial projects with patients and completes the rapid screening of clinical trial subjects, so as to recruit more of them. As an auxiliary tool, ISSP optimizes the screening process of clinical trials and provides more convenient services for clinical investigators and patients.
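The feature-intersection step described in the abstract can be sketched directly: each model ranks features by importance, and the common variables are the intersection of the models' top-k lists. The feature names and importance scores below are illustrative stand-ins, not the study's actual data.

```python
# Sketch of intersecting top-k feature rankings from several models.
# All scores here are hypothetical, not the study's fitted importances.

def top_k(importances, k):
    """Return the k feature names with the highest importance scores."""
    return {f for f, _ in sorted(importances.items(), key=lambda x: -x[1])[:k]}

# Hypothetical importance scores from three fitted models.
xgboost_imp = {"age": 0.9, "sex": 0.7, "tumor_stage": 0.8, "distance": 0.6, "ecog": 0.5}
forest_imp  = {"age": 0.8, "sex": 0.6, "tumor_stage": 0.9, "chemo": 0.7, "ecog": 0.4}
bayes_imp   = {"age": 0.7, "tumor_stage": 0.6, "distance": 0.5, "ecog": 0.8, "sex": 0.3}

common = top_k(xgboost_imp, 4) & top_k(forest_imp, 4) & top_k(bayes_imp, 4)
print(sorted(common))  # features ranked highly by all three models
```

In the paper the same idea is applied with k = 15 across XGBoost, Random Forest, and Naive Bayes, yielding the 9 common variables listed above.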
Sustainable Connectivity in a Community Repository
Habermann T.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
Abstract Persistent identifiers for research objects, researchers, organizations, and funders are the key to creating unambiguous and persistent connections across the global research infrastructure (GRI). Many repositories are implementing mechanisms to collect and integrate these identifiers into their submission and record curation processes. This bodes well for a well-connected future, but metadata for existing resources submitted in the past are missing these identifiers, thus missing the connections required for inclusion in the connected infrastructure. Re-curation of these metadata is required to make these connections. This paper introduces the global research infrastructure and demonstrates how repositories, and their user communities, can contribute to and benefit from connections to the global research infrastructure. The Dryad Data Repository has existed since 2008 and has successfully re-curated the repository metadata several times, adding identifiers for research organizations, funders, and researchers. Understanding and quantifying these successes depends on measuring repository and identifier connectivity. Metrics are described and applied to the entire repository here. Identifiers (Digital Object Identifiers, DOIs) for papers connected to datasets in Dryad have long been a critical part of the Dryad metadata creation and curation processes. Since 2019, the portion of datasets with connected papers has decreased from 100% to less than 40%. This decrease has significant ramifications for the re-curation efforts described above as connected papers have been an important source of metadata. In addition, missing connections to papers make understanding and re-using datasets more difficult. Connections between datasets and papers can be difficult to make because of time lags between submission and publication, lack of clear mechanisms for citing datasets and other research objects from papers, changing focus of researchers, and other obstacles. 
The Dryad community of members, i.e., users, research institutions, publishers, and funders, has vested interests in identifying these connections and critical roles in the curation and re-curation efforts. Their engagement will be critical in building on the successes Dryad has already achieved and ensuring sustainable connectivity in the future.
LLaMA-LoRA Neural Prompt Engineering: A Deep Tuning Framework for Automatically Generating Chinese Text Logical Reasoning Thinking Chains
Chen S., Wang W., Chen X., Lu P., Yang Z., Du Y.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
Abstract The expansion of Chinese natural language processing (NLP) has stimulated research in the broader NLP domain. However, existing large language models have limitations in comprehending and reasoning in Chinese. This paper addresses these limitations by enhancing Chinese language models' comprehension and reasoning capabilities while minimizing resource requirements. We propose LLaMA-LoRA, a neural prompt engineering framework that builds upon the LLaMA-13B model and incorporates the Low-Rank Adaptation (LoRA) technique for refinement. Chain-of-Thought (CoT) prompts are crucial for generating intermediate reasoning chains in language models, but their effectiveness can be limited by isolated language patterns. Erroneous reasoning resulting from conventional prompts negatively impacts model performance. Automatic prompts are introduced to encourage reasoning chain generation and accurate answer inference. Training the model with an extensive corpus of Chinese CoT data enhances its comprehension and reasoning abilities. The LLaMA-LoRA model demonstrates exceptional performance across numerous Chinese language tasks, surpassing benchmarks achieved by related language models such as GPT-3.5, ChatGLM, and OpenAssistant, delivering accurate, comprehensive, and professional answers. The availability of our open-source model code facilitates further research in the field of Chinese text logical reasoning thinking chains.
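The LoRA technique the paper incorporates can be sketched in miniature: instead of updating a full weight matrix W, one trains two small matrices B (d×r) and A (r×d) and uses W + (alpha/r)·BA at inference. The matrices below are tiny illustrative examples, not LLaMA weights, and the pure-Python matmul stands in for a tensor library.

```python
# Minimal sketch of the Low-Rank Adaptation (LoRA) update; all values are toy.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, A, B, alpha):
    """Merge a rank-r LoRA update into the frozen base weights W."""
    r = len(A)                      # rank of the adaptation
    delta = matmul(B, A)            # low-rank update B @ A
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]        # frozen base weights (2 x 2)
B = [[1.0], [0.0]]                  # d x r, with r = 1
A = [[0.0, 2.0]]                    # r x d
print(lora_merge(W, A, B, alpha=1.0))  # W plus a rank-1 update
```

Only B and A (2d·r values instead of d²) need training, which is why LoRA keeps the framework's resource requirements low.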
Public Opinions on ChatGPT: An Analysis of Reddit Discussions by Using Sentiment Analysis, Topic Modeling, and SWOT Analysis
Naing S.Z., Udomwong P.
Q1
MIT Press
Data Intelligence, 2024, citations: 3, Open access, doi.org
ABSTRACT The sudden arrival of AI (Artificial Intelligence) into people's daily lives around the world was marked by the introduction of ChatGPT, officially released on November 30, 2022. This AI invasion of our lives drew the attention not only of tech enthusiasts but also of scholars from diverse fields, as its capacity extends across various domains. Consequently, numerous articles and journals have discussed ChatGPT, making it a headline topic. However, this coverage does not reflect most public opinion about the product. Therefore, this paper investigated the public's opinions on ChatGPT through topic modelling, VADER-based sentiment analysis and SWOT analysis. To gather data for this study, 202,905 comments were collected from the Reddit platform between December 2022 and December 2023. The findings reveal that the Reddit community engaged in discussions related to ChatGPT covering a range of topics, including comparisons with traditional search engines; the impacts on software development, the job market, and the education industry; ChatGPT's responses on entertainment and politics; the responses from DAN, the alter ego of ChatGPT; the ethical usage of user data; and queries related to AI-generated images. The sentiment analysis indicates that most people hold positive views of this innovative technology across these aspects. However, concerns also arise regarding the potential negative impacts associated with the product. The SWOT analysis of these results highlights the strengths and pain points, market opportunities and threats associated with ChatGPT. This analysis also serves as a foundation for the recommendations on product development and policy implementation offered in this paper.
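The sentiment step can be sketched as follows: VADER produces a "compound" score in [-1, 1] per comment, which is conventionally bucketed with a ±0.05 threshold. The scores below are illustrative stand-ins for the output of NLTK's SentimentIntensityAnalyzer, so the library itself is not required to run the sketch.

```python
# Bucketing VADER-style compound scores into sentiment labels.
from collections import Counter

def label(compound, threshold=0.05):
    """Map a VADER compound score to a sentiment label (±0.05 convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores for a handful of Reddit comments.
scores = [0.72, -0.41, 0.03, 0.55, -0.02, 0.61]
counts = Counter(label(s) for s in scores)
print(counts.most_common())  # distribution of sentiment labels
```

Aggregating these labels per topic is what yields the per-aspect sentiment distribution the paper reports.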
Resampling approaches for the quantitative analysis of spatially distributed cells
Bertolazzi G., Tumminello M., Morello G., Belmonte B., Tripodo C.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT Image segmentation is a crucial step in various image analysis pipelines and constitutes one of the cutting-edge areas of digital pathology. The advent of quantitative analysis has enabled the evaluation of millions of individual cells in tissues, allowing for the combined assessment of morphological features, biomarker expression, and spatial context. The recorded cells can be described as a point pattern process. However, the classical statistical approaches to point pattern processes prove unreliable in this context due to the presence of multiple irregularly-shaped interstitial cell-devoid spaces in the domain, which correspond to anatomical features (e.g. vessels, lipid vacuoles, glandular lumina) or tissue artefacts (e.g. tissue fractures), and whose coordinates are unknown. These interstitial spaces impede the accurate calculation of the domain area, resulting in biased clustering measurements. Moreover, the mistaken inclusion of empty regions of the domain can directly impact the results of hypothesis testing. The literature currently lacks a bias-correction method that addresses interstitial cell-devoid spaces. To address this gap, we propose novel resampling methods for testing spatial randomness and evaluating relationships among different cell populations. Our methods obviate the need for domain area estimation and provide unbiased clustering measurements. We created the SpaceR software (https://github.com/GBertolazzi/SpaceR) to enhance the accessibility of our methodologies.
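The resampling idea can be sketched with a random-labelling test, one standard way to compare two cell populations without ever estimating the domain area: pool the observed coordinates, re-assign the type labels many times, and compare the observed statistic to the re-labelled null distribution. This is a generic sketch of the principle, not the paper's specific SpaceR procedure, and the coordinates are toy values.

```python
# Random-labelling resampling test for the relationship between two cell types.
import math, random

def mean_nn_dist(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def random_label_test(type_a, type_b, n_resamples=200, seed=0):
    rng = random.Random(seed)
    pooled, n_a = type_a + type_b, len(type_a)
    observed = mean_nn_dist(type_a, type_b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                     # re-assign type labels
        if mean_nn_dist(pooled[:n_a], pooled[n_a:]) <= observed:
            hits += 1
    return observed, hits / n_resamples         # Monte Carlo tail fraction

rng = random.Random(1)
a = [(rng.random(), rng.random()) for _ in range(20)]  # toy type-A cells
b = [(rng.random(), rng.random()) for _ in range(20)]  # toy type-B cells
obs, frac = random_label_test(a, b)
print(round(obs, 3), round(frac, 2))
```

Because the null distribution is built only from observed cell positions, cell-devoid spaces never enter the calculation, which is the bias the paper targets.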
Training Generative Adversarial Networks with Adaptive Composite Gradient
Qi H., Li F., Tan S., Zhang X.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT The wide applications of Generative Adversarial Networks benefit from successful training methods, which guarantee that an objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method is still a challenging task due to the cyclic behaviors of some gradient-based methods and the expensive computational cost of acquiring the Hessian matrix. To address this problem, we propose the Adaptive Composite Gradients (ACG) method, which is linearly convergent in bilinear games under suitable settings. Theoretical analysis and toy-function experiments both suggest that our approach alleviates cyclic behaviors and converges faster than recently proposed state-of-the-art algorithms. The convergence speed of ACG is improved by 33% compared with other methods. ACG is a novel semi-gradient-free algorithm that reduces the computational cost of gradients and Hessians by utilizing predictive information from future iterations. Mixture-of-Gaussians experiments and real-world digital image generation experiments show that ACG outperforms several existing technologies, illustrating the superiority and efficacy of our method.
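The cyclic behaviour that motivates ACG can be reproduced on the toy bilinear game min_x max_y x·y: simultaneous gradient descent-ascent spirals away from the equilibrium (0, 0), while a lookahead method converges. The extragradient method shown here is a standard remedy used as a stand-in, since the abstract does not give ACG's update rule; step size and iteration count are arbitrary.

```python
# Cycling vs. convergence on the bilinear game f(x, y) = x * y.

def gda(x, y, lr, steps):
    """Simultaneous gradient descent-ascent: known to cycle/diverge here."""
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x
    return x, y

def extragradient(x, y, lr, steps):
    """Extragradient: take a half-step, then update with lookahead gradients."""
    for _ in range(steps):
        xh, yh = x - lr * y, y + lr * x    # lookahead point
        x, y = x - lr * yh, y + lr * xh    # update using lookahead gradients
    return x, y

norm = lambda p: (p[0] ** 2 + p[1] ** 2) ** 0.5
start = (1.0, 1.0)
print(norm(gda(*start, 0.1, 100)))            # grows: spiral away from (0, 0)
print(norm(extragradient(*start, 0.1, 100)))  # shrinks toward (0, 0)
```

Per step, GDA multiplies the distance to the equilibrium by √(1 + lr²) > 1, while extragradient multiplies it by √(1 − lr² + lr⁴) < 1, which is the qualitative gap ACG-style methods aim to close cheaply.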
Applying a Context-based Method to Build a Knowledge Graph for the Blue Amazon
Ligabue P.D., Brandão A.A., Peres S.M., Cozman F.G., Pirozelli P.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT Knowledge graphs are employed in several tasks, such as question answering and recommendation systems, due to their ability to represent relationships between concepts. Automatically constructing such graphs, however, remains an unresolved challenge within knowledge representation. To tackle this challenge, we propose CtxKG, a method specifically aimed at extracting knowledge graphs in a context of limited resources in which the only input is a set of unstructured text documents. CtxKG is based on OpenIE (a relationship triple extraction method) and BERT (a language model) and contains four stages: the extraction of relationship triples directly from text; the identification of synonyms across triples; the merging of similar entities; and the building of bridges between knowledge graphs of different documents. Our method distinguishes itself from those in the current literature (i) through its use of the parse tree to avoid the overlapping entities produced by base implementations of OpenIE; and (ii) through its bridges, which create a connected network of graphs, overcoming a limitation similar methods have of one isolated graph per document. We compare our method to two others by generating graphs for movie articles from Wikipedia and contrasting them with benchmark graphs built from the OMDb movie database. Our results suggest that our method is able to improve multiple aspects of knowledge graph construction. They also highlight the critical role that triple identification and named-entity recognition have in improving the quality of automatically generated graphs, suggesting future paths for investigation. Finally, we apply CtxKG to build BlabKG, a knowledge graph for the Blue Amazon, and discuss possible improvements.
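The entity-merging stage can be sketched as collapsing near-duplicate entity strings onto one canonical form. CtxKG does this with BERT embeddings; plain string similarity via difflib stands in here so the example stays self-contained, and the triples are invented for illustration.

```python
# Merging similar entities across extracted (subject, predicate, object) triples.
from difflib import SequenceMatcher

def canonical(name, seen, threshold=0.8):
    """Return an already-seen entity if similar enough, else register name."""
    for s in seen:
        if SequenceMatcher(None, name.lower(), s.lower()).ratio() >= threshold:
            return s
    seen.append(name)
    return name

triples = [  # toy triples, as if produced by an OpenIE-style extractor
    ("Blue Amazon", "is_part_of", "Atlantic Ocean"),
    ("The Blue Amazon", "contains", "pre-salt reserves"),
    ("Santos Basin", "located_in", "Blue Amazon"),
]
seen, merged = [], []
for s, p, o in triples:
    merged.append((canonical(s, seen), p, canonical(o, seen)))
print(merged)  # "The Blue Amazon" collapses onto "Blue Amazon"
```

After merging, triples from different documents that share a canonical entity become candidates for the bridge-building stage.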
Price Mechanism, Government Constraints and Carbon Trading Pilot Policy for Emission Reduction
Wei H., Haili X., Qin Z., Xiao L., Haoguang L.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT Based on data from 247 prefecture-level cities in China from 2007 to 2019, this paper analyzes the impact of the carbon emissions trading (CET) pilot policy on carbon emission reduction from the perspective of the price mechanism and government constraints. The results show that carbon emissions and carbon intensity in the pilot areas are significantly reduced through adjustment of the industrial structure and promotion of green technology innovation. Across regions, the emission reduction effect of the pilot policy is markedly weaker in regions with a high proportion of industry than elsewhere. The CET policy in China achieves carbon emission reduction by coordinating the carbon trading price, which cannot fulfill this aim on its own, with the degree of government punishment of enterprises.
Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change
Pirozelli P., José M.M., Silveira I., Nakasato F., Peres S.M., Brandão A.A., Costa A.H., Cozman F.G.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT Pirá is a reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change, built from a collection of scientific abstracts and reports on these topics. This dataset represents a versatile language resource, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge. Despite its potential, a detailed set of baselines has not yet been developed for Pirá. By creating these baselines, researchers can more easily utilize Pirá as a resource for testing machine learning models across a wide range of question answering tasks. In this paper, we define six benchmarks over the Pirá dataset, covering closed generative question answering, machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have also produced a curated version of the original dataset, where we fixed a number of grammar issues, repetitions, and other shortcomings. Furthermore, the dataset has been extended in several new directions to support the aforementioned benchmarks: translation of supporting texts from English into Portuguese, classification labels for answerability, automatic paraphrases of questions and answers, and multiple choice candidates. The results described in this paper provide several points of reference for researchers interested in exploring the challenges provided by the Pirá dataset.
The stance and factors of international organizations towards countries from a Chinese perspective
Qin Z., Haili X., Yaotian W., Yao W., Ziqin Z., Qinghua Q., Haoguang L.
Q1
MIT Press
Data Intelligence, 2024, citations: 0, Open access, doi.org
ABSTRACT The degree and scope of constraints imposed by International Organizations (IOs) on States are increasing, and identifying the factors affecting the IOs' stances on States is helpful for enhancing a state's discourse power and influence in the international community. First, by coding the records of Regular Press Conferences of the Speaking Office of the Chinese Ministry of Foreign Affairs during the period 2018–2022, we obtained a dataset of IOs' stances on China-related events. Second, we constructed political relation, economic relation, and humanistic relation indicators to complement the influence factors, adopted the Bayesian logit model, and applied the Markov chain Monte Carlo algorithm with Gibbs sampling to analyze the probability of IOs' positive stances towards China. The result shows that IOs' category, length of establishment, functional position, and relationship with China are all related to their tendency to make statements about China. In terms of the heterogeneity of event types, forum-type IOs are significantly more inclined than service-type IOs to give positive assessments of events focusing on China's own development. Further analysis reveals that the model for analyzing and predicting the attitudes of IOs is more effective when the international situation is in a stable period.
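The Bayesian logit step can be sketched with a random-walk Metropolis sampler; the paper uses Gibbs sampling, so Metropolis serves only as a simpler MCMC stand-in. One coefficient, a flat prior, and toy binary data: we estimate the posterior association between a single relation indicator and a positive stance.

```python
# Random-walk Metropolis MCMC for a one-coefficient Bayesian logit (toy data).
import math, random

def log_lik(beta, xs, ys):
    """Bernoulli-logit log-likelihood for a single coefficient."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-beta * x))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def metropolis(xs, ys, n_iter=2000, step=0.5, seed=0):
    rng = random.Random(seed)
    beta, samples = 0.0, []
    for _ in range(n_iter):
        prop = beta + rng.gauss(0.0, step)       # random-walk proposal
        if math.log(rng.random()) < log_lik(prop, xs, ys) - log_lik(beta, xs, ys):
            beta = prop                          # accept the proposal
        samples.append(beta)
    return samples

xs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]   # toy relation indicator
ys = [1, 1, 0, 0, 1, 0]                 # toy stance (1 = positive)
draws = metropolis(xs, ys)
posterior_mean = sum(draws[500:]) / len(draws[500:])
print(round(posterior_mean, 2))
```

A posterior mean above zero indicates that stronger relations raise the probability of a positive stance, mirroring the direction of the paper's analysis.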
Optimizing ASReview simulations: A generic multiprocessing solution for ‘light-data’ and ‘heavy-data’ users
Romanov S., Siqueira A.S., de Bruin J., Teijema J., Hofstee L., van de Schoot R.
Q1
MIT Press
Data Intelligence, 2024, citations: 3, Open access, Review, doi.org
ABSTRACT Active learning can be used for optimizing and speeding up the screening phase of systematic reviews. Running simulation studies mimicking the screening process can be used to test the performance of different machine-learning models or to study the impact of different training data. This paper presents an architecture design with a multiprocessing computational strategy for running many such simulation studies in parallel, using the ASReview Makita workflow generator and Kubernetes software for deployment with cloud technologies. We provide a technical explanation of the proposed cloud architecture and its usage. In addition, we conducted 1,140 simulations investigating computational time under various numbers of CPUs and RAM settings. Our analysis demonstrates the degree to which simulations can be accelerated through multiprocessing. The parallel computation strategy and architecture design developed in this paper can support future research with more optimal simulation times while ensuring the safe completion of the required processes.
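The parallel strategy can be sketched with Python's Pool API: each simulation is an independent job, so a grid of configurations maps cleanly onto a worker pool. The thread-backed multiprocessing.dummy variant (same interface) keeps the example portable; a real deployment would use multiprocessing.Pool or, as in the paper, Kubernetes-managed workers, and run_simulation is a hypothetical stand-in for an ASReview run.

```python
# Mapping independent simulation jobs onto a worker pool.
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def run_simulation(config):
    """Stand-in for one ASReview simulation; returns a fake recall metric."""
    model, seed = config
    return (model, seed, (seed * 37 % 100) / 100.0)  # deterministic dummy value

# A small grid of hypothetical simulation configurations.
configs = [(model, seed) for model in ("nb", "logistic", "svm")
           for seed in range(4)]

with Pool(processes=4) as pool:
    results = pool.map(run_simulation, configs)  # run jobs concurrently

print(len(results), "simulations completed")
```

Because the jobs share no state, adding workers scales throughput until CPU or RAM saturates, which is exactly the trade-off the 1,140-simulation experiment measures.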
The Limitations and Ethical Considerations of ChatGPT
Hua S., Jin S., Jiang S.
Q1
MIT Press
Data Intelligence, 2023, citations: 19, Open access, Review, doi.org
ABSTRACT With the advancements of artificial intelligence technology, ChatGPT, a new practice of artificial intelligence, holds immense potential across multiple fields. Its user-friendly human-machine interface, rapid response capabilities, and delivery of high-quality answers have attracted considerable attention and widespread usage. Regarded by many as a groundbreaking advancement in AI, ChatGPT represents a new milestone in the field. However, as with any technological evolution, the emergence of ChatGPT brings not only benefits but also inevitable security risks and ethical issues. This paper provides specific information about ChatGPT, including its technology, limitations, ethical issues, governance paths and future directions. Specifically, we first offer a thorough exploration of the technical implementation details of the GPT series models. Next, we provide an intricate analysis elucidating the reasons for its limitations and scrutinize the consequential impacts, such as malicious misuse, privacy violation, and so on. Finally, we explore diverse governance paths to mitigate these impacts and present future directions. This review aims to equip users with crucial knowledge, facilitating well-informed decision-making, effective handling of potential challenges in employing ChatGPT, and staying abreast of the rapidly evolving landscape of this technology.
Classification and quantification of timestamp data quality issues and its impact on data quality outcome
Ambe R.
Q1
MIT Press
Data Intelligence, 2023, citations: 0, Open access, doi.org
Abstract Timestamps play a key role in process mining because they determine the chronology in which events occurred and, subsequently, how events are ordered in process modelling. Timestamps in process mining give insight into process performance, conformance, and modelling; problems with timestamps therefore result in misrepresentations of the mined process. A few articles have been published on the quantification of data quality problems, but at the time of writing only one is based on the quantification of timestamp quality problems. This article evaluates the quality of timestamps in event logs across two axes, using eleven quality dimensions and four levels of potential data quality problems. The eleven data quality dimensions were obtained through a thorough literature review of more than fifty process mining articles that focus on quality dimensions. The evaluation resulted in twelve data quality quantification metrics, which were applied to the MIMIC-III dataset as an illustration. The outcome of the timestamp quality quantification using the proposed typology enables users to appreciate the quality of an event log and thus makes it possible to evaluate the risk of carrying out specific data cleaning measures to improve the process mining outcome.
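Two timestamp-quality metrics of the kind described above can be sketched directly on a toy event log: the share of events with missing timestamps, and the share of within-case ordering violations (an event stamped earlier than its predecessor in the recorded order). The metric definitions and the log are illustrative, not the paper's exact twelve metrics.

```python
# Computing two illustrative timestamp-quality metrics on a toy event log.
from datetime import datetime

log = [  # (case id, event, timestamp or None)
    ("c1", "admit",     datetime(2024, 1, 1, 8, 0)),
    ("c1", "triage",    datetime(2024, 1, 1, 7, 45)),   # ordering violation
    ("c1", "discharge", None),                          # missing timestamp
    ("c2", "admit",     datetime(2024, 1, 2, 9, 0)),
    ("c2", "discharge", datetime(2024, 1, 2, 11, 0)),
]

missing = sum(1 for _, _, ts in log if ts is None) / len(log)

violations, comparisons = 0, 0
for case in {c for c, _, _ in log}:
    stamps = [ts for c, _, ts in log if c == case and ts is not None]
    for prev, cur in zip(stamps, stamps[1:]):   # recorded order within a case
        comparisons += 1
        violations += cur < prev

print(f"missing: {missing:.2f}, ordering violations: {violations}/{comparisons}")
```

Scaling such per-event checks across dimensions and severity levels is what yields the quantification scores applied to MIMIC-III.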
Rule Mining Trends from 1987 to 2022: A Bibliometric Analysis and Visualization
Zhou S., Bi S., Qi G.
Q1
MIT Press
Data Intelligence, 2023, citations: 0, Open access, doi.org
Abstract Rule mining has emerged as a crucial technique in data mining and knowledge discovery, enabling the extraction of valuable insights and patterns from vast datasets. This has garnered significant attention from both academic and industrial communities. However, there is a lack of bibliometric and visualization research on rule mining, leading to an unclear delineation of research topics and trends in the field. To fill this gap, this paper provides a comprehensive and up-to-date bibliometric analysis of rule mining, covering 4524 publications published between 1987 and 2022. Using various metrics and visualization techniques, we examine the patterns, trends, and evolution of rule mining. The results show a sustained growth in rule mining research, with a significant increase in publication output in recent years, and its rapid expansion into new areas such as explainable artificial intelligence and privacy protection. While the majority of publications come from Asia, the National Natural Science Foundation of China emerges as the top funding agency in the field. We also identify highly productive authors and significant members of co-authorship networks, as well as the most influential publications and citation bursts. The need for international collaboration and the integration of diverse research perspectives is highlighted. Despite the progress in rule mining, several challenges still require further research, including scalability and efficiency, explainability, network security and privacy protection, and personalized and user-centered design. Overall, this paper provides a valuable roadmap for researchers, policymakers, and practitioners interested in rule-mining research.
The W3C Data Catalog Vocabulary, Version 2: Rationale, Design Principles, and Uptake
Albertoni R., Browning D., Cox S., Gonzalez-Beltran A.N., Perego A., Winstanley P.
Q1
MIT Press
Data Intelligence, 2023, citations: 6, Open access, doi.org
Abstract DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. Since its first release in 2014 as a W3C Recommendation, DCAT has seen a wide adoption across communities and domains, particularly in conjunction with implementing the FAIR data principles (for findable, accessible, interoperable and reusable data). These implementation experiences, besides demonstrating the fitness of DCAT to meet its intended purpose, helped identify existing issues and gaps. Moreover, over the last few years, additional requirements emerged in data catalogs, given the increasing practice of documenting not only datasets but also data services and APIs. This paper illustrates the new version of DCAT, explaining the rationale behind its main revisions and extensions, based on the collected use cases and requirements, and outlines the issues yet to be addressed in future versions of DCAT.
Exploring Attentive Siamese LSTM for Low-Resource Text Plagiarism Detection
Bao W., Dong J., Xu Y., Yang Y., Qi X.
Q1
MIT Press
Data Intelligence, 2023, citations: 0, Open access, doi.org
Abstract Low-resource text plagiarism detection faces a significant challenge due to the limited availability of labeled data for training. This task requires the development of sophisticated algorithms capable of identifying similarities and differences in texts, particularly in the realm of semantic rewriting and translation-based plagiarism detection. In this paper, we present an enhanced attentive Siamese Long Short-Term Memory (LSTM) network designed for Tibetan-Chinese plagiarism detection. Our approach begins with the introduction of translation-based data augmentation, aimed at expanding the bilingual training dataset. Subsequently, we propose a pre-detection method leveraging abstract document vectors to enhance detection efficiency. Finally, we introduce an improved attentive Siamese LSTM network tailored for Tibetan-Chinese plagiarism detection. We conduct comprehensive experiments to showcase the effectiveness of our proposed plagiarism detection framework.
BIKAS: Bio-Inspired Knowledge Acquisition and Simulacrum—A Knowledge Database to Support Multifunctional Design Concept Generation
Velivela P.T., Zhao Y.F.
Q1
MIT Press
Data Intelligence, 2023, citations: 1, Open access, doi.org
Abstract A detailed acquisition, analysis, and representation of biological systems exhibiting different functions is required to develop unique bio-inspired multifunctional conceptual designs and methods. This paper presents BIKAS: Bio-inspired Knowledge Acquisition and Simulacrum, a knowledge database of biological systems exhibiting various functionalities, developed based on case-based bio-inspired examples from literature. The knowledge database represents the biological features, their characteristics, and the function exhibited by the biological feature as a combination of its integrated structure and structural strategy. Furthermore, this knowledge database is utilized by the Expandable Domain Integrated Design (xDID) model that works on classifying, mapping, and representing biological features into their respective geometric designations called Domains. The combination of features from the Domains results in the generation of multifunctional conceptual designs. In addition, Meta-level design factors are proposed to aid designers in filtering the biological features and their respective functions having a similar structural strategy, thus aiding designers in rapidly selecting and emulating biological functions.
Comparison of Parallel Genetic Algorithm and Particle Swarm Optimization for Parameter Calibration in Hydrological Simulation
Zhang X., Li Y., Chu G.
Q1
MIT Press
Data Intelligence, 2023, citations: 2, Open access, doi.org
Parameter calibration is an important part of hydrological simulation and affects the final simulation results. In this paper, we introduce a heuristic optimization algorithm, the genetic algorithm (GA), to cope with the complexity of the parameter calibration problem, and use the particle swarm optimization algorithm (PSO) as a comparison. For large-scale hydrological simulations, we use a multilevel parallel parameter calibration framework to make full use of processor resources and accelerate the solution of high-dimensional parameter calibration problems. Further, we test and apply the framework on domestic supercomputers. The parameter calibration results with GA and PSO generally reach the ideal value of 0.65 and above, with PSO achieving a speedup of 58.52 on the TianHe-2 supercomputer. The experimental results indicate that, using a parallel implementation on multicore CPUs, high-dimensional parameter calibration in large-scale hydrological simulation is feasible. Moreover, our comparison of the two algorithms shows that GA obtains better calibration results, while PSO has a more pronounced acceleration effect.
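The PSO side of the comparison can be sketched in a few lines: particles search a 2-D parameter space to minimize a toy objective standing in for a hydrological error measure. The inertia and acceleration coefficients are conventional PSO defaults, not the paper's settings, and the objective is invented for illustration.

```python
# Minimal particle swarm optimization for a 2-D parameter calibration toy.
import random

def objective(p):
    """Toy calibration error; the true optimum is at (0.3, 0.7)."""
    return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2

def pso(n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n_particles)]
    vel = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal bests
    gbest = min(pbest, key=objective)[:]           # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(2):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if objective(pos[i]) < objective(pbest[i]):
                pbest[i] = pos[i][:]
                if objective(pbest[i]) < objective(gbest):
                    gbest = pbest[i][:]
    return gbest

best = pso()
print([round(v, 2) for v in best])  # should land near (0.3, 0.7)
```

Each particle's objective evaluation is independent within an iteration, which is what makes the algorithm a natural fit for the paper's multilevel parallel framework.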
A Theoretically Grounded Question Answering Data Set for Evaluating Machine Common Sense
Santos H., Shen K., Mulvehill A.M., Kejriwal M., McGuinness D.L.
Q1
MIT Press
Data Intelligence, 2023, citations: 1, Open access, doi.org
Abstract Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmark data sets that are grounded in a theory of common sense and can be used to conduct rigorous, semantic evaluations of common sense reasoning (CSR) systems have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap towards more dependable systems with common sense. We propose a novel benchmark, called Theoretically Grounded common sense Reasoning (TG-CSR), modeled as a set of question answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot, i.e., only a few training and validation examples are provided in the public release to avoid the possibility of overfitting. Results from recent evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the common sense reasoning capabilities of neuro-symbolic systems.
ChatGPT is a Remarkable Tool—For Experts
Azaria A., Azoulay R., Reches S.
Q1
MIT Press
Data Intelligence, 2023, citations: 29,
Open access,
doi.org, Abstract
Abstract This paper investigates the capabilities of ChatGPT as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. We explore the potential of ChatGPT to enhance productivity, streamline problem-solving processes, and improve writing style. Furthermore, we highlight the potential risks associated with excessive reliance on ChatGPT in these fields. These limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyright and privacy violation. We outline areas and objectives where ChatGPT proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. In light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, ChatGPT should be used with a strategic methodology. By drawing from comprehensive experimental studies, we offer methods and flowcharts for effectively using ChatGPT. Our recommendations emphasize iterative interaction with ChatGPT and independent verification of its outputs. Considering the importance of utilizing ChatGPT judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.
Building expertise on FAIR through evolving Bring Your Own Data (BYOD) workshops: describing the data, software, and management-focused approaches and their evolution
Bernabé C.H., Thielemans L., Kaliyaperumal R., Carta C., Zhang S., van Gelder C.W., Benis N., da Silva Santos L.O., Cornet R., Vieira B.D., Lalout N., Henriques I., Ballesteros A.C., Burger K., Kersloot M.G., et al.
Q1
MIT Press
Data Intelligence, 2023, citations: 0,
Open access,
doi.org, Abstract
Abstract Since 2014, “Bring Your Own Data” workshops (BYODs) have been organised to inform people about the process and benefits of making resources Findable, Accessible, Interoperable, and Reusable (FAIR), a process known as FAIRification. The BYOD workshops’ content and format differ depending on their goal, context, and the background and needs of participants. Data-focused BYODs educate domain experts on how to make their data FAIR to find new answers to research questions. Management-focused BYODs promote the benefits of making data FAIR and instruct project managers and policy-makers on the characteristics of FAIRification projects. Software-focused BYODs gather software developers and experts on FAIR to implement or improve software resources that are used to support FAIRification. Overall, these BYODs intend to foster collaboration between different types of stakeholders involved in data management, curation, and reuse (e.g. domain experts, trainers, developers, data owners, data analysts, FAIR experts). The BYODs also serve as an opportunity to learn what kind of support for FAIRification is needed from different communities and to develop teaching materials based on practical examples and experience. In this paper, we detail the three different structures of the BYODs and describe examples of early BYODs related to plant breeding data, and rare disease registries and biobanks, which have shaped the structure of the workshops. We discuss the latest insights into making BYODs more productive by leveraging our almost ten years of training experience in these workshops, including successes and encountered challenges. Finally, we examine how the participants’ feedback has motivated the research on FAIR, including the development of workflows and software.
Improving Extraction of Chinese Open Relations Using Pre-trained Language Model and Knowledge Enhancement
Wen C., Jia X., Chen T.
Q1
MIT Press
Data Intelligence, 2023, citations: 0,
Open access,
doi.org, Abstract
Abstract Open Relation Extraction (ORE) is the task of extracting semantic relations from a text document. Current ORE systems have significantly improved their efficiency in obtaining Chinese relations, compared with conventional systems that heavily depend on feature engineering or syntactic parsing. However, these ORE systems do not use robust neural networks such as pre-trained language models to take advantage of large-scale unstructured data effectively. In response to this issue, a new system entitled Chinese Open Relation Extraction with Knowledge Enhancement (CORE-KE) is presented in this paper. The CORE-KE system employs a pre-trained language model (with the support of a Bidirectional Long Short-Term Memory (BiLSTM) layer and a Masked Conditional Random Field (Masked CRF) layer) on unstructured data in order to improve Chinese open relation extraction. Entity descriptions in Wikidata and additional knowledge (in terms of triple facts) extracted from Chinese ORE datasets are used to fine-tune the pre-trained language model. In addition, syntactic features are further adopted in the training stage of the CORE-KE system for knowledge enhancement. Experimental results of the CORE-KE system on two large-scale datasets of open Chinese entities and relations demonstrate that the CORE-KE system is superior to other ORE systems. The F1-scores of the CORE-KE system on the two datasets show relative improvements of 20.1% and 1.3%, respectively, over benchmark ORE systems. The source code is available at https://github.com/cjwen15/CORE-KE.
Evaluation Index System of Green Public Open Space Based on Internet of Things and Mental Health
Li J., Aziz F.B., Zhang N.
Q1
MIT Press
Data Intelligence, 2023, citations: 0,
Open access,
doi.org, Abstract
Abstract With the emergence of the IoT era, wireless sensor networks will be used more and more widely. In addition to collecting, transmitting, and processing simple data such as humidity, temperature, and density, they can also provide multimedia information services such as video and images, enabling more comprehensive and accurate environmental monitoring. MSDs are therefore in great demand in the military, daily life, forestry, biomedicine, and other fields. The intensive city model has obvious advantages in meeting people's diverse needs for a comfortable life; most obviously, it speeds up the rhythm of residents' lives, thereby increasing efficiency and saving time. Starting from this aspect, this paper studies an evaluation index system for green public open space based on the IoT and mental health. A GRNN (generalized regression neural network) model is constructed: the mean condition is calculated, the density function is estimated to produce the network output, and the schematic structure of the generalized regression network is refined. Using the established system, 2018 is selected as the base year; after transformation, standardized values for each year are formed and substituted into cells to form different matrices. The value of each cell is counted to obtain the subsystem coordination degree, and the global coordination degree is obtained by calculation. The evaluation results of ecological civilization construction and development in 2018 were compared with those of 2019, 2020, and 2021. The experimental data show that, compared with 2018, economic development changed from 1 to 2.000, social harmony changed from 1 to 2.480, ecological health decreased to 0.850, environmental friendliness decreased to 0.750, and the comprehensive evaluation decreased to 0.513.
This shows that while the economy developed in these years, the construction of ecological civilization was gradually carried out and achieved good results, reflecting the effectiveness of the system. The evaluation index system of green public open space based on the Internet of Things and mental health has thus been well completed.
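The GRNN mentioned in the abstract is, at its core, a kernel-weighted average of training targets (the Nadaraya-Watson form). A minimal sketch of that prediction step, purely illustrative and not the paper's implementation (the bandwidth `sigma` and data shapes are assumptions):

```python
import math

def grnn_predict(x, train_x, train_y, sigma=0.5):
    """GRNN prediction: each training sample contributes a Gaussian
    kernel weight based on its distance to the query point `x`, and
    the output is the weight-normalized average of the targets."""
    weights = []
    for xi in train_x:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

Because the density estimate is built directly from the training samples, a GRNN needs no iterative training; only the smoothing parameter `sigma` is tuned.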
Slide-Detect: An Accurate Deep Learning Diagnosis of Lung Infiltration
Mohamed A.E., Fayek M.B., Farouk M.
Q1
MIT Press
Data Intelligence, 2023, citations: 2,
Open access,
doi.org, Abstract
Abstract Lung infiltration is a non-communicable condition in which materials with higher density than air exist in the parenchyma tissue of the lungs. Lung infiltration can be hard to detect in an X-ray scan, even for a radiologist, especially at the early stages, making it a leading cause of death. In response, several deep learning approaches have evolved to address this problem. This paper proposes the Slide-Detect technique, a Deep Neural Network (DNN) model based on Convolutional Neural Networks (CNNs) that is trained to diagnose lung infiltration with an Area Under Curve (AUC) of up to 91.47% and an accuracy of 93.85%, using relatively low computational resources.
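The AUC metric reported above can be computed without any plotting, via the rank-sum (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A small self-contained helper (illustrative; the paper does not specify its metric implementation):

```python
def auc_score(labels, scores):
    """ROC AUC via pairwise comparison: fraction of (positive, negative)
    pairs where the positive case is scored higher, ties counting half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is O(n²) but exact; production code would use a sorting-based O(n log n) variant such as scikit-learn's `roc_auc_score`.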
Evaluation on ChatGPT for Chinese Language Understanding
Li L., Zhang H., Li C., You H., Cui W.
Q1
MIT Press
Data Intelligence, 2023, citations: 8,
Open access,
doi.org, Abstract
Abstract ChatGPT has attracted extensive attention from academia and industry. This paper aims to evaluate ChatGPT's Chinese language understanding capability on 6 tasks using 11 datasets. Experiments indicate that ChatGPT achieved competitive results in Chinese sentiment analysis, summarization, and reading comprehension, while it is prone to factual errors in closed-book QA. Further, on two more difficult Chinese understanding tasks, namely idiom fill-in-the-blank and cant understanding, we found that a simple chain-of-thought prompt can improve the accuracy of ChatGPT in complex reasoning. This paper further analyzes the possible risks of using ChatGPT based on these results. Finally, we briefly describe the research and development progress of our ChatBIT.