Browsing by Author "Tekir, Selma"
Now showing 1 - 20 of 23
Master Thesis: Automatic question generation using natural language processing techniques (Izmir Institute of Technology, 2018-07) Keklik, Onur; Tuğlular, Tuğkan; Tekir, Selma
This thesis proposes a new rule-based approach to automatic question generation. The proposed approach focuses on the analysis of both the syntactic and the semantic structure of a sentence. The design and implementation of the proposed approach are also explained in detail. Although the primary objective of the designed system is question generation from sentences, automatic evaluation results show that it also achieves strong performance on reading comprehension datasets, which focus on question generation from paragraphs. With respect to human evaluations, the designed system significantly outperforms all other systems and generates the most natural (human-like) questions.

Master Thesis: Automatic quote detection from literary work (01. Izmir Institute of Technology, 2022-12) Güzel Altıntaş, Aybüke; Tekir, Selma
Literature inspires readers, and readers tend to share quotes from a literary work: they underline quotes in a book and share them on social media or on online platforms used by book readers. A quote is defined as a span of written text that many readers find interesting and can use in different contexts. In this study, a novel task in the field of Natural Language Processing is proposed: the Quote Detection Task. In addition, an original dataset was formed by web-scraping the Goodreads and Gutenberg websites. The quotes come from Goodreads data sourced from Kaggle, and only quotes voted for by 10 or more users were selected. These quotes were validated against the books on the Project Gutenberg website. The final dataset consists of 4554 rows and contains quotes with their book spans. The span of a quote consists of the 10 sentences preceding the quote, the quote itself, and the 10 sentences following it.
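The 10-sentence context window described above can be sketched in a few lines; the sentence splitting and sample data here are simplified assumptions, not the thesis pipeline.

```python
# Illustrative sketch: build a quote "span" as the 10 sentences before the
# quote, the quote itself, and the 10 sentences after, clamped at the book's
# boundaries. The toy "book" below stands in for real tokenized text.

def quote_span(sentences, quote_index, window=10):
    """Return the context window around the sentence at quote_index."""
    start = max(0, quote_index - window)
    end = min(len(sentences), quote_index + window + 1)
    return sentences[start:end]

book = [f"Sentence {i}." for i in range(100)]
span = quote_span(book, quote_index=50)
print(len(span))  # 21: 10 before + the quote + 10 after
```

Near the start or end of a book the window is simply truncated, so spans can be shorter than 21 sentences.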
Conditional Random Field (CRF) and Extractive Summarization as Text Matching (MatchSum) were run as two baselines for quote detection. The Quote Detection Task is a span-detection problem that can be modeled with the sequence-labeling and neural extractive summarization methods in the literature. For the sequence-tagging formulation, the statistics-based CRF was run as the first baseline; MatchSum was chosen as the second baseline for the experimental part. These baselines obtained Rouge-1 scores of 27.24% and 40.54%, respectively.

Master Thesis: Automatic story construction from news articles in an online fashion (Izmir Institute of Technology, 2019-07) Can, Özgür; Tekir, Selma
Every day, thousands of local and global news articles appear online. Each arriving news piece can have a connection with some previous information, but in a large-scale news flow it is quite difficult for readers to integrate news and evaluate the agenda in the light of the past. Thus, grouping news coherently to construct news stories is a fundamental requirement. To meet this requirement, meaningful representations of the documents on which the clustering is performed must first be extracted, and the news story clusters must be generated on the fly, in an online fashion. In this work, we analyze the complex relations among news articles and propose a system to generate continuously updated news stories in an online fashion. As part of the experimental validation, we provide a step-by-step construction of a meaningful news story out of news articles coming from different sources. The constructed news stories demonstrate the usefulness of the developed system.

Master Thesis: Automatic, fast and accurate sequence decontamination (Izmir Institute of Technology, 2016-07) Bağcı, Caner; Allmer, Jens; Tekir, Selma
The introduction of massively parallel sequencing technologies was a revolutionary step in genomics.
Their decreasing cost and powerful features have put them more and more in demand over the last decade. It is now possible to sequence even complete genomes of organisms using massively parallel sequencing technologies, even for small laboratories around the world. However, this powerful technology comes with its challenges, on both the technological and the computational side of the work. In this work, one of these computational challenges is addressed and a novel algorithm is offered to solve it. Sequencing by synthesis is one of the methods used in many different massively parallel sequencing instruments. This method utilizes the biological process of DNA replication and, with the help of different means of detection, allows sequencing a DNA molecule while it is replicated. Since DNA polymerase requires a primer to start the replication reaction, short oligonucleotide adapters are used in sequencing-by-synthesis methods to initiate the reaction. However, certain circumstances allow these adapters to contaminate the final sequence reads. Several tools have been offered to trim adapters from reads, but all depend on the bioinformatician's prior knowledge of the adapter sequence. In this work, an algorithm is offered that detects and trims adapters using only the sequences of the reads, without relying on prior knowledge of adapter sequences. The algorithm was shown to perform better than, or on a par with, existing methods in terms of speed and efficiency.

Master Thesis: Classification of contradictory opinions in text using deep learning methods (01. Izmir Institute of Technology, 2020-12) Oğul, İskender Ülgen; Tekir, Selma
The natural language inference (NLI) problem aims to ensure the consistency as well as the accuracy of propositions while making sense of natural language. The task is to classify the relationship between two given sentences as contradiction, entailment, or neutrality.
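As a toy illustration of this three-way label scheme (not the deep models this thesis actually trains), a crude lexical heuristic for sentence pairs might look like:

```python
# Toy illustration of the contradiction / entailment / neutral label scheme.
# This lexical-overlap-plus-negation heuristic is only a stand-in for the
# deep learning models the thesis uses; it exists to make the labels concrete.

NEGATIONS = {"not", "no", "never", "n't"}

def toy_nli(premise, hypothesis):
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    # A negation in exactly one of the two sentences suggests contradiction.
    if bool(p & NEGATIONS) != bool(h & NEGATIONS):
        return "contradiction"
    # High word overlap suggests entailment, low overlap neutrality.
    overlap = len(p & h) / max(1, len(h))
    return "entailment" if overlap > 0.6 else "neutral"

print(toy_nli("a man is sleeping", "a man is not sleeping"))  # contradiction
```

Real NLI models learn such distinctions from data rather than from hand-written rules, which is why the thesis turns to attention-based and LSTM architectures.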
To accomplish the classification task, sentences or words must be translated into mathematical representations called vectors or embeddings. The vectorization of a sentence is as important as the complexity of the classification model. In this study, both pre-trained (GloVe, fastText, Word2Vec) and contextual (BERT) word embedding methods were used and compared to obtain the best result. NLI is a highly complex natural language processing task for which conventional machine learning methods are insufficient, so more advanced solutions are required. This study used deep learning methods to perform the classification task; unlike conventional machine learning approaches, deep learning approaches reduce errors and increase accuracy by iterating over the data many times. Opinion sentences have complex grammatical structures that are difficult to classify. This study used Decomposable Attention and Enhanced LSTM for natural language inference to perform the NLI classification task. Using the Enhanced LSTM deep learning method with BERT contextual vectors on the SNLI dataset, an accuracy of 88.0% was obtained, very close to the state-of-the-art result of 92.1%. To show the usability of the developed solution in different NLI tasks, an accuracy of 80.02% was obtained in experiments on the MNLI dataset.

Doctoral Thesis: Discovering specific semantic relations among words using neural network methods (Izmir Institute of Technology, 2021-10) Sezerer, Erhan; Tekir, Selma
Human-level language understanding is one of the oldest challenges in computer science. Much scientific work has been dedicated to finding good representations for semantic units (words, morphemes, characters) in languages.
Recently, contextual language models such as BERT and its variants have shown great success in downstream natural language processing tasks through masked language modeling and transformer structures. Although these methods solve many problems in this domain and have proved useful, they still lack one crucial aspect of language acquisition in humans: experiential (visual) information. Over the last few years, there has been an increase in studies that consider experiential information by building multi-modal language models and representations. Several studies have shown that language acquisition in humans starts with learning concrete concepts through images and then continues with learning abstract ideas through text. In this work, a curriculum learning method is used to teach the model concrete and abstract concepts through images and corresponding captions, to accomplish the task of multi-modal language modeling and representation. BERT and ResNet-152 models are used on each modality, with an attentive pooling mechanism, on a newly constructed dataset collected from Wikimedia Commons. To show the performance of the proposed model, downstream tasks and ablation studies are performed. The contribution of this work is twofold: a new dataset is constructed from Wikimedia Commons, and a new multi-modal pre-training approach based on curriculum learning is proposed. Results show that the proposed multi-modal pre-training approach increases the success of the model.

Master Thesis: Enriching contextual word embeddings with character information (Izmir Institute of Technology, 2020-07) Polatbilek, Ozan; Tekir, Selma
Natural Language Processing has become more and more popular with the recent advances in Artificial Intelligence. Fundamental improvements have been introduced in word representations to store semantic and/or syntactic features. With the recently published language model BERT, contextual word vectors can be generated.
However, this model does not process character-level information. In morphologically rich languages such as Turkish, the model's perception of syntax could therefore be improved. In this thesis, a new model called BERT-ELMo, a combination of BERT and ELMo, is proposed to enrich BERT with character-level information. The model combines the character-level processing part of ELMo with the contextual word representation part of BERT. To show the effectiveness of the proposed model, both quantitative (question answering) and qualitative (word analogy, word contextualization, morphological meaning, out-of-vocabulary word capturing) analyses are performed, and the model is compared with BERT on Turkish. Thanks to the character-level addition, the proposed model can be trained in any language without any pre-analysis. To the best of our knowledge, this is the first study that uses morphological analysis to train the BERT model in Turkish, and the first model to integrate a character-level module into BERT.

Master Thesis: Enrichment of Turkish Question Answering systems using knowledge graphs (01. Izmir Institute of Technology, 2023-07) Çiftçi, Okan; Soygazi, Fatih; Tekir, Selma
In the era of digital communication, the ability to effectively process and interpret human language has become a key research area. Natural Language Processing (NLP) has emerged as a field that enables machines to better understand and analyze human language. One of the most important applications of NLP is the development of question answering systems, which are essential in various domains such as customer service, search engines, and chatbots. To answer incoming queries, question answering systems rely on knowledge graphs as a reliable source. This thesis proposes a Turkish Question Answering (TRQA) system that utilizes a knowledge graph.
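Multi-hop question answering over a knowledge graph can be pictured as chained triple lookups; the film facts and relation names below are invented for illustration, and the thesis uses learned models rather than this direct traversal.

```python
# Minimal sketch of answering a multi-hop question over a knowledge graph
# stored as (head, relation, tail) triples. The facts and relation names
# are made up for this example.

triples = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]

def hop(entity, relation):
    """All tails reachable from `entity` via `relation`."""
    return [t for h, r, t in triples if h == entity and r == relation]

# Two-hop query: "Where was the director of Inception born?"
directors = hop("Inception", "directed_by")
birthplaces = [city for d in directors for city in hop(d, "born_in")]
print(birthplaces)  # ['London']
```

Each "hop" follows one relation; a multi-hop question composes several such hops before reaching the answer entity.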
The research focuses on the automatic construction of a knowledge graph specific to the film industry, as well as the creation of a multi-hop question-answering dataset that can be queried from this graph. Building upon these constructions, we develop a deep-learning-based method for answering questions using the constructed knowledge graph. The constructed knowledge graph is compared with various knowledge graphs presented in the literature using the DistMult, ComplEx, and SimplE methods for the link prediction task. Additionally, the proposed question answering system is compared with the baseline study and with a generative large language model through quantitative and qualitative analyses.

Master Thesis: An event-based Hidden Markov Model approach to news classification and sequencing (Izmir Institute of Technology, 2014) Çavuş, Engin; Tekir, Selma
Over the past years, the number of published news articles has increased enormously. In the past, there were fewer channels of communication, and articles were classified by human operators. As the means of communication increased and expanded rapidly, the need for an automated news classification tool became inevitable. Text classification is a statistical machine learning procedure in which individual text items are placed into groups based on quantitative information. In this study, an event-based news classification and sequencing system is proposed and its model is explained, the decision-making process is presented, and a case study is prepared and analyzed.

Master Thesis: A feedback-based testing methodology for network security software (Izmir Institute of Technology, 2013) Gerçek, Gürcan; Tekir, Selma
As part of network security testing, an administrator needs to know whether the firewall enforces the security policy as expected or not. In this setting, black-box testing and evaluation methodologies can be helpful.
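A common device in such black-box evaluation is mutating a policy by flipping a single bit; a sketch, with a made-up bit encoding for a firewall rule:

```python
# Sketch of a single-bit-flip mutation on a firewall rule encoded as a bit
# string. The 8-bit encoding is a made-up example; real policies encode
# fields such as source/destination address, port, and action.

def flip_bit(rule_bits, position):
    """Return a mutant rule with the bit at `position` inverted."""
    flipped = "1" if rule_bits[position] == "0" else "0"
    return rule_bits[:position] + flipped + rule_bits[position + 1:]

original = "10110001"          # e.g. last bit as the accept/deny action
mutant = flip_bit(original, 7)
print(mutant)  # 10110000
```

A test suite that fails to distinguish the mutant from the original policy reveals a gap in its selected test cases.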
In this work, we employ a simple mutation operation, namely flipping a bit, to generate mutant firewall policies and use them to evaluate our previously proposed weighted test case selection method for firewall testing. In the previously proposed firewall testing approach, abstract test cases that are automatically generated from firewall decision diagrams are instantiated by selecting test input values from different test data pools for each field of the firewall policy. Furthermore, a case study is presented to validate the proposed approach.

Master Thesis: Finding out subject-matter experts and research trends using bibliographic data (Izmir Institute of Technology, 2015-09) Karataş, Arzum; Tekir, Selma
With the prevalent use of information technology, it is very easy to reach nearly any information. However, for someone who wants to specialize in an area, the first thing to do is to find out who the experts in that area are. Since experts hold valuable knowledge, finding them is important. It is also vital to be aware of trends for researchers who want to become experts in a topic or who want to enter a new area. This work includes an empirical study for finding experts and research trends in the academic world. We created a citation network from KDD proceedings and an author-keyword bipartite graph from the bibliographic data of the same set of proceedings. Then, we applied the link analysis algorithms HITS and PageRank, respectively. The results show that it is possible to detect two expert types (one that works intensively on a single subject and another having high-level knowledge of various subtopics of a subject-matter).
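The HITS link analysis applied here can be sketched with power iteration on a toy citation graph; the papers and edges are invented for illustration, while the thesis runs HITS on a citation network built from KDD proceedings.

```python
# Power-iteration sketch of the HITS algorithm: authorities are papers cited
# by good hubs, hubs are papers that cite good authorities.

def hits(edges, nodes, iters=50):
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority score: sum of hub scores of the papers citing you.
        auth = {n: sum(hub[u] for u, v in edges if v == n) for n in nodes}
        # Hub score: sum of authority scores of the papers you cite.
        hub = {n: sum(auth[v] for u, v in edges if u == n) for n in nodes}
        # Normalize to keep the scores bounded.
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {n: s / na for n, s in auth.items()}
        hub = {n: s / nh for n, s in hub.items()}
    return hub, auth

nodes = ["p1", "p2", "p3"]
edges = [("p1", "p3"), ("p2", "p3")]  # p1 and p2 both cite p3
hub, auth = hits(edges, nodes)
print(max(auth, key=auth.get))  # p3 is the top authority
```

In the expert-finding setting, a high authority score marks the single-subject specialist, while a high hub score marks the author whose work points to many strong subtopics.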
Moreover, topical trends are identified as peaking, periodic, or keeping the same shape, rather than showing an absolute increase, decrease, or stationary behavior.

Master Thesis: Identifying communities using collaboration and word association networks in Turkish social media (Izmir Institute of Technology, 2018-12) Atay, Abdullah Asil; Tekir, Selma
Social media content has always been an attractive subject for researchers. Scores of people use social media and share their ideas through pictures, videos, or documents; researchers analyze this information and try to extract useful insights from it, and many consider the analysis of social media a very important research area. There are many social media platforms with Turkish content; Ekşisözlük, for example, is a popular Turkish-language platform in Turkey. Within the scope of this thesis, Ekşisözlük content was downloaded, decomposed, and actively used. Social media consists of human-made products, and shared content exhibits similarities; in this thesis, several methods are used to calculate these similarities. Two different networks are created from the same content: a word association network and a collaboration network. The word association network is built from the co-occurrence of words within a specific window size. The collaboration network is built from different users entering content under the same title, which indicates the similarity of users. These two networks are analyzed separately, and community information is deduced from them.

Master Thesis: An implementation model for open sources evaluation (Izmir Institute of Technology, 2004) Tekir, Selma; Koltuksuz, Ahmet Hasan
Open sources contain publicly available information at a low cost. The amount of open sources information an individual has access to has increased dramatically in the 21st century.
The question is how to evaluate these diversified open sources in producing intelligence, supporting policymakers in making their decisions by providing them with the necessary information. Retrieving actionable intelligence from huge amounts of open sources information and performing source validation are big challenges. In this study, an open sources intelligence (OSINT) integrated intelligence model is proposed, explained, and compared with the traditional intelligence model. The contemporary intelligence analyst and his or her OSINT-making process are explained. A case study of OSINT is prepared and analyzed, and the analysis results are given.

Conference Object (Citation - WoS: 0): INFORMATION AND COMMUNICATION TECHNOLOGY SECTOR STRATEGY MAP OF IZMIR (Lookus Scientific, 2013) Tuglular, Tugkan; Tekir, Selma; Velibeyoglu, Koray
This study aims to understand the current dynamics of Izmir's ICT sector by mapping the spatial distribution of its firms. It is based on a series of analyses produced for the Izmir Development Agency in 2012 within the frame of the preparation of the 2014-2023 Izmir Regional Development Plan. It conducts a Delphi survey to support situational knowledge as well as trend prediction for the next 10-year period. Furthermore, a gap analysis is performed to measure the margin between the current situation of the ICT sector and the future trends predicted by experts. The study also maps the location preferences of Izmir's ICT sector based on the Izmir Chamber of Commerce's publicly available web-based database. It illustrates that the ICT sector's trend is largely based on centripetal and spontaneously developed clusters placed in the central part of the city. On the other hand, planned technology regions and science parks are relatively immature and need to be developed.
In the light of this dichotomy, this study proposes a strategy map for Izmir's ICT sector.

Master Thesis: A language modeling approach to detect bias (Izmir Institute of Technology, 2020-07) Atik, Ceren; Tekir, Selma
Technology is developing day by day and is involved in every area of our lives. Technological innovations such as artificial intelligence can strengthen social biases that already exist in society, regardless of the developers' intentions; therefore, researchers should be aware of this ethical issue. In this thesis, the effect of gender bias, one of these social biases, on occupation classification is investigated. To this end, a new dataset was created by collecting obituaries from the New York Times website, handled in two different versions: with and without gender indicators. Since occupation and gender are independent variables, gender indicators should not have an impact on the occupation predictions of the models. In this context, to investigate gender bias in occupation prediction, a model in which occupation and gender are learned jointly is evaluated, as well as models that perform only occupation classification. The results indicate that gender bias plays a role in occupation classification.

Master Thesis: A lattice-based approach for news chain construction (Izmir Institute of Technology, 2015-07) Toprak, Mustafa; Tekir, Selma; Allmer, Jens
Each news article and column can be part of a news story or chain created manually by journalists and columnists. However, the increasing amount of data published by news companies each year makes manual analysis, and thus the creation of news stories and chains, almost impossible. Given the amount of data, it is obvious that the support of automated systems is vital to journalists, columnists, and intelligence analysts. A news chain is a set of news articles that form a connected and coherent whole.
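The lattice idea named in this entry's title can be illustrated with a toy formal-concept construction over an inverted index; the articles and terms below are invented for the sketch.

```python
# Toy formal-concept sketch: from an inverted index (term -> articles),
# derive concepts (article set, shared term set) ordered by inclusion.

from itertools import combinations

index = {                      # inverted index: term -> set of articles
    "election": {"a1", "a2"},
    "protest":  {"a2", "a3"},
    "economy":  {"a1", "a2", "a3"},
}

def extent(terms):
    """Articles containing all the given terms (all articles if none)."""
    if not terms:
        return set.union(*index.values())
    return set.intersection(*(index[t] for t in terms))

def intent(docs):
    """Terms shared by all the given articles."""
    return {t for t, d in index.items() if docs <= d}

concepts = set()
for r in range(len(index) + 1):
    for combo in combinations(index, r):
        docs = extent(set(combo))
        concepts.add((frozenset(docs), frozenset(intent(docs))))

for docs, shared in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(docs), sorted(shared))
```

The concepts nest by article-set inclusion, giving exactly the partial order the thesis exploits to grow coherent chains without fixed start and end articles.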
In the traditional "connecting the dots" approach, news chains are constructed from two given articles that serve as the start and end news of the chain. In this study, a method is proposed to create coherent news chains without predetermining the start and end articles of the chain. The intuition of the method comes from the partial order relation among news articles. We try to show that a lattice structure can represent the relation, or hierarchy, among news articles, which have a partial order in nature. The concept lattice is prepared out of the inverted index structure of the news articles, which is one of the main contributions of the study. In the experimental work, an artificial dataset is processed to show the steps of the method; after that, we also provide an evaluation on a real dataset.

Master Thesis: News story analysis with credibility assessment by opinion mining (Izmir Institute of Technology, 2015-07) Sezerer, Erhan; Tekir, Selma
With the growing influence of media and the popularity and widespread use of social networks, the credibility of news sources has become an important subject that needs more attention. The biggest problem in finding credible sources is that, instead of covering every aspect of an incident, news sources tend to accept one party's view as a whole while rejecting every other view, or, even worse, they focus on only one side of the incident and ignore the rest. Credibility is defined as "the quality of being believable and trustworthy". The notion of trustworthiness can further be decomposed into components like bias, fairness, being factual/opinionated, etc. In this thesis, credibility is measured using the fact/opinion ratio of the articles. Two methods are proposed: the traditional Naive Bayes method and the Relativistic method. The intuition of the Relativistic method comes from the theory of relativity, where the sentiment of the articles is determined relative to the ordinary context used by people in daily speech.
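The fact/opinion classification behind such a ratio can be sketched with a minimal Naive Bayes classifier; the training sentences below are invented, and the thesis models are more elaborate.

```python
# Minimal Naive Bayes sketch for labeling sentences as fact or opinion,
# from which a fact/opinion ratio per article could be computed.

from collections import Counter
import math

train = [
    ("the company reported earnings on monday", "fact"),
    ("the report was released in july", "fact"),
    ("i believe this policy is terrible", "opinion"),
    ("this is clearly a wonderful idea", "opinion"),
]

word_counts = {"fact": Counter(), "opinion": Counter()}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def classify(sentence):
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Log-probabilities with add-one (Laplace) smoothing.
        scores[label] = sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in sentence.split()
        )
    return max(scores, key=scores.get)

print(classify("i believe the idea is wonderful"))  # opinion
```

Running the classifier over every sentence of an article and taking the proportion labeled "fact" yields the credibility signal the thesis describes.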
We have tested our methods on four different types of data (hand-written articles, editorials, New York Times articles, and Reuters articles) and aimed to show that our proposed models are able to differentiate the sentiments in the articles. In the experimental work, we provide a detailed evaluation of the results.

Master Thesis: Recognition of counterfactual statements in Turkish (01. Izmir Institute of Technology, 2023-07) Acar, Ali; Tekir, Selma
Counterfactual statements describe an event that did not happen or cannot happen, and optionally the consequence of that event if it were to happen. Counterfactual statements are the building blocks of human thought processes, as people constantly reflect upon past happenings and consider their future implications. Counterfactual reasoning is essential for machine intelligence and explainable artificial intelligence studies, and detecting counterfactuals automatically with machine learning algorithms is crucial for these areas. This thesis presents the development of the first-ever Turkish counterfactual detection dataset. It presents a comprehensive classification baseline and expands the scope of counterfactual detection to include the Turkish language.

Master Thesis: Reproducibility assessment of research code repositories (01. Izmir Institute of Technology, 2023-07) Akdeniz, Eyüp Kaan; Tekir, Selma
The growth in machine learning research has not been accompanied by a corresponding improvement in the reproducibility of results. This thesis presents a novel, fully automated end-to-end system that evaluates the reproducibility of machine learning studies based on the content of the associated GitHub project's Readme file. The evaluation relies on a readme template derived from an analysis of popular repositories; the template suggests a structure that promotes reproducibility.
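Template-based scoring of a Readme can be sketched as the fraction of template sections present in its headings; the section list here is an assumption for illustration, not the template derived in the thesis.

```python
# Illustrative readme scoring: the share of expected template sections found
# among a Readme's markdown headings. The section names are assumptions.

TEMPLATE_SECTIONS = ["installation", "requirements", "usage",
                     "training", "evaluation", "citation"]

def reproducibility_score(readme_text):
    headings = [line.lstrip("#").strip().lower()
                for line in readme_text.splitlines()
                if line.startswith("#")]
    found = sum(any(sec in h for h in headings) for sec in TEMPLATE_SECTIONS)
    return found / len(TEMPLATE_SECTIONS)

readme = "# Installation\n...\n# Usage\n...\n# Citation\n..."
print(reproducibility_score(readme))  # 0.5
```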
Our system generates a reproducibility score for each Readme file assessed, and it employs two distinct models, one based on section classification and the other on hierarchical transformers. The experimental outcomes indicate that the system based on section similarity outperforms the hierarchical transformer model. Furthermore, it has a superior edge concerning explainability, as it allows for a direct correlation of the scores with the respective sections of the Readme files. The proposed framework provides an important tool for improving the quality of code sharing and ultimately helps to increase reproducibility in machine learning research.

Article (Citation - WoS: 6): Rule-Based Automatic Question Generation Using Semantic Role Labeling (IEICE - Institute of Electronics, Information and Communication Engineers, 2019) Keklik, Onur; Tuglular, Tugkan; Tekir, Selma
This paper proposes a new rule-based approach to automatic question generation. The proposed approach focuses on the analysis of both the syntactic and the semantic structure of a sentence. Although the primary objective of the designed system is question generation from sentences, automatic evaluation results show that it also achieves strong performance on reading comprehension datasets, which focus on question generation from paragraphs. In particular, with respect to the METEOR metric, the designed system significantly outperforms all other systems in automatic evaluation. As for human evaluation, the designed system exhibits similar performance by generating the most natural (human-like) questions.
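The rule-based generation described in the question-generation entries above can be illustrated with a single agent-focused rule over a semantic role labeling (SRL) parse; the parse below is hand-supplied for the example, not computed by an SRL system.

```python
# One-rule illustration of rule-based question generation from a semantic
# role labeling parse. A real system applies many such rules to SRL output
# produced by a parser; this parse dict is hand-supplied.

def who_question(srl):
    """Agent-focused rule: replace the agent with 'Who'."""
    return f"Who {srl['verb']} {srl['patient']}?"

parse = {"agent": "Marie Curie", "verb": "discovered", "patient": "radium"}
print(who_question(parse))  # Who discovered radium?
```

Further rules of the same shape target other roles (patient, time, location), yielding "What", "When", and "Where" questions from a single parsed sentence.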