January 26, 15h, Allan Tucker
In particular I will discuss the collection of longitudinal data and how this creates challenges for diagnosis and the modelling of disease progression. I will then discuss how cross-sectional studies offer additional useful information that can be used to model disease diversity within a population but lack valuable temporal information. Finally, I will discuss the importance of inferring models that generalise well to new independent data and how this can sometimes lead to new challenges, where the same variables can represent subtly different phenomena. Some examples in ecology and genomics will be described.
***************************************
18:00 Hours, INESC Auditorium A
****************************************
Presenter: Paula Branco
Title: Utility-based Predictive Analytics with the UBL Package
Abstract:
Many real-world applications encompass domain-specific information which, if disregarded, may strongly penalize the performance of predictive models. In contexts such as finance, medicine and ecology, among many others, specific domain information concerning the preference bias of the users must be taken into account to enhance the models' predictive performance. In this seminar we will address the problem of utility-based learning. We will show the main challenges of this type of problem and a broad taxonomy of the existing solutions. We will introduce the R package UBL, which implements several approaches for tackling utility-based problems in both classification and regression tasks. Finally, we will provide some simple examples of solutions implemented in the UBL package, showing how they can be used.
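The preference bias described above can be made concrete with a small sketch: instead of maximising accuracy, pick the decision threshold that maximises average utility under a user-supplied payoff matrix. The matrix, scores and labels below are illustrative toys of ours, not the UBL package's API (UBL itself is an R package).

```python
# Sketch of utility-based thresholding for a binary classifier.
# The payoff matrix maps (true label, predicted label) -> utility.

def expected_utility(y_true, y_pred, utility):
    """Average payoff of the predictions under the utility matrix."""
    return sum(utility[(t, p)] for t, p in zip(y_true, y_pred)) / len(y_true)

def best_threshold(scores, y_true, utility):
    """Scan a coarse grid of thresholds and keep the most useful one."""
    grid = [i / 20 for i in range(1, 20)]
    def preds(th):
        return [1 if s >= th else 0 for s in scores]
    return max(grid, key=lambda th: expected_utility(y_true, preds(th), utility))

# A fraud-like preference bias: missing a positive costs far more
# than raising a false alarm.
utility = {(1, 1): 10.0, (1, 0): -50.0, (0, 1): -1.0, (0, 0): 0.0}
scores = [0.9, 0.2, 0.4, 0.05, 0.35, 0.8]   # classifier scores for "positive"
y_true = [1, 0, 1, 0, 0, 1]
th = best_threshold(scores, y_true, utility)
```

With this cost structure the chosen threshold drops well below 0.5, trading extra false alarms for fewer missed positives, which is exactly the kind of behavior a plain accuracy-maximising model would not exhibit.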
Short Bio:
Paula Branco has a degree in Mathematics and a Master's degree in Computer Science, both from the University of Porto. Currently she is a PhD student in the MAP-i Doctoral Programme, working on "Utility-based Predictive Analytics" under the supervision of Prof. Luís Torgo and Prof. Rita Ribeiro.
Her main research interests are in Machine Learning and Data Mining and, in particular, in imbalanced distributions, outlier detection, forecasting of rare extreme values, and performance assessment.
Presenter: Arkady Zaslavsky
Title: Big Data meets Internet of Things
Abstract:
The Internet of Things (IoT) is one of the major disruptive technologies and sits at the top of Gartner's hype curve for 2014/2015. IoT will connect billions of "things", where things include computers, smartphones, sensors and objects from everyday life. According to the predictions of many experts, IoT will be the main source of big data. This talk focuses on the challenges of the IoT and the disruptively big data it generates. The talk will also showcase a CSIRO IoT technology which brings together sensing and cloud computing and is an efficient open platform for handling IoT data streams of high volume, velocity, value and variety. A case study built on the basis of the OpenIoT platform will also be presented.
Bio:
Dr Arkady Zaslavsky is a Senior Principal Research Scientist at Data61 @ CSIRO. He is leading the scientific area of IoT at Data61 and leads a number of projects and initiatives. Before coming to CSIRO in July 2011, he held the position of Chaired Professor in Pervasive and Mobile Computing at Luleå University of Technology, Sweden, where he was involved in a number of European research projects, collaborative projects with Ericsson Research, PhD supervision and postgraduate education. He currently holds the titles of Research Professor at LTU (Sweden), Adjunct Professor at UNSW (Sydney), Adjunct Professor at La Trobe University (Melbourne), and Visiting Professor at ITMO University, St. Petersburg. Between 1992 and 2008 Arkady was a full-time academic staff member at Monash University, Australia, where he held various academic and administrative positions, including Director of the Monash Research Centre for Distributed Systems and Software Engineering and Director of the Monash CoolCampus initiative that brought together pervasive computing researchers with university users. He was a Principal Investigator at the CRC DSTC Distributed Systems Technology Centre, leading and contributing to the DSTC project “M3: Enterprise Architecture for Mobile computations”. He led and was involved in a number of research projects funded by the ARC (ARC Discovery and Linkage) and industry, at both national and international level, totalling more than AU$12,000,000 in support. He chaired and organised many international workshops and conferences, including Mobile Data Management, Pervasive Services, Mobile and Ubiquitous Multimedia, and others. Arkady has made internationally recognised contributions in the areas of disconnected transaction management and replication in mobile computing environments, context-awareness, and mobile agents.
He has made significant internationally recognised contributions in the areas of data stream mining on mobile devices, adaptive mobile computing systems, ad-hoc mobile networks, efficiency and reliability of mobile computing systems, mobile agents and mobile file systems. Arkady received an MSc in Applied Mathematics, majoring in Computer Science, from Tbilisi State University (Georgia, USSR) in 1976 and a PhD in Computer Science from the Moscow Institute for Control Sciences (IPU-IAT), USSR Academy of Sciences, in 1987. Before coming to Australia in 1991, Arkady worked in various research positions at industrial R&D labs as well as at the Institute for Computational Mathematics of the Georgian Academy of Sciences, where he led a systems software research laboratory. Arkady Zaslavsky has published more than 400 research publications throughout his professional career and supervised to completion more than 35 PhD students. Dr Zaslavsky is a Senior Member of the ACM and a Senior Member of the IEEE Computer and Communications Societies.
To overcome these problems, we will present a new model named the Compact Prediction Tree (CPT+). CPT+ is built by losslessly compressing the training sequences, ensuring that all relevant information is available for each prediction. Furthermore, CPT+ relies on an indexing mechanism to allow fast sequence searching and matching, and is a more complex prediction algorithm that integrates several optimizations.
Experimental results on seven real-life datasets from various domains show that CPT+ has the best overall accuracy when compared to six state-of-the-art sequence prediction models from the literature: All-K-order Markov, CPT, DG, Lz78, PPM and TDAG.
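The basic task CPT+ addresses, predicting the next symbol of a sequence from a set of training sequences, can be sketched with a far simpler suffix-voting baseline. The class below is our own illustration of the task setting; it implements none of CPT+'s compression or indexing.

```python
# Toy next-symbol predictor: vote on the continuation of the last
# `window` items of the prefix, over all training sequences.
from collections import Counter

class SuffixPredictor:
    def __init__(self, train_sequences):
        self.train = train_sequences

    def predict(self, prefix, window=2):
        """Return the most frequent symbol following the suffix, or None."""
        key = tuple(prefix[-window:])
        votes = Counter()
        for seq in self.train:
            for i in range(len(seq) - window):
                if tuple(seq[i:i + window]) == key:
                    votes[seq[i + window]] += 1
        return votes.most_common(1)[0][0] if votes else None

model = SuffixPredictor([list("abcabd"), list("abcabc"), list("zabca")])
next_symbol = model.predict(list("ab"))   # "ab" is most often followed by "c"
```

Models such as All-K-order Markov and CPT+ refine exactly this matching-and-voting step, with very different trade-offs in memory and lookup time.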
Bio: Philippe Fournier-Viger (Ph.D.) is an assistant professor at the University of Moncton, Canada. He received a Ph.D. in Cognitive Computer Science from the University of Quebec in Montreal (2010). He has published more than 75 research papers in refereed international conferences, books and journals. His research interests are data mining, algorithm design, pattern mining, sequence mining, sequence prediction, text mining and e-learning. He is the founder of the popular SPMF open-source data mining library, specialized in pattern mining, which has been cited in more than 170 papers since 2010.
****************************************
11th Seminar, 31st March, Tuesday,
11:00 Hours, INESC Auditorium A
*****************************************
Presenter: Ana Costa e Silva
Title:
Measure, Model, Deploy: analytics maturity at the hand of the end-user with TIBCO's suite
Abstract:
Lots of value can be created when an organisation has streamlined access to its data, allowing it to identify issues and forces early on. The next step of analytics maturity consists of understanding the past via predictive modelling and operational optimisation. But the full value of analytics is only set free when organisations deploy those models in real time and start managing the present as it happens. Real-time long-distance monitoring of equipment, of whole factories, or of health readings from hospital patients then becomes possible. Or continuous transaction monitoring for fraud detection. Or tracking of customers' activities on our website for maths-supported recommendations that are helpful for once.
In this talk we will show how TIBCO's products can be woven to give the business end-user an easy-to-use interface to accomplish all those goals, on small or big data.
Bio:
For the last 15 years, Ana has been passionate about searching for and finding gems in data; Maths and Stats have always been the connecting thread throughout her career. After initial studies in management and a Master's degree in data analysis (MADSAD) at FEP-Porto, plus 7 years working in the Statistics department of the Portuguese Central Bank, Ana completed a PhD in computer science at Edinburgh University and then spent a further 4 years researching the inner workings of the global stock market for Edinburgh Partners. Ana now works within TIBCO Spotfire's Industry Analytics Group. The group diligently surfaces sector-specific use cases that can deliver to TIBCO's clients the full value of their data, via analytics and visualisation. Representing the Group in EMEA and Latin America, Ana has helped a number of organisations in their baby or sage steps into the world of Analytics and Big Data.
*****************************************
Presenter: Inês Dutra (CRACS)
Title: Effective Classification of non-definitive ARS biopsies using First-Order Rules
Abstract:
Expert knowledge expressed in the form of first-order rules is used in order to improve the performance of machine-learned Naïve Bayes models on a subgroup of non-definitive biopsies. Our results show that well-tailored rules specific to a subgroup of non-definitive biopsies combined with Naïve Bayes models can improve routine practice by saving women from going to excision surgery while keeping 100% sensitivity.
****************************************
Presenter: Theofrastos Mantadelis
Title: MetaProbLog for Probabilistic Logic Programming and Learning
Abstract:
MetaProbLog is a framework for the ProbLog probabilistic logic programming language. ProbLog extends Prolog programs by annotating facts with probabilities; in that way it defines a probability distribution over Prolog programs. ProbLog follows the distribution semantics presented by Sato. MetaProbLog extends the semantics of ProbLog by defining a "ProbLog engine" which permits the definition of probabilistic meta-calls.
MetaProbLog uses state-of-the-art knowledge compilation methods, tabling, and several optimizations in order to provide efficient probabilistic inference. Beyond supporting the semantics and features of ProbLog, MetaProbLog introduces semantics for probabilistic meta-calls and has several unique features, such as datasets and memory management.
In this talk we will present some key differences among the three existing ProbLog systems and present several motivating applications, such as probabilistic graph mining, parameter learning, and probabilistic structure learning.
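The distribution semantics mentioned above can be illustrated by brute-force enumeration: each probabilistic fact is independently true or false, which defines a distribution over "possible worlds", and a query's probability is the total mass of the worlds in which it holds. The two-edge path example below is our own toy, written in Python for self-containment rather than in ProbLog syntax.

```python
# Enumerate all truth assignments to the probabilistic facts and sum
# the probability of those worlds in which the query holds.
from itertools import product

facts = {"edge(a,b)": 0.8, "edge(b,c)": 0.6}

def query_prob(holds):
    """Probability that `holds(world)` is true under the fact distribution."""
    names = list(facts)
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        p = 1.0
        for name, bit in world.items():
            p *= facts[name] if bit else 1 - facts[name]
        if holds(world):
            total += p
    return total

# path(a,c) holds only in worlds where both edges are present.
p = query_prob(lambda w: w["edge(a,b)"] and w["edge(b,c)"])
```

Real ProbLog systems avoid this exponential enumeration by compiling queries into compact representations (the knowledge compilation methods mentioned in the abstract), but the semantics computed is the same.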
Bio:
Theofrastos Mantadelis is a postdoctoral researcher at the Computer Science department of the University of Porto. He received his PhD from KU Leuven, focusing on the efficiency of ProbLog, where he was the key implementer of two of the ProbLog systems. His research interests lie mostly in logic programming and probabilistic logic programming.
Presenters: Luis Trigo, Rui Sarmento and Pavel Brazdil
Title: Affinity Miner applied to Researchers' Publications via Network Analysis and Keywords
Abstract:
A case study and demo oriented towards five INESC TEC centers (LIAAD, CRACS, CESE, CTM, CEGI), accompanied by a brief description of the methods implemented.
Finding people with similar skills within a domain may provide important support for managing research centers. Academic production is easily accessible in academic and bibliographic databases, and it can be used to uncover affinities among researchers that are not yet evidenced by co-authorship. This is achieved with the help of text mining techniques, on the basis of the terms used in the respective documents. The affinities can be represented in the form of a network where nodes represent researchers' articles and links represent similarity. Each node can be characterized by various centrality measures. A community detection algorithm permits the identification of groups with similar articles. Each node is further characterized by a set of automatically discovered keywords.
This presentation provides more details about the methods adopted and/or developed, some of which were implemented in our prototype. The methods presented are general and applicable to many diverse domains; these can include documents describing R&D projects, legal documents, court cases or medical procedures. We believe this work could thus be useful to a relatively wide audience. We acknowledge the help of F. Silva and collaborators, who maintain the Authenticus bibliographic database.
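The affinity idea above reduces to a simple computation: represent each researcher by a bag of terms and link the pairs whose similarity exceeds a threshold. The term lists and the 0.5 threshold below are illustrative stand-ins, not the prototype's actual data or parameters.

```python
# Build an affinity network by linking researchers whose term profiles
# have cosine similarity above a threshold.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of terms."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

docs = {
    "r1": "data stream mining drift detection".split(),
    "r2": "stream mining concept drift".split(),
    "r3": "logic programming semantics".split(),
}
# Edges of the affinity network: similar pairs, each pair listed once.
edges = [(u, v) for u in docs for v in docs
         if u < v and cosine(docs[u], docs[v]) > 0.5]
```

Centrality measures and community detection would then be computed on this edge list, as described in the abstract.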
**********************************************
3rd Seminar, 28th October, 2014.
********************************************
1st Seminar, 24th September, 2014.
******************************************
Presenter: Alexandre Carvalho (PhD Researcher at the Laboratory for Artificial Intelligence and Decision Support - INESC Technology and Science (LIAAD - INESC TEC))
Title: First-Principle Simulation for Data-Driven Prediction
Abstract: Data simulation with benchmark problems is an important task for machine learning and data mining purposes. Generating a realistic controlled scenario allows one to design and test specific tasks, such as drift detection or fault detection. Using common benchmark datasets enables simple comparisons among algorithms and their performance. With first-principle simulated data, the behavior of real scenarios is approximated to the best of our scientific models. In this work we adapt and update a well-known benchmark process control problem, the Tennessee Eastman plant-wide industrial problem. The Tennessee Eastman (TE) plant-wide industrial process control problem was proposed as a test of alternative control and optimization strategies for continuous chemical processes. With a slow drift and several step process disturbances combined with random variation, the TE problem is suitable for a wide range of controlled scenario tests. We present the results of the multi-target prediction problem and compare the results of PLS, M5 and MTSMOTI.
********************************************************************************
Presenter: Fábio Pinto (a PhD student working with Carlos Soares (CESE - INESC TEC) and João Mendes-Moreira (LIAAD - INESC TEC))
Title: Pruning Bagging Ensembles with Metalearning
Abstract: Ensemble learning algorithms often benefit from pruning strategies that reduce the number of individual models and improve performance. In this work, we propose a metalearning method for pruning bagging ensembles. Our proposal differs from other pruning strategies in that it allows pruning the ensemble before actually generating the individual models. The method consists in generating a set of characteristics from the bootstrap samples and relating them to the impact of the predictive models in multiple tested combinations. We executed experiments with bagged ensembles of 20 and 100 decision trees on 53 UCI classification datasets. Results show that our method is competitive with a state-of-the-art pruning technique and with bagging, while using only 25% of the models.
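The key point of the method, characterising bootstrap samples before any model is trained, can be sketched in a few lines. The meta-features (uniqueness ratio, label entropy) and the rank-and-keep-25% rule below are our own illustrative stand-ins for the learned meta-model described in the abstract.

```python
# Rank bootstrap samples by cheap sample-level descriptors and keep
# only a fraction of them, pruning the rest before training any model.
import math
import random
from collections import Counter

def bootstrap(data, rng):
    """One bootstrap sample: draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def meta_features(sample):
    """Uniqueness ratio and label entropy of a sample of (x, label) pairs."""
    uniq = len(set(sample)) / len(sample)
    counts = Counter(label for _, label in sample)
    total = sum(counts.values())
    ent = -sum(c / total * math.log2(c / total) for c in counts.values())
    return uniq, ent

rng = random.Random(0)
data = [(i, i % 2) for i in range(40)]            # (feature, label) pairs
samples = [bootstrap(data, rng) for _ in range(20)]
# Keep the 25% of samples with the highest label entropy (most balanced).
ranked = sorted(samples, key=lambda s: meta_features(s)[1], reverse=True)
kept = ranked[: len(ranked) // 4]
```

In the actual method the ranking is not a fixed rule but is learned from the relation between such characteristics and the models' measured impact; the sketch only shows where the saving comes from, since 75% of the models are never built.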
Past Seminars (2013-2014)
*************************************
18th Seminar, 23rd July 2014.
*************************************
Presenter: Jose C. Principe
Title: A Cognitive Architecture for Object Recognition in Video
Abstract: This talk describes our efforts to abstract from the animal visual system the computational principles to explain images in video. We develop a hierarchical, distributed architecture of dynamical systems that self-organizes to explain the input imagery using an empirical Bayes criterion with sparseness constraints and dual state estimation. The interpretation of the images is mediated through causes that flow top down and change the priors for the bottom up processing. We will present preliminary results in several data sets.
Short Bio:
Jose C. Principe (M’83-SM’90-F’00) is a Distinguished Professor of Electrical and Computer Engineering and Biomedical Engineering at the University of Florida, where he teaches advanced signal processing, machine learning and artificial neural networks (ANNs) modeling. He is BellSouth Professor and the Founder and Director of the University of Florida Computational NeuroEngineering Laboratory (CNEL), www.cnel.ufl.edu. His primary area of interest is processing of time-varying signals with adaptive neural models. The CNEL Lab has been studying signal and pattern recognition principles based on information-theoretic criteria (entropy and mutual information). Dr. Principe is an IEEE Fellow. He is the past Chair of the Technical Committee on Neural Networks of the IEEE Signal Processing Society, Past President of the International Neural Network Society, and Past Editor-in-Chief of the IEEE Transactions on Biomedical Engineering. He is a member of the Advisory Board of the University of Florida Brain Institute. Dr. Principe has more than 600 publications. He has directed 81 Ph.D. dissertations and 65 Master's theses. In 2000 he wrote an interactive electronic book entitled “Neural and Adaptive Systems”, published by John Wiley and Sons, and more recently co-authored several books: “Brain Machine Interface Engineering” (Morgan and Claypool), “Information Theoretic Learning” (Springer), and “Kernel Adaptive Filtering” (Wiley).
*************************************
17th seminar, 1st July 2014
*************************************
Presenter: Peter Clark
Senior Research Manager at the Allen Institute for Artificial Intelligence
Title: From Information Retrieval towards Knowledgeable Machines
Abstract:
At some point in the future, we will have knowledgeable machines - machines that contain internal models of the world and can answer questions, explain those answers, and dialog about them. A substantial amount of that knowledge will likely come from machine reading, whereby internal representations are synthesized from textual information. We are exploring this at the new Allen Institute for Artificial Intelligence (AI2) with a medium-term focus on having the computer pass fourth-grade science tests, with much of that knowledge acquired semi-automatically from text. In this presentation I will outline our picture of such a system and summarize some of our early research, in particular explorations in direct "reading" of rules from texts (e.g., study guides), and how we are seeking to go beyond the limits of information retrieval for question-answering.
BIO: Peter Clark is the Senior Research Manager for AI2. His work focuses upon natural language processing, machine reasoning, and large knowledge bases, and the interplay between these three areas. He has received several awards, including an AAAI Best Paper award (1997), a Boeing Associate Technical Fellowship (2004), and AAAI Senior Member status (2014). He received his Ph.D. in Computer Science in 1991, and has researched these topics for 30 years, with more than 80 refereed publications and over 5000 citations.
*********************************************************************
Presenter: Cesar Guevara
Title: Development of Efficient Algorithms for the Detection of Intruders and Data Leaks in Computer Systems
Abstract:
Detection and control of intruders, data leakage or unauthorized access has always been important when dealing with information systems where security, integrity and privacy are key issues. Although computer devices are more sophisticated and efficient, there is still a need to establish safety procedures to avoid illegitimate accesses. The purpose of this work is to show how different intelligent techniques can be used to create new algorithms that identify users accessing critical information and check whether or not access is allowed. Advanced and intelligent analysis and data mining techniques, such as decision trees and artificial neural networks, have been applied to obtain patterns of users' behavior, yielding dynamic user profiles. The main contribution of this work is to show effective solutions for the detection of intruders and data leakage in computer information systems.
*************************************
16th Seminar, 25th June 2014.
*************************************
Presenter: Conceição Rocha
Title: Data Assimilation: contributions to modeling, prediction and control in anesthesia
Abstract:
During surgical interventions a muscle relaxant drug is frequently administered with the objective of inducing muscle paralysis. This work aims at contributing to personalized anesthetic drug administration during surgery. In fact, personalization is one of the aims of P4 (Predictive, Preventive, Personalized and Participatory) medicine, which is the modern trend in health care. Furthermore, the clinical environment and patient safety issues lead to a huge variety of situations that must be taken into account, requiring intensive simulation studies. Hence, population models are crucial for research and development in this field. In this work, we develop two models - a stochastic population model for the muscle paralysis level induced by atracurium, and an online robust model to predict the maintenance dose of atracurium necessary for the desired effect. We also address the problem of joint estimation of the state and parameters of a deterministic continuous-time system with discrete-time observations, in which the parameter vector is constant but its value is not known, being a random variable with a known distribution.
*********************************************************************
Presenter: Sónia Dias
Title: Linear regression with empirical distributions
Professor at the Polytechnic Institute of Viana do Castelo
Abstract:
In the classical data framework, one numerical value or one category is associated with each individual (microdata). However, the interest of many studies lies in groups of records gathered according to characteristics of the individuals or classes of individuals, leading to macrodata. The classical solution for these situations is to associate with each individual or class of individuals a central measure, e.g., the mean or the mode of the corresponding records; however, with this option the variability across the records is lost. For such situations, Symbolic Data Analysis proposes that a distribution or an interval of the individual records' values be associated with each unit, thereby considering new variable types, named symbolic variables. One such type of symbolic variable is the histogram-valued variable, where each entity under analysis corresponds to an empirical distribution that can be represented by a histogram or a quantile function. If, for all observations, each unit takes values on only one interval with weight equal to one, the histogram-valued variable reduces to the particular case of an interval-valued variable. In either case, a Uniform distribution is assumed within the considered intervals. Accordingly, it is necessary to adapt concepts and methods of classical statistics to the new kinds of variables. The functional linear relations between histogram-valued or between interval-valued variables cannot be a simple adaptation of the classical regression model. In this presentation, new linear regression models for histogram data and interval data are presented. These new Distribution and Symmetric Distributions Regression Models allow predicting distributions/intervals, represented by their quantile functions, from the distributions/intervals of the explanatory variables. To determine the parameters of the models it is necessary to solve quadratic optimization problems subject to non-negativity constraints on the unknowns.
To define the minimization problems and to compute the error measure between the predicted and observed distributions, the Mallows distance is used. As in classical analysis, it is possible to deduce a goodness-of-fit measure from the models whose values range between 0 and 1. Examples on real data as well as simulated experiments illustrate the behavior of the proposed models and the goodness-of-fit measure. These studies indicate a good performance of the proposed methods and of the respective coefficients of determination.
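The Mallows distance used above as the error measure can be sketched directly from its definition as an L2 distance between quantile functions. The empirical quantile construction and the discretisation grid below are our own choices for a self-contained toy.

```python
# Mallows (L2 Wasserstein) distance between two empirical distributions,
# approximated on a uniform grid of probability levels.

def quantile_fn(values):
    """Empirical quantile function of a sample."""
    xs = sorted(values)
    def q(p):
        idx = min(int(p * len(xs)), len(xs) - 1)
        return xs[idx]
    return q

def mallows(values_a, values_b, grid=100):
    """sqrt of the mean squared difference of the two quantile functions."""
    qa, qb = quantile_fn(values_a), quantile_fn(values_b)
    ps = [(i + 0.5) / grid for i in range(grid)]
    return (sum((qa(p) - qb(p)) ** 2 for p in ps) / grid) ** 0.5

d_same = mallows([1, 2, 3, 4], [1, 2, 3, 4])    # identical samples
d_shift = mallows([1, 2, 3, 4], [3, 4, 5, 6])   # same shape, shifted by 2
```

Because the distance compares quantile functions pointwise, a pure location shift of the distribution shows up exactly as the size of the shift, which makes it a natural fit for regression on distributions represented by quantile functions.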
Homepage: http://www.estg.ipvc.pt/~sdias
*************************************
15th Seminar, 16th June 2014.
*************************************
Presenter: Pascal Poncelet
Title: Towards a Unifying Approach for Extracting Trajectories
Abstract:
Recent improvements in positioning technology have led to a much wider availability of massive moving object data. A crucial task is to find the moving objects that travel together; usually, these are called spatio-temporal patterns. Due to the emergence of many different kinds of spatio-temporal patterns in recent years, different approaches have been proposed to extract them. However, each approach only focuses on mining a specific kind of pattern. In addition to being a painstaking task due to the large number of algorithms used to mine and manage patterns, it is also time-consuming. Additionally, we have to execute these algorithms again whenever new data are added to the existing database. In this talk I will present a unifying approach, named GeT Move, which uses a frequent closed itemset-based spatio-temporal pattern-mining algorithm to mine and manage different spatio-temporal patterns. GeT Move is implemented in two versions: GeT Move and Incremental GeT Move. Furthermore, I will address how trajectories can be used in other kinds of domains.
Homepage: http://www.lirmm.fr/~poncelet/indexEN.html
*************************************
The 14th seminar, 11th June 2014.
*************************************
Presenter: Leandro Nunes de Castro
Title: Natural Computing: Concepts and Applications
Abstract: Computation can be viewed in three different contexts within Natural Computing: the solving of complex problems; the synthesis of natural phenomena; and the search for new raw materials with which to compute. In all cases, an adequate understanding of the natural phenomenon is the basis for new ideas and for understanding how the computation is performed. This understanding is usually obtained through models, for example of planetary dynamics, immunology, chemical reaction networks, bacteria, or species diversity, among many others. These models have become so important for the understanding of nature that a new branch of natural computing was proposed to incorporate the computational modelling of natural phenomena. This presentation gives a general introduction to the area, highlighting the main research directions of the Natural Computing Laboratory (LCoN) at Mackenzie University, SP, Brazil. Case studies in social media data analysis, logistics and other areas will be presented.
CV: http://buscatextual.cnpq.br/buscatextual/visualizacv.do?metodo=apresentar&id=K4769993T4
**************************************
Presenter: Vinícius M. A. de Souza
Title: How can Artificial Intelligence contribute to the fight against disease-vector insects and agricultural pests?
Abstract: Throughout human history, insects have had a strong relationship with people's well-being, in both positive and negative ways. Insects are vectors of diseases that kill millions of people every year and, at the same time, are responsible for the pollination of a large part of the world's food production. For these reasons, many researchers have developed an arsenal of insect control methods with the goal of reducing the presence of harmful species with minimal impact on beneficial species. This seminar will discuss how the field of Artificial Intelligence can contribute to the fight against disease-vector insects and agricultural pests. More specifically, it will present the goals and challenges of a project for a low-cost laser sensor capable of counting and classifying insect species using Machine Learning algorithms and Digital Signal Processing methods.
CV: http://lattes.cnpq.br/6394929576717854
*************************************
The 13th seminar, 3rd June, 2014
*************************************
Presenter: Aljaž Osojnik
PhD student working with João Gama
Title: Learning models for structured output prediction from data streams
Abstract: Nowadays, data is generated at ever-increasing rates and uses more and more complex data structures. We present the problem of online structured output prediction; namely, we describe the online data stream mining approach and the structured output prediction problem. We describe several issues that arise in online structured output prediction, i.e., evaluation, change detection and resource complexity. We focus on the structured output prediction tasks of multi-label classification, multi-target regression and hierarchical multi-label classification. We provide an overview of the current research in the areas of batch and online methods for these tasks, as well as some of the evaluation metrics used in these cases. We conclude with a discussion of directions for further work on improving existing multi-target regression methods and how those can be applied to the tasks of multi-target classification and multi-label classification, as well as adapting current batch hierarchical multi-label classification methods to the online setting.
**************************************
Presenter: Carlos Ferreira
PhD student working with João Gama
Title: Exploring Temporal Patterns from Multi-relational Databases
Abstract: Multi-relational databases are widely used to represent and store data. Often, a multi-relational database is composed of tables recording static data, which do not change over time, and tables recording dynamic data, which is accumulated over time. Finding temporal patterns in such temporal databases is an important challenge in domains as diverse as video processing, computational biology and elderly monitoring. The main goal of this work is to study methods and techniques to explore the temporal information available in such multi-relational databases, mainly to find rich patterns and learn highly expressive classification theories. In particular, we explore temporal information using either propositional or first-order logic sequence miners. Moreover, we employ propositionalization and predicate invention techniques to learn either propositional or FOL theories.
*************************************
The 12th seminar, 16th May, 2014
*************************************
Presenter: Hadi Fanaee Tork
PhD student working with João Gama
Title: Event labeling combining ensemble detectors and background knowledge
Abstract: Event labeling is the process of marking events in unlabeled data. Traditionally, this is done by involving one or more human experts through an expensive and time-consuming task. In this presentation we propose a new event labeling model relying on an ensemble of detectors and background knowledge. The target data are the usage log of a real bike sharing system. We first label events in the data and then evaluate the performance of the ensemble and individual detectors on the labeled data set using ROC analysis and static evaluation metrics in the absence and presence of background knowledge. The results show that when there is no access to human experts, the proposed approach can be an effective alternative for labeling events.
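The core mechanism, several detectors voting on which points are events, can be sketched with simple z-score detectors at different sensitivities. These threshold detectors and the toy usage series are our own illustrations, not the detectors or data of the paper.

```python
# Label events by majority vote of several outlier detectors.

def zscore_detector(series, k):
    """Flag points more than k standard deviations from the mean."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5
    return [abs(x - mean) > k * std for x in series]

def ensemble_label(series, ks=(1.5, 2.0, 2.5)):
    """An index is labeled as an event when most detectors agree."""
    votes = [zscore_detector(series, k) for k in ks]
    return [sum(v[i] for v in votes) > len(ks) / 2
            for i in range(len(series))]

usage = [10, 11, 9, 10, 80, 10, 12, 9]   # a usage-like series with one spike
labels = ensemble_label(usage)
```

Background knowledge (e.g., known holidays or weather events in the bike-sharing log) would then be used to confirm or veto the ensemble's candidate labels.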
Paper, Data set
**************************************
Presenter: Mohammadreza Valizadeh
PhD student working with Pavel Brazdil
Title: Improving the Performance of Text Information Retrieval (IR) Systems
Abstract: This thesis focuses on two major issues: re-ranking and summarization. We have proposed a new method for re-ranking based on a query-sensitive similarity measure. After re-ranking, the retrieved documents can be summarized. We have proposed several methods for summarizing multiple documents: one unsupervised (unsupervised graph-based summarization) and two supervised (a user-based method and an ensemble method combined with actor-object relationships).
****************************************
11th Seminar, April 30, 2014
*****************************************
Presenter: Vânia Almeida
Senior researcher at LIAAD - INESC TEC;
Title: Collaborative Wind Power Forecast
Abstract: Wind power is considered one of the most rapidly growing sources of electricity generation all over the world. This talk presents a new collaborative forecasting framework for wind power that uses information from distributed neighbouring wind farms, so that the prediction at each wind farm is based on data from different locations. The experiments are based on real wind power measurements from 16 wind farms. The scope is short-term wind power forecasting (six hours ahead) using Auto-Regressive Integrated Moving Average (ARIMA) models. The problem was addressed in two main steps: 1) search for motifs using the Symbolic Aggregate approXimation (SAX) representation, and 2) construction of the correlation network, the desired output being a decrease of the root mean square error (RMSE), taking as reference the models using only data from each farm.
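The SAX step mentioned above can be sketched as follows (a minimal illustration; the series, segment count and alphabet size are example choices, though the breakpoints follow the standard SAX table for an alphabet of size 4):

```python
import math

# Sketch of SAX: z-normalise a series, reduce it with piecewise aggregate
# approximation (PAA), then map each segment mean to a symbol using the
# standard Gaussian breakpoints for an alphabet of size 4.

BREAKPOINTS = [-0.67, 0.0, 0.67]  # alphabet size 4: a, b, c, d
ALPHABET = "abcd"

def sax(series, n_segments):
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    z = [(x - mean) / std for x in series]
    seg_len = len(z) // n_segments
    word = ""
    for i in range(n_segments):
        seg = z[i * seg_len:(i + 1) * seg_len]
        paa = sum(seg) / len(seg)
        idx = sum(paa > b for b in BREAKPOINTS)
        word += ALPHABET[idx]
    return word

# Toy wind power series with alternating low/high regimes
power = [1, 2, 3, 10, 11, 12, 3, 2, 1, 10, 12, 11]
print(sax(power, 4))  # -> "adad"
```

Repeated SAX words across neighbouring farms are then candidate motifs for building the correlation network.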
**************************************
Presenter: João Cordeiro
Researcher at LIAAD - INESC TEC and a Professor at the Department of Informatics of the University of Beira Interior.
Title: Learning Sentence Reduction Rules from the Wilderness
Abstract: Sentence Compression has recently received great attention from the research community of Automatic Text Summarization (ATS). Sentence Reduction consists of eliminating sentence components, such as words, part-of-speech tag sequences or chunks, without severely degrading the information contained in the sentence or its grammatical correctness. In this presentation I will start by making a quick and broad overview of the field of ATS, followed by a more detailed explanation of our work in the subfield of Sentence Compression. In particular, I will present an unsupervised, scalable methodology for learning sentence reduction rules. First, paraphrases are discovered within a collection of automatically crawled Web news stories and then textually aligned in order to extract interchangeable text fragment candidates, in particular reduction cases. As only positive examples exist, Inductive Logic Programming (ILP) provides an interesting learning paradigm for the extraction of sentence reduction rules. Consequently, reduction cases are transformed into first-order logic clauses to supply a massive set of suitable learning instances, and an ILP learning environment is defined within the context of the Aleph framework. Experiments evidence good results in terms of irrelevancy elimination, syntactical correctness and reduction rate in a real-world environment, as opposed to other methodologies proposed so far.
*************************************
10th Seminar, 16th April 2014, LIAAD Main Auditorium
************************************
Presenter: Dalila B.M.M. Fontes
Title: Scheduling Projects with alternative tasks subject to technical failure
Abstract: Nowadays, organizations are often faced with the development of complex and innovative projects. This type of project often involves performing tasks which are subject to failure. Thus, in many such projects several possible alternative actions are considered and performed simultaneously. Each alternative is characterized by cost, duration, and probability of technical success. The cost of each alternative is paid at the beginning of the alternative and the project payoff is obtained whenever an alternative has been completed successfully. For this problem one wishes to find the optimal schedule, i.e. the starting time of each alternative, such that the expected net present value is maximized. This problem was recently proposed by Ranjbar and Davari (2013), where a branch-and-bound approach is reported. Here we propose to solve the problem using dynamic programming.
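For intuition only (a simplified sequential variant, not the dynamic program of the talk, where alternatives may run simultaneously), the expected net present value of trying risky alternatives one after another can be computed as:

```python
# Illustrative sketch: expected net present value of trying risky
# alternatives in sequence. Each alternative has a cost paid at its
# start, a duration, and a probability of technical success; the project
# payoff is received on the first success.

def expected_npv(alternatives, payoff, rate=0.1):
    """alternatives: list of (cost, duration, success_prob) tried in order."""
    t = 0.0
    p_reach = 1.0  # probability this alternative is still needed
    npv = 0.0
    for cost, duration, p in alternatives:
        npv -= p_reach * cost * (1 + rate) ** -t     # pay cost at start
        t += duration
        npv += p_reach * p * payoff * (1 + rate) ** -t  # payoff on success
        p_reach *= (1 - p)
    return npv

alts = [(10.0, 1, 0.5), (8.0, 2, 0.7)]
print(round(expected_npv(alts, payoff=100.0), 2))
```

The dynamic program would instead choose starting times for all alternatives jointly so as to maximize this expectation.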
**************************************
Presenter: Alberto Adrego Pinto
Title: Price competition in the Hotelling model with uncertainty on costs
Abstract: This work develops a theoretical framework to study price competition in a Hotelling-type network game, extending the Hotelling model of price competition with linear transportation costs from a line to a network. Under explicit conditions on the production costs and road lengths we show the existence of a pure Nash price equilibrium. Furthermore, we introduce incomplete information in the production costs of the firms and we find the Bayesian-Nash price equilibrium.
*************************************
9th Seminar, 26 March 2014, LIAAD meeting room
*************************************
Presenter: José Fernando Gonçalves
Title: A biased random-key genetic algorithm for the Minimization of Open Stacks Problem
Abstract: This presentation describes a biased random-key genetic algorithm (BRKGA) for the Minimization of Open Stacks Problem (MOSP). The MOSP arises in a production system scenario, and consists of determining a sequence of cutting patterns that minimizes the maximum number of open stacks during the cutting process. The proposed approach combines a BRKGA and a local search procedure for generating the sequence of cutting patterns. A novel fitness function for evaluating the quality of the solutions is also developed. Computational tests are presented using available instances taken from the literature. The high quality of the solutions obtained validates the proposed approach.
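The random-key decoding and the open-stacks fitness evaluation can be sketched as follows (a hypothetical illustration, not the authors' implementation):

```python
# Sketch of the BRKGA decoder idea: a chromosome of random keys in [0, 1)
# is decoded into a pattern sequence by sorting, and fitness is the
# maximum number of stacks simultaneously open while cutting that sequence.

def decode(keys):
    """Sort pattern indices by their random keys to get a sequence."""
    return sorted(range(len(keys)), key=lambda i: keys[i])

def max_open_stacks(sequence, patterns):
    """patterns[p] is the set of piece types produced by pattern p.
    A piece's stack is open from the first to the last pattern using it."""
    first, last = {}, {}
    for pos, p in enumerate(sequence):
        for piece in patterns[p]:
            first.setdefault(piece, pos)
            last[piece] = pos
    peak = 0
    for pos in range(len(sequence)):
        open_now = sum(1 for piece in first
                       if first[piece] <= pos <= last[piece])
        peak = max(peak, open_now)
    return peak

patterns = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}]
seq = decode([0.42, 0.17, 0.86, 0.05])  # -> pattern order [3, 1, 0, 2]
print(max_open_stacks(seq, patterns))
```

The genetic algorithm then evolves the key vectors, with crossover biased toward elite solutions, leaving all problem-specific logic inside the decoder.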
Supported by Fundação para a Ciência e Tecnologia (FCT) through project PTDC/EGE-GES/117692/2010.
Keywords: Minimization of Open Stacks Problem, Cutting Pattern, Biased Random-Key Genetic Algorithm, Random Keys.
**************************************
Presenter: Paula Brito
Title: Multivariate Analysis of Distributional Data
Abstract: In statistics and multivariate data analysis, the units under analysis are usually single elements described by numerical and/or categorical variables, each element taking one value for each variable. However, the data under analysis may not be single observations, but groups of units gathered on the basis of common properties, or observed repeatedly over time, or concepts described as such – therefore the observed values present variability. In such situations, data are usually reduced to central statistics, leading to the loss of important information. Symbolic Data Analysis provides a framework where the observed variability is considered in the data representation. To describe groups of individuals or concepts, new methods are developed and new variable types are introduced, which may now assume other forms of realization (e.g., sets, intervals, or distributions for each entity) that take into account the intrinsic data variability. In this talk, we consider the case where individual observations are summarized by distributions, and recall some methods that have been developed to analyse such data. In particular, we shall focus on clustering methodologies.
Keywords: complex data, distribution data, histogram-valued variables, symbolic data
**************************************
8th Seminar, February 26, 2014 INESC Porto Main Auditorium
*************************************
Presenter: Ricardo Bessa, USE
Senior researcher at INESC TEC in its Power Systems Unit
Title: Spatial-Temporal Solar Power Forecasting for Smart Grids
Abstract: Solar power penetration in distribution grids has been growing fast over the last years, particularly at the low voltage (LV) level, which introduces new challenges in operating distribution grids. Across the world, Distribution System Operators (DSO) are developing the Smart Grid concept, and one key tool for this new paradigm is solar power forecasting. This talk presents a new spatial-temporal forecasting framework, based on the vector auto-regression framework, which combines observations of solar generation collected by smart meters and distribution transformer controllers. The scope is six-hour-ahead deterministic and probabilistic forecasts at the residential solar photovoltaic and MV/LV substation levels. This framework has been tested in the Smart Grid pilot of Évora, Portugal, using data from 44 micro-generation units and 10 MV/LV substations. A benchmark comparison was made with the autoregressive forecasting framework (AR, a univariate model).
***********************************
Presenter: Fabien Gouyon
Senior researcher at INESC TEC, UTM, leading the Sound and Music Computing research group
Title: Evaluating the evaluation, the case of music classification
Abstract: In this talk, I take a critical viewpoint on the validity of current approaches to evaluation in Music Information Retrieval research, and in particular music classification. Experiments using three state-of-the-art approaches to building music classification systems crossed with three different datasets show that performance measured by the standard approach to evaluation is not valid for concluding whether a music classification system is objectively good, or better than another. I am particularly interested in opening discussion to evaluation issues in machine learning.
***********************************
7th Seminar, 14th February, 2014
***********************************
Presenter: Diego Marron,
A student working with Albert Bifet at Yahoo! Research
Title: GPU Random Forests and Decision Trees for Evolving Big Data Streams
Abstract: Web companies have an increasing need for more and more computation power to effectively analyze big data streams in real time and extract useful information. Most of these data are short-lived and evolve with time. Big data stream analysis is usually done in clusters that, due to the increasing demand for computation power, are growing in size. This situation brings the opportunity to explore new ways to achieve better performance with fewer resources. One option is to use GPUs to process evolving big data streams. GPUs are throughput-oriented, massively parallel architectures providing very attractive performance boosts.
In this thesis we present an implementation of a Random Forest ensemble using random Very Fast Decision Trees on the GPU. The results are compared to two well-known machine learning frameworks, VFML and MOA, achieving speedups on the GPU of at least 300x with similar accuracy. In our tests we used only one GPU for evaluation, which is also cheaper to use and maintain than a cluster. Moreover, we minimized communication between CPU and GPU to only two transfers per batch: one from the CPU to the GPU to send the data to process, and a second in the opposite direction to get the final result.
***********************************
6th Seminar, 29th January, 2014
***********************************
Presenter: Nuno Escudeiro
Title: Active learning: when to stop querying?
Abstract:
The main goal in Active Learning (AL) is to select an accurate hypothesis from the version space at low cost, i.e., while requiring as few queries as possible. Asking the oracle to label more instances than necessary has a negative impact on the performance (and cost) of the learning process. From this point of view, knowing when to stop might be as relevant as having a good query selection strategy. The AL process should be stopped when the utility of new queries degrades below a given threshold and model quality stops improving. Specifying this utility, and the critical threshold, is task-dependent. Some simple stopping criteria, such as exhausting the unlabeled set or predefining a desired size for the training set according to the available budget, are obvious but neither take efficiency concerns into account nor ensure that the resulting learner is accurate enough for the task at hand. When the goal is to reduce the cost of the learning process, as in our case, it is important to analyze whether the most informative instance is still valuable enough; the utility of the queries should compensate for their cost. Therefore, querying the oracle should stop once the cost of querying overcomes the utility of the unlabeled instances still remaining in the working set. In this talk we discuss three base stopping criteria -- classification gradient, steady entropy mean and steady entropy distribution -- plus two hybrid criteria that aggregate the former in a specific way so as to improve over their foundational criteria.
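One of these criteria, the steady entropy mean, can be sketched roughly as follows (window size and tolerance are illustrative choices, not values from the talk):

```python
import math

# Hedged sketch of a steady-entropy-mean stopping rule: track the mean
# prediction entropy over the unlabeled pool after each round of
# querying, and stop when it stabilises.

def entropy(probs):
    """Shannon entropy of one instance's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop(entropy_history, window=3, tol=1e-3):
    """Stop when the mean pool entropy changed less than tol over the
    last `window` active-learning rounds."""
    if len(entropy_history) < window + 1:
        return False
    recent = entropy_history[-(window + 1):]
    return max(recent) - min(recent) < tol

# Mean pool entropy (from entropy()) recorded after each querying round
history = [0.69, 0.52, 0.300, 0.2995, 0.2993, 0.2991]
print(should_stop(history))
```

Once the mean entropy plateaus, further queries are unlikely to change the learned model enough to repay their labelling cost.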
***********************************
Presenter: Márcia Oliveira
Title: Visualizing Evolving Social Networks using Node-level and Community-level Trajectories
Abstract: Visualization of static social networks is a mature research field in information visualization. Conventional approaches rely on node-link diagrams which provide a representation of the network topology by representing nodes as points and links between them as lines. However, the increasing availability of longitudinal network data has spurred interest in visualization techniques that go beyond the static node-link representation of a network. In temporal settings, the focus is on the network dynamics at different levels of analysis (e.g. nodes, communities, whole network). Yet, the development of visualizations that are able to provide actionable insights into the different types of changes occurring on the network, and their impact on both the neighbourhood and the overall network structure, is a challenging task. This work attempts to tackle this challenge by proposing a methodology for tracking the evolution of dynamic social networks, at both the node level and the community level, based on the concept of temporal trajectory. We resort to third-order tensors to represent evolving social networks and we further decompose them using a Tucker3 model. The two most representative components of this model define the 2D space where the trajectories of social entities are projected. To illustrate the proposed methodology we conduct a case study using a set of temporal self-reported friendship networks.
***********************************
5th Seminar, 10th January, 2014
***********************************
Presenter: Albert Bifet
Title: Mining Big Data in Real Time
Abstract:
Big Data is a new term used to identify datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data. In this talk, we will focus on advanced techniques for mining Big Data in real time using evolving data stream techniques. We will present the MOA software framework, with classification, regression, and frequent pattern methods, and the new SAMOA distributed streaming software.
************************************
Presenter: Brett Drury
Title: Creating Bayesian Networks from Text
Abstract: Bayesian networks can represent knowledge and make inferences in complex domains, but their construction is not easy. On the other hand, much of human knowledge is in texts (newspapers, articles, etc.), and with the advent of the Internet, access to these texts has become easy. Consequently, strategies to automatically create Bayesian networks for complex domains from information in texts have become an area of current and relevant research. This presentation will discuss methods for constructing Bayesian networks from information in texts.
***********************************
4th Seminar, 17th of December, 2013.
***********************************
Presenter: Pavel Brazdil, Rui Leite and Carlos Soares
Title: Metalearning & Algorithm Selection
Abstract: First we present the motivation for this work. As the number of possible algorithms increases, the user is faced with the problem of algorithm selection. This problem arises in many different domains, from classification, regression and other subareas of machine learning and data mining to optimization and satisfiability. We describe how meta-learning can be used to aid the user in selecting the appropriate algorithm for a given problem. The seminar will cover both standard methods based on static meta-level characteristics and more recent approaches that exploit experimentation in order to present the user with a viable suggestion. In this context we present a rather “mysterious” formula that permits estimating the success of alternative solutions, which has led to very good experimental results.
In the second part of this talk we will elucidate how this work can be generalized to help the user conceive successful workflows of operations in data mining. Finally, we will explain how the techniques can be reused in other domains, including e.g. optimization problems and satisfiability, and also who is currently working on which problem.
************************************
Presenter: Raquel Sebastião,
PhD student working with João Gama
Title: Learning from Data Streams: Synopsis and Change Detection
Abstract: The emergence of real temporal applications in non-stationary scenarios has drastically altered our ability to generate and gather information. Nowadays, potentially unbounded and massive amounts of information are generated at a high rate, known as data streams. Therefore, it is unreasonable to assume that processing algorithms have sufficient memory capacity to store the complete history of the stream. Indeed, stream learning algorithms must process data promptly and discard it immediately. Along with this, as data flow continuously for long periods of time, the process generating the data is not strictly stationary and evolves over time.
This presentation embraces concerns raised when learning from data streams. Namely, concerns raised by the intrinsic characteristics of data streams and by the learning process itself. The former is addressed through the construction of synopses structures of data and change detection methods. The latter is related to the appropriate evaluation of stream learning algorithms.
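A classic change detection method for streams, the Page-Hinkley test, illustrates the kind of technique involved (a minimal sketch with example parameters; not necessarily the specific methods proposed in the talk):

```python
# Illustrative sketch of the Page-Hinkley change detection test, which
# monitors the cumulative deviation of a stream from its running mean
# and raises an alarm when it drifts too far from its historical minimum.

class PageHinkley:
    def __init__(self, delta=0.005, threshold=1.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold
        self.mean = 0.0
        self.n = 0
        self.cum = 0.0              # cumulative deviation from the mean
        self.min_cum = 0.0          # minimum of the cumulative deviation

    def update(self, x):
        """Feed one observation; return True if a change is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold

ph = PageHinkley()
stream = [0.1, 0.12, 0.09, 0.11, 0.1, 0.9, 0.95, 0.92, 0.93]
alarms = [t for t, x in enumerate(stream) if ph.update(x)]
print(alarms)  # change flagged shortly after the shift at t = 5
```

Only a handful of counters are kept, respecting the bounded-memory constraint that stream processing imposes.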
***********************************
3rd Seminar, 27th November, 2013.
***********************************
Presenter: Luís Matias,
PhD student working with João Mendes Moreira and João Gama
Title: On Predicting the Taxi-Passenger Demand: A Real-Time Approach
Abstract: Informed driving is increasingly becoming a key feature for increasing the sustainability of taxi companies. The sensors installed in each vehicle provide new opportunities to automatically discover knowledge, which in turn delivers information for real-time decision-making. Intelligent transportation systems for taxi dispatching and for finding time-saving routes are already exploring these sensing data. This paper introduces a novel methodology to predict the spatial distribution of taxi-passenger demand for a short-term time horizon using streaming data. Firstly, the information is aggregated into a histogram time series. Then, three time series forecasting techniques are combined to originate a prediction. Such techniques are able to learn in real time due to their incremental characteristics, so they easily react to bursty or unexpected events. Experimental tests were conducted using the online data transmitted by 441 vehicles of a fleet running in the city of Porto, Portugal. The results demonstrate that the proposed framework can provide effective insight into the spatio-temporal distribution of taxi-passenger demand for a 30-minute horizon.
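As a toy illustration of combining incremental forecasters (not the actual techniques used in the paper), predictions can be weighted by the inverse of each model's recent error so the ensemble adapts online:

```python
# Hypothetical sketch: combine several forecasters with weights inversely
# proportional to their recent error, so the combination shifts toward
# whichever model has been most accurate lately.

def combine(forecasts, recent_errors, eps=1e-9):
    """Weight each model's forecast by the inverse of its recent error."""
    weights = [1.0 / (e + eps) for e in recent_errors]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total

# Three models predict taxi demand in one zone for the next 30 minutes
forecasts = [12.0, 18.0, 15.0]
recent_errors = [2.0, 6.0, 3.0]  # e.g. mean absolute error on past periods
print(round(combine(forecasts, recent_errors), 2))
```

Because the weights are recomputed from a sliding record of errors, a model that degrades after a bursty event loses influence immediately.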
************************************
Presenter: Pedro Abreu,
Collaborator of João Mendes Moreira
Title: A Recommender System Applied to a Soccer Environment
Abstract: Collaborative filtering techniques have been used almost exclusively in Internet environments over the years, helping users find items they are expected to like, something that is equivalent to finding the same kind of books in a bookstore. Normally, these techniques use the past purchases of the users in order to provide recommendations. With this concept in mind, this research used a collaborative technique to automatically improve the performance of a robotic soccer team. Many studies have attempted to address this problem over the last years. However, these studies have always presented drawbacks in terms of the improvement of the soccer team. Using a collaborative filtering technique based on nearest neighbors and the FC Portugal team as the test subject, matches were simulated between three different teams (performing much better, better and worse from the perspective of FC Portugal). The strategy of FC Portugal was to combine set plays and team formation. The performance of the FC Portugal team improved between 32% and 377%, and these results are quite promising. In the future, this kind of approach will be expanded to other robotic soccer situations, such as the 3D simulation league.
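A minimal sketch of user-based collaborative filtering with cosine-similarity nearest neighbors, the general family of technique referred to above (the mapping to teams and set plays is the paper's own and is not reproduced here):

```python
import math

# Toy user-based collaborative filtering: predict an unknown rating as
# the similarity-weighted average of the neighbors' ratings of that item.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict(target, neighbors, item):
    """Predict target's rating of `item` from its nearest neighbors."""
    sims = [(cosine(target["ratings"], n["ratings"]), n) for n in neighbors]
    num = sum(s * n["ratings"][item] for s, n in sims)
    den = sum(abs(s) for s, _ in sims)
    return num / den

target = {"ratings": [5, 3, 0]}  # rating for item 2 is unknown
neighbors = [{"ratings": [4, 2, 4]}, {"ratings": [5, 4, 2]}]
print(round(predict(target, neighbors, 2), 2))
```

In the soccer setting, "users" would correspond to game situations or opponents and "items" to strategic choices, with past match outcomes playing the role of ratings.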
***********************************
2nd Seminar, 30th October, 2013.
***********************************
Presenters: Andre Dias and Pedro Campos
Title: Agent-based modeling in Economics and Management with NetLogo
(joint work with Pavel Brazdil, André Dias and Pedro Amaro).
Abstract: Models in Economics usually assume market equilibrium and constant individual preferences. When these assumptions do not hold, the analysis of mixed levels (individual cognitive level and social level) may constitute the answer to model building. Agent-based modeling and simulation are definitely important techniques aimed at understanding social phenomena. In this talk we focus on two recent typical applications of agent-based modeling in Economics and Management: (i) the process of creating new ventures; and (ii) systemic risk in banking networks. The models have been implemented using NetLogo, an agent-based simulation tool. Model (i) approaches the process of creating new ventures, which unfolds in two main phases: the identification of the business opportunity by the entrepreneur, considering the various factors that influence the entrepreneurial attitude, and the development of the business opportunity. In model (ii), a network of banking relationships in the inter-banking market is created. The goal consists of verifying the existence of tipping points in systemic risk in the banking network, as we have observed in recent financial crises.
************************************
Presenter: Carlos Sáez,
PhD student working with Pedro Rodrigues
Title: Metrics and methods for biomedical data quality assessment
Abstract: Biomedical data require a sufficient level of quality for their reuse. Optimally, researchers would expect stable, problem-free datasets. However, because the data were not originally collected for reuse, they generally do not meet these expectations. Additionally, biomedical data are generally multi-dimensional, contain variables of multiple types and modalities, and are generated over time and from multiple sources, characteristics which may complicate the process of assessing their quality.
In this presentation we will first introduce the framework under development for biomedical data quality assessment, considering the aforementioned characteristics. It is based on a set of metrics and methods grounded in data quality dimensions. Then, we will describe the development of a metric to measure the spatial stability among data sources, based on a simplicial projection of probability distribution distances. Finally, we will show the current proposals for temporal data quality assessment methods.
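As one plausible building block (an assumption for illustration; not necessarily the distance used in the framework), a probability-distribution distance between data sources, such as the Jensen-Shannon divergence, could be computed as:

```python
import math

# Illustrative sketch: compare the distribution of a variable across two
# data sources with the Jensen-Shannon divergence (0 = identical
# distributions, 1 = maximally different, in bits).

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Distribution of a categorical variable at two hypothetical sources
source_a = [0.7, 0.2, 0.1]
source_b = [0.1, 0.2, 0.7]
print(round(js_divergence(source_a, source_b), 3))
```

Pairwise distances of this kind between all sources could then be projected (e.g. onto a simplex) to visualise which sources are stable and which are outliers.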
******************************************
1st Seminar, 16th October, 2013
******************************************
Presenter: Alípio Jorge
Title: Classifying Heart Sounds using Multiresolution Time Series Motifs
Abstract: The aim of this work is to describe an exploratory study on the use of a SAX-based Multiresolution Motif Discovery method for Heart Sound Classification. The idea of our work is to discover relevant frequent motifs in the audio signals and use the discovered motifs and their frequencies as characterizing attributes. We also describe different configurations of motif discovery for defining attributes and compare the use of a decision tree based algorithm with random forests on this kind of data. Experiments were performed with a dataset obtained from a clinical trial in hospitals using the digital stethoscope DigiScope. This exploratory study suggests that motifs contain valuable information that can be further exploited for Heart Sound Classification.
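The attribute-construction step can be illustrated as follows (a toy sketch: slide a window over a SAX-discretised signal and count each subword as a motif candidate; the actual multiresolution method is more elaborate):

```python
from collections import Counter

# Illustrative sketch: motif-frequency features from a symbolic signal.
# Each subword of a fixed length becomes a candidate motif, and its
# count becomes an attribute for the classifier.

def motif_counts(sax_string, motif_length):
    """Count every subword of the given length in the symbolic signal."""
    windows = (sax_string[i:i + motif_length]
               for i in range(len(sax_string) - motif_length + 1))
    return Counter(windows)

# Symbolic (SAX-discretised) version of one heart-sound recording
signal = "abbaabbaabba"
counts = motif_counts(signal, 4)
print(counts.most_common(2))
```

Running motif discovery at several window lengths (resolutions) yields one such feature set per resolution, which the decision tree or random forest then consumes.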
************************************************
Presenter: Rui Camacho
Title: From Logic Programming to Inductive Logic Programming: a new approach to parallelise ILP systems
Abstract: Inductive Logic Programming (ILP) is a flavour of Multi-relational Data Mining. The use of a powerful representation formalism (First Order Logic) makes ILP suitable for handling data with structure and for constructing highly complex and comprehensible models. However, ILP systems usually exhibit very long run times. In this talk we present some previous attempts to speed up ILP systems and present a new approach based on parallel execution. The approach builds on a long-established and well-known technique from AND-parallel Logic Programming. Apart from the speedup achieved by parallel execution, a new type of pruning was defined: coverage-equivalence pruning. This new type of pruning avoids constructing a substantial number of "useless" clauses. An implementation was made by adapting the Aleph system. The new approach has been empirically evaluated and the results are promising.