Publications (10 of 64)
Buendia, R., Kogej, T., Engkvist, O., Carlsson, L., Linusson, H., Johansson, U., . . . Ahlberg, E. (2019). Accurate Hit Estimation for Iterative Screening Using Venn-ABERS Predictors. Journal of Chemical Information and Modeling, 59(3), 1230-1237
2019 (English). In: Journal of Chemical Information and Modeling, ISSN 1549-9596, E-ISSN 1549-960X, Vol. 59, no 3, p. 1230-1237. Article in journal (Refereed), Published
Abstract [en]

Iterative screening has emerged as a promising approach to increase the efficiency of high-throughput screening (HTS) campaigns in drug discovery. By learning from a subset of the compound library, predictive models can infer which compounds to screen next. One of the challenges of iterative screening is to decide how many iterations to perform. This is mainly related to difficulties in estimating the prospective hit rate in any given iteration. In this article, a novel method based on Venn-ABERS predictors is proposed. The method provides accurate estimates of the number of hits retrieved in any given iteration during an HTS campaign. The estimates provide the necessary information to support the decision on the number of iterations needed to maximize the screening outcome. Thus, this method offers a prospective screening strategy for early-stage drug discovery.

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2019
National Category
Computer Sciences
Identifiers
urn:nbn:se:hj:diva-43510 (URN); 10.1021/acs.jcim.8b00724 (DOI); 000462943700027 (); 30726080 (PubMedID); 2-s2.0-85063371683 (Scopus ID); JTHDatateknikIS (Local ID); JTHDatateknikIS (Archive number); JTHDatateknikIS (OAI)
Funder
Knowledge Foundation, 20150185
Available from: 2019-04-23 Created: 2019-04-23 Last updated: 2019-08-22. Bibliographically approved
Johansson, U., Löfström, T. & Boström, H. (2019). Calibrating probability estimation trees using Venn-Abers predictors. In: SIAM International Conference on Data Mining, SDM 2019. Paper presented at 19th SIAM International Conference on Data Mining, SDM 2019, Hyatt Regency Calgary, Calgary, Canada, 2 - 4 May 2019 (pp. 28-36). Society for Industrial and Applied Mathematics
2019 (English). In: SIAM International Conference on Data Mining, SDM 2019, Society for Industrial and Applied Mathematics, 2019, p. 28-36. Conference paper, Published paper (Refereed)
Abstract [en]

Class labels output by standard decision trees are not very useful for making informed decisions, e.g., when comparing the expected utility of various alternatives. In contrast, probability estimation trees (PETs) output class probability distributions rather than single class labels. It is well known that estimating class probabilities in PETs by relative frequencies often leads to extreme probability estimates, and a number of approaches to provide more well-calibrated estimates have been proposed. In this study, a recent model-agnostic calibration approach, called Venn-Abers predictors, is, for the first time, considered in the context of decision trees. Results from a large-scale empirical investigation are presented, comparing the novel approach to previous calibration techniques with respect to several different performance metrics, targeting both predictive performance and reliability of the estimates. All approaches are considered both with and without Laplace correction. The results show that using Venn-Abers predictors for calibration is a highly competitive approach, significantly outperforming Platt scaling, isotonic regression and no calibration, with respect to almost all performance metrics used, independently of whether Laplace correction is applied or not. The only exception is AUC, where using non-calibrated PETs together with Laplace correction is actually the best option, which can be explained by the fact that AUC is affected only by the relative, not the absolute, values of the probability estimates.

Place, publisher, year, edition, pages
Society for Industrial and Applied Mathematics, 2019
Keywords
Calibration, Data mining, Decision trees, Forestry, Laplace transforms, Calibration techniques, Class probabilities, Empirical investigation, Performance metrics, Predictive performance, Probability estimate, Probability estimation trees, Relative frequencies, Probability distributions
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:hj:diva-44355 (URN); 10.1137/1.9781611975673.4 (DOI); 2-s2.0-85066082095 (Scopus ID); 9781611975673 (ISBN)
Conference
19th SIAM International Conference on Data Mining, SDM 2019, Hyatt Regency Calgary, Calgary, Canada, 2 - 4 May 2019
Available from: 2019-06-11 Created: 2019-06-11 Last updated: 2019-08-22. Bibliographically approved
Johansson, U., Löfström, T., Linusson, H. & Boström, H. (2019). Efficient Venn Predictors using Random Forests. Machine Learning, 108(3), 535-550
2019 (English). In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 108, no 3, p. 535-550. Article in journal (Refereed), Published
Abstract [en]

Successful use of probabilistic classification requires well-calibrated probability estimates, i.e., the predicted class probabilities must correspond to the true probabilities. In addition, a probabilistic classifier must, of course, also be as accurate as possible. In this paper, Venn predictors, and their special case, Venn-Abers predictors, are evaluated for probabilistic classification, using random forests as the underlying models. Venn predictors output multiple probabilities for each label, i.e., the predicted label is associated with a probability interval. Since all Venn predictors are valid in the long run, the size of the probability intervals is very important, with tighter intervals being more informative. The standard solution when calibrating a classifier is to employ an additional step, transforming the outputs from a classifier into probability estimates, using a labeled data set not employed for training of the models. For random forests, and other bagged ensembles, it is, however, possible to use the out-of-bag instances for calibration, making all training data available for both model learning and calibration. This procedure has previously been successfully applied to conformal prediction, but was here evaluated for the first time for Venn predictors. The empirical investigation, using 22 publicly available data sets, showed that all four versions of the Venn predictors were better calibrated than both the raw estimates from the random forest and the standard techniques Platt scaling and isotonic regression. Regarding both informativeness and accuracy, the standard Venn predictor calibrated on out-of-bag instances was the best setup evaluated. Most importantly, calibrating on out-of-bag instances, instead of using a separate calibration set, resulted in tighter intervals and more accurate models on every data set, for both the Venn predictors and the Venn-Abers predictors.

Place, publisher, year, edition, pages
Springer, 2019
Keywords
Probabilistic prediction, Venn predictors, Venn-Abers predictors, Random forests, Out-of-bag calibration
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:hj:diva-41127 (URN); 10.1007/s10994-018-5753-x (DOI); 000459945900008 (); 2-s2.0-85052523706 (Scopus ID); HOA JTH 2019 (Local ID); HOA JTH 2019 (Archive number); HOA JTH 2019 (OAI)
Available from: 2018-08-13 Created: 2018-08-13 Last updated: 2019-08-22. Bibliographically approved
Löfström, T., Johansson, U., Balkow, J. & Sundell, H. (2018). A data-driven approach to online fitting services. In: Liu, J, Lu, J, Xu, Y, Martinez, L & Kerre, EE (Ed.), Data Science And Knowledge Engineering For Sensing Decision Support. Paper presented at 13th International Conference on Fuzzy Logic and Intelligent Technologies in Nuclear Science (FLINS), Belfast, Ireland, 21-24 August, 2018 (pp. 1559-1566). World Scientific, 11
2018 (English). In: Data Science And Knowledge Engineering For Sensing Decision Support / [ed] Liu, J, Lu, J, Xu, Y, Martinez, L & Kerre, EE, World Scientific, 2018, Vol. 11, p. 1559-1566. Conference paper, Published paper (Refereed)
Abstract [en]

Being able to accurately predict several attributes related to size is vital for services supporting online fitting. In this paper, we investigate a data-driven approach, while comparing two different supervised modeling techniques for predictive regression: standard multiple linear regression and neural networks. Using a fairly large, publicly available, data set of high quality, the main results are somewhat discouraging. Specifically, it is questionable whether key attributes like sleeve length, neck size, waist and chest can be modeled accurately enough using easily accessible input variables such as sex, weight and height. This is despite the fact that several services online offer exactly this functionality. For this specific task, the results show that standard linear regression was as accurate as the potentially more powerful neural networks. Most importantly, comparing the predictions to reasonable levels for acceptable errors, it was found that an overwhelming majority of all instances had at least one attribute with an unacceptably high prediction error. In fact, when all variables were required to be predicted with acceptable accuracy, less than 5% of all instances met that criterion. Specifically, for females, the success rate was as low as 1.8%.

Place, publisher, year, edition, pages
World Scientific, 2018
Series
World Scientific Proceedings Series on Computer Engineering and Information Science ; 11
Keywords
Predictive regression; online fitting; fashion
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:hj:diva-44183 (URN); 10.1142/9789813273238_0194 (DOI); 000468160600194 (); 978-981-3273-24-5 (ISBN); 978-981-3273-22-1 (ISBN)
Conference
13th International Conference on Fuzzy Logic and Intelligent Technologies in Nuclear Science (FLINS), Belfast, Ireland, 21-24 August, 2018
Available from: 2019-06-11 Created: 2019-06-11 Last updated: 2019-08-22. Bibliographically approved
Linusson, H., Johansson, U., Boström, H. & Löfström, T. (2018). Classification with reject option using conformal prediction. In: Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part I. Paper presented at 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018; Melbourne; Australia; 3 June 2018 through 6 June 2018 (pp. 94-105). Springer
2018 (English). In: Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part I, Springer, 2018, p. 94-105. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we propose a practically useful means of interpreting the predictions produced by a conformal classifier. The proposed interpretation leads to a classifier with a reject option that allows the user to limit the number of erroneous predictions made on the test set, without any need to reveal the true labels of the test objects. The method described in this paper works by estimating the cumulative error count on a set of predictions provided by a conformal classifier, ordered by their confidence. Given a test set and a user-specified parameter k, the proposed classification procedure outputs the largest possible number of predictions containing on average at most k errors, while refusing to make predictions for test objects where it is too uncertain. We conduct an empirical evaluation using benchmark datasets, and show that we are able to provide accurate estimates for the error rate on the test set.

Place, publisher, year, edition, pages
Springer, 2018
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 10937
Keywords
Data mining, Errors, Forecasting, Testing, Uncertainty analysis, Benchmark datasets, Classification procedure, Conformal predictions, Cumulative errors, Empirical evaluations, Error rate, Test object, Test sets, Classification (of information)
National Category
Computer Sciences
Identifiers
urn:nbn:se:hj:diva-41260 (URN); 10.1007/978-3-319-93034-3_8 (DOI); 000443224400008 (); 2-s2.0-85049360232 (Scopus ID); 9783319930336 (ISBN)
Conference
22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018; Melbourne; Australia; 3 June 2018 through 6 June 2018
Funder
Knowledge Foundation
Available from: 2018-08-27 Created: 2018-08-27 Last updated: 2019-08-22. Bibliographically approved
Sundell, H., Löfström, T. & Johansson, U. (2018). Explorative multi-objective optimization of marketing campaigns for the fashion retail industry. In: Liu, J, Lu, J, Xu, Y, Martinez, L & Kerre, EE (Ed.), Data Science And Knowledge Engineering For Sensing Decision Support. Paper presented at 13th International Conference on Fuzzy Logic and Intelligent Technologies in Nuclear Science (FLINS), Belfast, Ireland, 21-24 August, 2018 (pp. 1551-1558). World Scientific, 11
2018 (English). In: Data Science And Knowledge Engineering For Sensing Decision Support / [ed] Liu, J, Lu, J, Xu, Y, Martinez, L & Kerre, EE, World Scientific, 2018, Vol. 11, p. 1551-1558. Conference paper, Published paper (Refereed)
Abstract [en]

We show how an exploratory tool for association rule mining can be used for efficient multi-objective optimization of marketing campaigns for companies within the fashion retail industry. We have earlier designed and implemented a novel digital tool for mining of association rules from given basket data. The tool supports efficient finding of frequent itemsets over multiple hierarchies and interactive visualization of corresponding association rules together with numerical attributes. Normally when optimizing a marketing campaign, factors that cause an increased level of activation among the recipients could in fact reduce the profit, i.e., these factors need to be balanced, rather than optimized individually. Using the tool, we can identify important factors that influence the search for an optimal campaign with respect to both activation and profit. We show empirical results from a real-world case study using campaign data from a well-established company within the fashion retail industry, demonstrating how activation and profit can be simultaneously targeted, using computer-generated algorithms as well as human-controlled visualization.

Place, publisher, year, edition, pages
World Scientific, 2018
Series
World Scientific Proceedings Series on Computer Engineering and Information Science ; 11
Keywords
Association rules; marketing; visualization; Pareto front
National Category
Computer Sciences
Identifiers
urn:nbn:se:hj:diva-44349 (URN); 10.1142/9789813273238_0193 (DOI); 000468160600193 (); 978-981-3273-24-5 (ISBN); 978-981-3273-22-1 (ISBN)
Conference
13th International Conference on Fuzzy Logic and Intelligent Technologies in Nuclear Science (FLINS), Belfast, Ireland, 21-24 August, 2018
Available from: 2019-06-11 Created: 2019-06-11 Last updated: 2019-08-22. Bibliographically approved
Johansson, U., Linusson, H., Löfström, T. & Boström, H. (2018). Interpretable regression trees using conformal prediction. Expert Systems with Applications, 97, 394-404
2018 (English). In: Expert Systems with Applications, ISSN 0957-4174, E-ISSN 1873-6793, Vol. 97, p. 394-404. Article in journal (Refereed), Published
Abstract [en]

A key property of conformal predictors is that they are valid, i.e., their error rate on novel data is bounded by a preset level of confidence. For regression, this is achieved by turning the point predictions of the underlying model into prediction intervals. Thus, the most important performance metric for evaluating conformal regressors is not the error rate, but the size of the prediction intervals, where models generating smaller (more informative) intervals are said to be more efficient. State-of-the-art conformal regressors typically utilize two separate predictive models: the underlying model providing the center point of each prediction interval, and a normalization model used to scale each prediction interval according to the estimated level of difficulty for each test instance. When using a regression tree as the underlying model, this approach may cause test instances falling into a specific leaf to receive different prediction intervals. This clearly deteriorates the interpretability of a conformal regression tree compared to a standard regression tree, since the path from the root to a leaf can no longer be translated into a rule explaining all predictions in that leaf. In fact, the model cannot even be interpreted on its own, i.e., without reference to the corresponding normalization model. Current practice effectively presents two options for constructing conformal regression trees: to employ a (global) normalization model, and thereby sacrifice interpretability; or to avoid normalization, and thereby sacrifice both efficiency and individualized predictions. In this paper, two additional approaches are considered, both employing local normalization: the first approach estimates the difficulty by the standard deviation of the target values in each leaf, while the second approach employs Mondrian conformal prediction, which results in regression trees where each rule (path from root node to leaf node) is independently valid. 
An empirical evaluation shows that the first approach is as efficient as current state-of-the-art approaches, thus eliminating the efficiency vs. interpretability trade-off present in existing methods. Moreover, it is shown that if a validity guarantee is required for each single rule, as provided by the Mondrian approach, a penalty with respect to efficiency has to be paid, but it is only substantial at very high confidence levels.

Place, publisher, year, edition, pages
Elsevier, 2018
Keywords
Conformal prediction, Interpretability, Predictive regression, Regression trees, Economic and social effects, Efficiency, Forestry, Query processing, Regression analysis, Conformal predictions, Conformal predictors, Empirical evaluations, Level of difficulties, State-of-the-art approach, Forecasting
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:hj:diva-38624 (URN); 10.1016/j.eswa.2017.12.041 (DOI); 000425074100030 (); 2-s2.0-85040125577 (Scopus ID)
Available from: 2018-01-22 Created: 2018-01-22 Last updated: 2019-08-22. Bibliographically approved
Johansson, U., Löfström, T., Sundell, H., Linusson, H., Gidenstam, A. & Boström, H. (2018). Venn predictors for well-calibrated probability estimation trees. In: A. Gammerman, V. Vovk, Z. Luo, E. Smirnov, & R. Peeters (Ed.), Conformal and Probabilistic Prediction and Applications. Paper presented at Proceedings of the Seventh Workshop on Conformal and Probabilistic Prediction and Applications, 11-13 June 2018 (pp. 3-14).
2018 (English). In: Conformal and Probabilistic Prediction and Applications / [ed] A. Gammerman, V. Vovk, Z. Luo, E. Smirnov, & R. Peeters, 2018, p. 3-14. Conference paper, Published paper (Refereed)
Abstract [en]

Successful use of probabilistic classification requires well-calibrated probability estimates, i.e., the predicted class probabilities must correspond to the true probabilities. The standard solution is to employ an additional step, transforming the outputs from a classifier into probability estimates. In this paper, Venn predictors are compared to Platt scaling and isotonic regression, for the purpose of producing well-calibrated probabilistic predictions from decision trees. The empirical investigation, using 22 publicly available data sets, showed that the probability estimates from the Venn predictor were extremely well-calibrated. In fact, in a direct comparison using the accepted reliability metric, the Venn predictor estimates were the most exact on every data set.

Series
Proceedings of Machine Learning Research, ISSN 2640-3498 ; 91
Keywords
Venn predictors, Calibration, Decision trees, Reliability
National Category
Computer Sciences
Identifiers
urn:nbn:se:hj:diva-43505 (URN)
Conference
Proceedings of the Seventh Workshop on Conformal and Probabilistic Prediction and Applications, 11-13 June 2018
Available from: 2019-04-23 Created: 2019-04-23 Last updated: 2019-08-22. Bibliographically approved
Johansson, U., Löfström, T. & Sundell, H. (2018). Venn predictors using lazy learners. In: R. Stahlbock, G. M. Weiss & M. Abou-Nasr (Ed.), Proceedings of the 2018 International Conference on Data Science, ICDATA'18. Paper presented at The 2018 World Congress in Computer Science, Computer Engineering & Applied Computing, July 30 - August 02, Las Vegas, Nevada, USA (pp. 220-226). CSREA Press
2018 (English). In: Proceedings of the 2018 International Conference on Data Science, ICDATA'18 / [ed] R. Stahlbock, G. M. Weiss & M. Abou-Nasr, CSREA Press, 2018, p. 220-226. Conference paper, Published paper (Refereed)
Abstract [en]

Probabilistic classification requires well-calibrated probability estimates, i.e., the predicted class probabilities must correspond to the true probabilities. Venn predictors, which can be used on top of any classifier, are automatically valid multiprobability predictors, making them extremely suitable for probabilistic classification. A Venn predictor outputs multiple probabilities for each label, so the predicted label is associated with a probability interval. While all Venn predictors are valid, their accuracy and the size of the probability interval are dependent on both the underlying model and some interior design choices. Specifically, all Venn predictors use so-called Venn taxonomies for dividing the instances into a number of categories, each such taxonomy defining a different Venn predictor. A frequently used, but very basic, taxonomy is to categorize the instances based on their predicted label. In this paper, we investigate some more fine-grained taxonomies that use not only the predicted label but also some measures related to the confidence in individual predictions. The empirical investigation, using 22 publicly available data sets and lazy learners (kNN) as the underlying models, showed that the probability estimates from the Venn predictors, as expected, were extremely well-calibrated. Most importantly, using the basic (i.e., label-based) taxonomy produced significantly more accurate and informative Venn predictors compared to the more complex alternatives. In addition, the results also showed that when using lazy learners as underlying models, a transductive approach significantly outperformed an inductive one with regard to accuracy and informativeness. This result is in contrast to previous studies, where other underlying models were used.
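
The label-based taxonomy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an inductive setting with a held-out calibration set, and the function name `venn_intervals` and the toy data are hypothetical. For a test instance, the category is the set of calibration instances sharing its predicted label; each possible label is tentatively assigned to the test instance, the label frequencies in the category are recomputed, and the minimum and maximum frequency per label form the probability interval.

```python
from collections import Counter


def venn_intervals(calibration, predicted_label, labels):
    """Inductive Venn predictor with a label-based taxonomy (illustrative sketch).

    calibration: list of (predicted_label, true_label) pairs from a held-out set.
    predicted_label: the underlying model's prediction for the test instance.
    labels: all possible class labels.
    Returns {label: (lower, upper)} probability intervals.
    """
    # Taxonomy: the category holds calibration instances with the same predicted label.
    category = [true for pred, true in calibration if pred == predicted_label]
    total = len(category) + 1  # the test instance joins the category

    rows = []
    for tentative in labels:
        # Tentatively give the test instance this label and recompute frequencies.
        counts = Counter(category)
        counts[tentative] += 1
        rows.append({lab: counts[lab] / total for lab in labels})

    # The interval for each label spans its frequencies over all tentative labels.
    return {
        lab: (min(row[lab] for row in rows), max(row[lab] for row in rows))
        for lab in labels
    }


# Hypothetical calibration data: (predicted, true) pairs.
toy_calibration = [("a", "a"), ("a", "a"), ("a", "b"), ("b", "b"), ("b", "a"), ("a", "a")]
toy_result = venn_intervals(toy_calibration, "a", ["a", "b"])
```

The interval width shrinks as the category grows, since the single tentative label changes each frequency by at most 1/(n+1); this is one way to see why the number of instances per category matters for informativeness.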

Place, publisher, year, edition, pages
CSREA Press, 2018
National Category
Computer Sciences
Identifiers
urn:nbn:se:hj:diva-43506 (URN); 1-60132-481-2 (ISBN)
Conference
The 2018 World Congress in Computer Science, Computer Engineering & Applied Computing, July 30 - August 02, Las Vegas, Nevada, USA
Funder
Knowledge Foundation, 20150185
Available from: 2019-04-23 Created: 2019-04-23 Last updated: 2019-08-22. Bibliographically approved
Boström, H., Linusson, H., Löfström, T. & Johansson, U. (2017). Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence, 81(1-2), 125-144
2017 (English). In: Annals of Mathematics and Artificial Intelligence, ISSN 1012-2443, E-ISSN 1573-7470, Vol. 81, no 1-2, p. 125-144. Article in journal (Refereed), Published
Abstract [en]

The conformal prediction framework allows for specifying the probability of making incorrect predictions by a user-provided confidence level. In addition to a learning algorithm, the framework requires a real-valued function, called nonconformity measure, to be specified. The nonconformity measure does not affect the error rate, but the resulting efficiency, i.e., the size of output prediction regions, may vary substantially. A recent large-scale empirical evaluation of conformal regression approaches showed that using random forests as the learning algorithm together with a nonconformity measure based on out-of-bag errors normalized using a nearest-neighbor-based difficulty estimate, resulted in state-of-the-art performance with respect to efficiency. However, the nearest-neighbor procedure incurs a significant computational cost. In this study, a more straightforward nonconformity measure is investigated, where the difficulty estimate employed for normalization is based on the variance of the predictions made by the trees in a forest. A large-scale empirical evaluation is presented, showing that both the nearest-neighbor-based and the variance-based measures significantly outperform a standard (non-normalized) nonconformity measure, while no significant difference in efficiency between the two normalized approaches is observed. The evaluation moreover shows that the computational cost of the variance-based measure is several orders of magnitude lower than when employing the nearest-neighbor-based nonconformity measure. The use of out-of-bag instances for calibration does, however, result in nonconformity scores that are distributed differently from those obtained from test instances, questioning the validity of the approach. An adjustment of the variance-based measure is presented, which is shown to be valid and also to have a significant positive effect on the efficiency. 
For conformal regression forests, the variance-based nonconformity measure is hence a computationally efficient and theoretically well-founded alternative to the nearest-neighbor procedure. 
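
As a rough illustration of the normalized nonconformity measure described in the abstract, the sketch below builds prediction intervals from calibration errors scaled by a per-instance difficulty estimate. It is a generic inductive conformal regression sketch, not the paper's procedure: the function name `conformal_intervals` is hypothetical, the difficulty values are assumed to be supplied by the caller (for example, the variance of the individual tree predictions, possibly with a small constant added to keep them positive), and the out-of-bag adjustment discussed in the abstract is not modeled.

```python
import math


def conformal_intervals(cal_errors, cal_difficulty, test_pred, test_difficulty, epsilon):
    """Normalized inductive conformal regression (illustrative sketch).

    cal_errors: y - y_hat for each calibration instance.
    cal_difficulty: positive difficulty estimate for each calibration instance.
    test_pred, test_difficulty: point predictions and difficulty estimates for test instances.
    epsilon: accepted error rate, e.g. 0.1 for 90% confidence.
    """
    # Normalized nonconformity score: absolute error divided by estimated difficulty.
    scores = sorted(abs(e) / d for e, d in zip(cal_errors, cal_difficulty))
    n = len(scores)

    # Rank of the calibration score that bounds the error rate by epsilon.
    rank = math.ceil((1 - epsilon) * (n + 1))
    if rank > n:
        # Too few calibration instances for this confidence level.
        return [(-math.inf, math.inf) for _ in test_pred]
    alpha = scores[rank - 1]

    # Each interval is scaled by the test instance's own difficulty estimate.
    return [(p - alpha * d, p + alpha * d) for p, d in zip(test_pred, test_difficulty)]
```

Instances judged more difficult receive wider intervals and easier ones tighter intervals, which is the sense in which normalization affects efficiency without changing the error rate.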

Place, publisher, year, edition, pages
Springer, 2017
Keywords
Conformal prediction, Nonconformity measures, Random forests, Regression
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:hj:diva-35193 (URN); 10.1007/s10472-017-9539-9 (DOI); 000407425000008 (); 2-s2.0-85014124316 (Scopus ID)
Available from: 2017-03-13 Created: 2017-03-13 Last updated: 2019-08-22. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0412-6199
