How to Estimate Probabilities From Data for Continuous Attributes
Probability Estimate
The Windham Portfolio Advisor
Mark Kritzman, in Optimizing Optimization, 2010
End-of-horizon exposure to loss
We estimate the probability of loss at the end of the horizon by: (1) calculating the difference between the cumulative percentage loss and the cumulative expected return, (2) dividing this difference by the cumulative standard deviation, and (3) applying the normal distribution function to convert this standardized distance from the mean to a probability estimate, as shown in Equation (4.1).

$$P(\text{loss}) = N\!\left[\frac{\ln(1+L) - \mu T}{\sigma\sqrt{T}}\right] \qquad (4.1)$$

where

- N[ ] = cumulative normal distribution function;
- ln = natural logarithm;
- L = cumulative percentage loss in periodic units;
- μ = annualized expected return in continuous units;
- T = number of years in horizon;
- σ = annualized standard deviation of continuous returns.
The process of compounding causes periodic returns to be lognormally distributed. The continuous counterparts of these periodic returns are normally distributed, which is why the inputs to the normal distribution function are in continuous units.
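To make Equation (4.1) concrete, here is a minimal Python sketch; the function name and the example inputs are ours, not the chapter's:

```python
from math import log, sqrt
from statistics import NormalDist

def probability_of_loss(L, mu, sigma, T):
    """Equation (4.1): probability of a cumulative loss of L or worse.
    L is a periodic (simple) return, e.g., -0.10 for a 10% loss; mu and
    sigma are annualized continuous-return parameters; T is in years."""
    standardized = (log(1.0 + L) - mu * T) / (sigma * sqrt(T))
    return NormalDist().cdf(standardized)

# Illustrative inputs: 10% loss threshold, mu = 8%, sigma = 15%, 3 years.
print(probability_of_loss(-0.10, 0.08, 0.15, 3))   # ~0.09
```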
When we estimate value at risk, we turn this calculation around by specifying the probability and solving for the loss amount, as shown:
$$\text{VaR} = -\left(e^{\mu T + Z\sigma\sqrt{T}} - 1\right)W \qquad (4.2)$$

where

- e = base of natural logarithm (2.718282);
- Z = normal deviate associated with chosen probability;
- W = initial wealth.
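A companion sketch for Equation (4.2), under the same assumptions (the 5% probability level and the initial wealth are illustrative):

```python
from math import exp, sqrt
from statistics import NormalDist

def value_at_risk(probability, mu, sigma, T, W):
    """Equation (4.2): the loss amount at the chosen probability level."""
    Z = NormalDist().inv_cdf(probability)   # e.g., -1.645 at 5%
    return -(exp(mu * T + Z * sigma * sqrt(T)) - 1.0) * W

# Illustrative inputs: 5% level, mu = 8%, sigma = 15%, 3 years, $1,000,000.
print(value_at_risk(0.05, 0.08, 0.15, 3, 1_000_000))   # ~171,000
```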
Source: https://www.sciencedirect.com/science/article/pii/B978012374952900004X
Algorithms
Ian H. Witten, ... Christopher J. Pal, in Data Mining (Fourth Edition), 2017
Linear Classification Using the Perceptron
Logistic regression attempts to produce accurate probability estimates by maximizing the probability of the training data. Of course, accurate probability estimates lead to accurate classifications. However, it is not necessary to perform probability estimation if the sole purpose of the model is to predict class labels. A different approach is to learn a hyperplane that separates the instances pertaining to the different classes—let's assume that there are only two of them. If the data can be separated perfectly into two groups using a hyperplane, it is said to be linearly separable. It turns out that if the data is linearly separable, there is a very simple algorithm for finding a separating hyperplane.
The algorithm is called the perceptron learning rule. Before looking at it in detail, let's examine the equation for a hyperplane again:

$$w_0 a_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k = 0$$

Here, a_1, a_2, …, a_k are the attribute values, and w_0, w_1, …, w_k are the weights that define the hyperplane. We will assume that each training instance a_1, a_2, … is extended by an additional attribute a_0 that always has the value 1 (as we did in the case of linear regression). This extension, which is called the bias, just means that we don't have to include an additional constant element in the sum. If the sum is greater than zero, we will predict the first class; otherwise, we will predict the second class. We want to find values for the weights so that the training data is correctly classified by the hyperplane.
Fig. 4.11A gives the perceptron learning rule for finding a separating hyperplane. The algorithm iterates until a perfect solution has been found, but it will only work properly if a separating hyperplane exists, i.e., if the data is linearly separable. Each iteration goes through all the training instances. If a misclassified instance is encountered, the parameters of the hyperplane are changed so that the misclassified instance moves closer to the hyperplane or maybe even across the hyperplane onto the correct side. If the instance belongs to the first class, this is done by adding its attribute values to the weight vector; otherwise, they are subtracted from it.
To see why this works, consider the situation after an instance a pertaining to the first class has been added:

$$(w_0 + a_0)a_0 + (w_1 + a_1)a_1 + (w_2 + a_2)a_2 + \cdots + (w_k + a_k)a_k$$

This means the output for a has increased by

$$a_0 \times a_0 + a_1 \times a_1 + a_2 \times a_2 + \cdots + a_k \times a_k$$

This number is always positive. Thus the hyperplane has moved in the correct direction for classifying instance a as positive. Conversely, if an instance belonging to the second class is misclassified, the output for that instance decreases after the modification, again moving the hyperplane in the correct direction.
These corrections are incremental and can interfere with earlier updates. However, it can be shown that the algorithm converges in a finite number of iterations if the data is linearly separable. Of course, if the data is not linearly separable, the algorithm will not terminate, so an upper bound needs to be imposed on the number of iterations when this method is applied in practice.
The resulting hyperplane is called a perceptron, and it's the grandfather of neural networks (we return to neural networks in Section 7.2 and the chapter on deep learning). Fig. 4.11B represents the perceptron as a graph with nodes and weighted edges, imaginatively termed a "network" of "neurons." There are two layers of nodes: input and output. The input layer has one node for every attribute, plus an extra node that is always set to one. The output layer consists of just one node. Every node in the input layer is connected to the output layer. The connections are weighted, and the weights are those numbers found by the perceptron learning rule.
When an instance is presented to the perceptron, its attribute values serve to "activate" the input layer. They are multiplied by the weights and summed up at the output node. If the weighted sum is greater than 0 the output signal is 1, representing the first class; otherwise, it is −1, representing the second.
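The learning rule and the prediction rule just described can be sketched in a few lines of Python; the function name, data layout, and the ±1 label encoding are our choices:

```python
import numpy as np

def perceptron(X, y, max_iter=100):
    """Perceptron learning rule (sketch). X: (n, k) attribute matrix;
    y: labels, +1 for the first class and -1 for the second."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias attribute a0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):                      # bound the iterations
        mistakes = 0
        for a, label in zip(X, y):
            if label * (w @ a) <= 0:               # instance misclassified
                w += label * a                     # add or subtract its values
                mistakes += 1
        if mistakes == 0:                          # separating hyperplane found
            break
    return w

# Linearly separable toy data (logical AND): only (1, 1) is in class +1.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w = perceptron(X, y)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))   # [-1, -1, -1, 1]
```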
Source: https://www.sciencedirect.com/science/article/pii/B9780128042915000040
The Explorer
Ian H. Witten, ... Mark A. Hall, in Data Mining (Third Edition), 2011
Combining Classifiers
Vote provides a baseline method for combining classifiers. The default scheme is to average their probability estimates or numeric predictions, for classification and regression, respectively. Other combination schemes are available—for example, using majority voting for classification. MultiScheme selects the best classifier from a set of candidates using cross-validation of percentage accuracy or mean-squared error for classification and regression, respectively. The number of folds is a parameter. Performance on training data can be used instead.
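To illustrate Vote's default combination scheme (this is a sketch of the idea, not Weka's implementation):

```python
import numpy as np

def vote_average(prob_estimates):
    """Average the base classifiers' class probability estimates,
    as in Vote's default scheme (a sketch, not Weka's code)."""
    return np.mean(prob_estimates, axis=0)

# Three base classifiers' probability estimates for one two-class instance:
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.7, 0.3]])
combined = vote_average(probs)     # [0.733..., 0.266...]
print(combined.argmax())           # predicted class index: 0
```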
Stacking combines classifiers using stacking (see Section 8.7, page 369) for both classification and regression problems. You specify the base classifiers, the metalearner, and the number of cross-validation folds. StackingC implements a more efficient variant for which the metalearner must be a numeric prediction scheme (Seewald, 2002). In Grading, the inputs to the metalearner are base-level predictions that have been marked (i.e., "graded") as correct or incorrect. For each base classifier, a metalearner is learned that predicts when the base classifier will err. Just as stacking may be viewed as a generalization of voting, grading generalizes selection by cross-validation (Seewald and Fürnkranz, 2001).
Source: https://www.sciencedirect.com/science/article/pii/B9780123748560000110
Quality Software Development
William G. Bail, in Advances in Computers, 2006
3.2 Risk Likelihood
Because risks are potential, that is, they have not yet happened, there is uncertainty associated with their occurrence. We refer to this level of uncertainty as the likelihood of the risk. More precisely, the uncertainty is characterized by the likelihood that the trigger event will take place, thereby altering the status of the project in some way that will result in an adverse effect.
When we estimate the likelihood of occurrence, obtaining a high degree of accuracy is generally not possible. Determining the likelihood is not a definitive process. Rather, it is a function of the nature of the risk as well as the features of the project plan. Many guidelines have been created to assist in estimating the likelihood of risks. These are generally based on environmental factors, personnel factors, and technical factors.
There are many variations of how risk likelihood is represented. In some cases, projects use a probability estimate, ranging from 0.0 for impossible events, to 1.0 for certain events. In others, a simple High, Medium, and Low estimate is used. In still others, a category-based layering is used, with the categories being something like:
- Very unlikely—very low probability of happening.
- Unlikely—low probability but possible.
- Possible—may occur but not guaranteed.
- Likely—will probably occur, but not guaranteed.
- Very likely—highly probable of occurring at some point.
It is generally not advisable to define a large number of likelihood levels. For example, projects that attempt to define the probabilities to two digits of accuracy generally discover that such fine-grained differentiation is impossible to achieve, and often end up spending more time deliberating over the probability estimates than addressing the risk itself (see tetrapyloctomy2). Based on collective experience, projects generally settle for somewhere between three and five levels of likelihood.
In addition to the likelihood of occurrence, there is also the aspect of frequency of occurrence. In some cases, the risk might not be whether a specific event occurs, but rather how often it occurs. Generally, such risks can be defined in terms of frequency levels. That is, instead of the risk event being "The developer's workstation will fail, requiring a reboot," it would be stated as "The developer's workstation will fail more than ten times per day, requiring multiple reboots."
It is important to note, however, that risks do not generally have a fixed likelihood of occurrence. Most risks have a varying likelihood that changes over time. Due to conditions internal and external to the project, at times the risk may be more likely to occur, and at times less likely. Figure 3 illustrates this pattern.
Source: https://www.sciencedirect.com/science/article/pii/S0065245805660044
Video coding standards and formats
David R. Bull, Fan Zhang, in Intelligent Image and Video Compression (Second Edition), 2021
12.9.8 Entropy coding
VVC has further enhanced CABAC entropy coding by enabling context models and probabilities to be updated using a multihypothesis approach. This independently updates two probability estimates associated with each context model, each with a pretrained adaptation rate, and replaces the look-up tables used in HEVC.
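As a rough illustration of the multihypothesis idea, here is a simplified sketch; it is not the normative VVC update, and the 15-bit precision and shift rates 4 and 7 stand in for the pretrained, per-context adaptation rates:

```python
def multihypothesis_update(state, bin_val):
    """Two-hypothesis probability update (simplified sketch).
    state = (p_fast, p_slow): 15-bit fixed-point estimates of P(bin == 1).
    The shift rates 4 and 7 are illustrative stand-ins for the
    pretrained, per-context adaptation rates."""
    p_fast, p_slow = state
    target = bin_val << 15                  # 0 or 32768
    p_fast += (target - p_fast) >> 4        # fast-adapting hypothesis
    p_slow += (target - p_slow) >> 7        # slow-adapting hypothesis
    p_combined = (p_fast + p_slow) >> 1     # estimate used for coding
    return (p_fast, p_slow), p_combined

# Feeding a run of 1-bins drives the combined estimate toward 32768:
state, p = (16384, 16384), 16384
for b in [1, 1, 1, 0, 1]:
    state, p = multihypothesis_update(state, b)
```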
VVC also uses a different residual coding structure for transform coefficients to achieve more effective energy compaction, which is particularly useful for coding screen content. It also supports context modeling for coefficient coding, which selects probability models based on the absolute values and partially reconstructed absolute values of the neighbors of the current block.
Source: https://www.sciencedirect.com/science/article/pii/B9780128203538000219
Data transformations
Ian H. Witten, ... Christopher J. Pal, in Data Mining (Fourth Edition), 2017
Ensembles of Nested Dichotomies
Error-correcting output codes often produce accurate classifiers for multiclass problems. However, the basic algorithm produces classifications whereas often we would like class probability estimates as well—e.g., to perform cost-sensitive classification using the minimum expected cost approach discussed in Section 5.8. Fortunately, there is a method for decomposing multiclass problems into two-class ones that provides a natural way of computing class probability estimates, so long as the underlying two-class models are able to produce probabilities for the corresponding two-class subtasks.
The idea is to recursively split the full set of classes from the original multiclass problem into smaller and smaller subsets, while splitting the full dataset of instances into subsets corresponding to these subsets of classes. This yields a binary tree of classes. Consider the hypothetical 4-class problem discussed earlier. At the root node is the full set of classes {a, b, c, d}, which is split into disjoint subsets, say {a, c} and {b, d}, along with the instances pertaining to these two subsets of classes. The two subsets form the two successor nodes in the binary tree. These subsets are then split further into one-element sets, yielding successors {a} and {c} for the node {a, c}, and successors {b} and {d} for the node {b, d}. Once we reach one-element subsets, the splitting process stops.
The resulting binary tree of classes is called a nested dichotomy because each internal node and its two successors define a dichotomy—e.g., discriminating between classes {a, c} and {b, d} at the root node—and the dichotomies are nested in a hierarchy. We can view a nested dichotomy as a particular type of sparse output code. Table 8.3 shows the output code matrix for the example just discussed. There is one dichotomy for each internal node of the tree structure. Hence, given that the example involves three internal nodes, there are three columns in the code matrix. In contrast to the class vectors considered above, the matrix contains elements marked X that indicate that instances of the corresponding classes are simply omitted from the associated two-class learning problems.
| Class | Class Vector |
| --- | --- |
| a | 0 0 X |
| b | 1 X 0 |
| c | 0 1 X |
| d | 1 X 1 |
What is the advantage of this kind of output code? It turns out that, because the decomposition is hierarchical and yields disjoint subsets, there is a simple method for computing class probability estimates for each element in the original set of multiple classes, assuming two-class estimates for each dichotomy in the hierarchy. The reason is the chain rule from probability theory, which we will encounter again when discussing Bayesian networks in Section 9.2.
Suppose we want to compute the probability for class a given a particular instance x, i.e., the conditional probability P(a|x). This class corresponds to one of the four leaf nodes in the hierarchy of classes in the above example. First, we learn two-class models that yield class probability estimates for the three two-class datasets at the internal nodes of the hierarchy. Then, from the two-class model at the root node, an estimate of the conditional probability P({a, c}|x)—namely, that x belongs to either a or c—can be obtained. Moreover, we can obtain an estimate of P({a}|x, {a, c})—the probability that x belongs to a given that we already know that it belongs to either a or c—from the model that discriminates between the one-element sets {a} and {c}. Now, based on the chain rule, P({a}|x)=P({a}|{a, c}, x)×P({a, c}|x). Hence to compute the probability for any individual class of the original multiclass problem—any leaf node in the tree of classes—we simply multiply together the probability estimates collected from the internal nodes encountered when proceeding from the root node to this leaf node: the probability estimates for all subsets of classes that contain the target class.
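The chain-rule computation is straightforward to express in code. The sketch below assumes a hypothetical Node structure and per-node two-class models exposing a prob_left(x) function; both are our inventions for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

@dataclass
class Node:
    """One node of a nested dichotomy: a subset of classes plus, for
    internal nodes, a two-class model prob_left(x) = P(left subset | x)."""
    classes: Set[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prob_left: Optional[Callable] = None

def class_probability(node, x, target):
    """P(target | x): multiply two-class estimates along the path (chain rule)."""
    if node.left is None:                    # leaf node: {target} itself
        return 1.0
    p = node.prob_left(x)                    # P(left subset of classes | x)
    if target in node.left.classes:
        return p * class_probability(node.left, x, target)
    return (1.0 - p) * class_probability(node.right, x, target)

# The tree from the text, with constant dummy models for illustration:
root = Node({"a", "b", "c", "d"},
            Node({"a", "c"}, Node({"a"}), Node({"c"}), lambda x: 0.7),
            Node({"b", "d"}, Node({"b"}), Node({"d"}), lambda x: 0.4),
            lambda x: 0.6)
print(class_probability(root, None, "a"))    # 0.6 * 0.7 = 0.42
```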
Assuming that the individual two-class models at the internal nodes produce accurate probability estimates, there is reason to believe that the multiclass probability estimates obtained using the chain rule will generally be accurate. However, it is clear that estimation errors will accumulate, causing problems for very deep hierarchies. A more basic issue is that in the above example we arbitrarily decided on a particular hierarchical decomposition of the classes. Perhaps there is some background knowledge regarding the domain concerned, in which case one particular hierarchy may be preferable because certain classes are known to be related, but this is generally not the case.
What can be done? If there is no reason a priori to prefer any particular decomposition, perhaps all of them should be considered, yielding an ensemble of nested dichotomies. Unfortunately, for any nontrivial number of classes there are too many potential dichotomies, making an exhaustive approach infeasible. But we could consider a subset, taking a random sample of possible tree structures, building two-class models for each internal node of each tree structure (with caching of models, given that the same two-class problem may occur in multiple trees), and then averaging the probability estimates for each individual class to obtain the final estimates.
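Sampling one random tree structure for such an ensemble might look like the following sketch, which reuses the Node type above and leaves the training of the per-node two-class models to the caller:

```python
import random

def random_dichotomy(classes):
    """Sample one random nested dichotomy over `classes` (sketch)."""
    classes = list(classes)
    if len(classes) == 1:
        return Node(set(classes))           # leaf: a one-element class set
    random.shuffle(classes)
    cut = random.randrange(1, len(classes)) # random split into two subsets
    left = random_dichotomy(classes[:cut])
    right = random_dichotomy(classes[cut:])
    return Node(set(classes), left, right)  # prob_left to be fitted per node

# An ensemble averages class_probability() over several sampled trees,
# caching two-class models shared between trees, as noted above.
```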
Empirical experiments show that this approach yields accurate multiclass classifiers and is able to improve predictive performance even in the case of classifiers, such as decision trees, that can deal with multiclass problems directly. In contrast to standard error-correcting output codes, the technique often works well even when the base learner is unable to model complex decision boundaries. The reason is that, generally speaking, learning is easier with fewer classes, so results become more successful the closer we get to the leaf nodes in the tree. This also explains why the pairwise classification technique described earlier works particularly well for simple models such as ones corresponding to hyperplanes: it creates the simplest possible dichotomies! Nested dichotomies appear to strike a useful balance between the simplicity of the learning problems that occur in pairwise classification—after all, the lowest-level dichotomies involve pairs of individual classes—and the power of the redundancy embodied in standard error-correcting output codes.
Source: https://www.sciencedirect.com/science/article/pii/B9780128042915000088
JPEG and JPEG2000
Rashid Ansari, ... Nasir Memon, in The Essential Guide to Image Processing, 2009
17.5.3 Entropy Coding
The symbols defined for DC and AC coefficients are entropy coded, mostly using Huffman coding or, optionally and infrequently, arithmetic coding based on the probability estimates of the symbols. Huffman coding is a method of VLC in which shorter code words are assigned to the more frequently occurring symbols in order to achieve an average symbol code word length that is as close to the symbol source entropy as possible. Huffman coding is optimal (meets the entropy bound) only when the symbol probabilities are integral powers of 1/2. The technique of arithmetic coding [42] provides a solution for approaching the theoretical bound of the source entropy. The baseline implementation of the JPEG standard uses Huffman coding only.
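To make the principle concrete, here is a minimal Huffman-code construction in Python; it illustrates the idea only and is not the table-building procedure of annex K:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(data):
    """Build a Huffman code for the symbols in `data` (minimal sketch).
    Frequent symbols receive shorter code words."""
    tiebreak = count()                      # avoids comparing dicts on ties
    heap = [(f, next(tiebreak), {s: ""}) for s, f in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_code("beekeeper"))   # 'e' gets the shortest code word
```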
If Huffman coding is used, then Huffman tables, up to a maximum of eight, are specified in the bitstream. The tables constructed should not contain code words that (a) are more than 16 bits long or (b) consist of all ones. Recommended tables are listed in annex K of the standard. If these tables are applied to the output of the quantizer shown in the first two columns of Fig. 17.9, then the algorithm produces the output bits shown in the following columns of the figure. The procedures for the specification and generation of the Huffman tables are identical to the ones used in the lossless standard [25].
Source: https://www.sciencedirect.com/science/article/pii/B9780123744579000172
Waveform selection for multistatic tracking of a maneuvering target
Ngoc Hung Nguyen, Kutluyıl Doğançay, in Signal Processing for Multistatic Radar Systems, 2020
2.4.3 Interacting multiple model – extended Kalman filter
Since the radar measurement vector is a nonlinear function of the target state (see (2.1), (2.2) and (2.16)–(2.19)), a nonlinear tracking algorithm must be used. Among the various options available, the extended Kalman filter (EKF) is the most computationally efficient, compared to more sophisticated nonlinear Kalman filtering algorithms such as the sigma-point Kalman filters (including the unscented and cubature Kalman filters) and the particle filters. For the target tracking problem under consideration, the EKF can produce good tracking performance thanks to the spatial diversity afforded by multistatic radar. The procedure for calculating the EKF estimates for the mth target dynamic model is given by
$$\hat{x}^m_{k|k-1} = F\,\hat{x}^m_{k-1|k-1} \qquad (2.22a)$$
$$P^m_{k|k-1} = F\,P^m_{k-1|k-1}\,F^{\top} + Q^m \qquad (2.22b)$$
$$\hat{z}^m_{k|k-1} = h\!\left(\hat{x}^m_{k|k-1}\right) \qquad (2.22c)$$
$$H^m_k = \left.\frac{\partial h}{\partial x}\right|_{x=\hat{x}^m_{k|k-1}} \qquad (2.22d)$$
$$S^m_k = H^m_k\,P^m_{k|k-1}\,(H^m_k)^{\top} + R_k \qquad (2.22e)$$
$$K^m_k = P^m_{k|k-1}\,(H^m_k)^{\top}\,(S^m_k)^{-1} \qquad (2.22f)$$
$$\hat{x}^m_{k|k} = \hat{x}^m_{k|k-1} + K^m_k\left(z_k - \hat{z}^m_{k|k-1}\right), \qquad P^m_{k|k} = \left(I - K^m_k H^m_k\right)P^m_{k|k-1} \qquad (2.22g)$$

where x̂^m_{k|j} and P^m_{k|j} are the state estimate and error covariance, respectively, at time instant k given measurements through time instant j, and H^m_k is the Jacobian matrix of the measurement function h with respect to the state x evaluated at x̂^m_{k|k-1}. The expression for the Jacobian matrix is provided in Section 2.8 (Appendix).
To track a maneuvering target, we employ the IMM algorithm, which can incorporate multiple models of target dynamics. The IMM algorithm is a technique of combining multiple state hypotheses by running multiple filters in parallel to obtain a better state estimate for targets with changing dynamics [23,24,35,43]. Specifically, the IMM algorithm treats the dynamic motion of the target as multiple switching models:
$$x_{k+1} = F\,x_k + w^{m_k}_k \qquad (2.23)$$

where m_k is a finite-state Markov chain (m_k ∈ {1, …, M}) that follows the transition probabilities π_{lm} for switching from model l to model m, and the covariance matrix Q^{m_k} of the process noise w^{m_k}_k is governed by m_k [35].
Given the model-conditioned state estimate x̂^l_{k|k} and error covariance P^l_{k|k}, and the model probability estimate (denoted as μ^l_k) for all models l from the previous time instant k, the state estimation of the IMM algorithm at time instant k+1 proceeds as follows.
1. Mixing of state estimates:
   - Compute the predicted model probabilities
     $$\mu^m_{k+1|k} = \sum_{l=1}^{M} \pi_{lm}\,\mu^l_k \qquad (2.24)$$
   - Compute the conditional model probabilities
     $$\mu^{l|m}_k = \frac{\pi_{lm}\,\mu^l_k}{\mu^m_{k+1|k}} \qquad (2.25)$$
   - Compute the mixed estimates and covariances
     $$\hat{x}^{0m}_{k|k} = \sum_{l=1}^{M} \mu^{l|m}_k\,\hat{x}^l_{k|k} \qquad (2.26a)$$
     $$P^{0m}_{k|k} = \sum_{l=1}^{M} \mu^{l|m}_k\left[P^l_{k|k} + \left(\hat{x}^l_{k|k} - \hat{x}^{0m}_{k|k}\right)\left(\hat{x}^l_{k|k} - \hat{x}^{0m}_{k|k}\right)^{\top}\right] \qquad (2.26b)$$
2. Model-conditioned updates: the mixed estimate x̂^{0m}_{k|k} and covariance P^{0m}_{k|k} are used as inputs to the mth EKF of (2.22) to compute the state estimate x̂^m_{k+1|k+1} and error covariance P^m_{k+1|k+1} at time instant k+1.
3. Model likelihood computations:
   $$\Lambda^m_{k+1} = \frac{\exp\!\left(-\tfrac{1}{2}\,(\nu^m_{k+1})^{\top}(S^m_{k+1})^{-1}\,\nu^m_{k+1}\right)}{\sqrt{\left|2\pi S^m_{k+1}\right|}} \qquad (2.27)$$
   where |·| denotes the determinant, and ν^m_{k+1} and S^m_{k+1} are the innovation and innovation covariance of the mth EKF.
4. Model probability updates:
   $$\mu^m_{k+1} = \frac{\Lambda^m_{k+1}\,\mu^m_{k+1|k}}{\kappa} \qquad (2.28)$$
   where κ is the normalization factor given by
   $$\kappa = \sum_{m=1}^{M} \Lambda^m_{k+1}\,\mu^m_{k+1|k} \qquad (2.29)$$
5. Combination of state estimates:
   $$\hat{x}_{k+1|k+1} = \sum_{m=1}^{M} \mu^m_{k+1}\,\hat{x}^m_{k+1|k+1} \qquad (2.30a)$$
   $$P_{k+1|k+1} = \sum_{m=1}^{M} \mu^m_{k+1}\left[P^m_{k+1|k+1} + \left(\hat{x}^m_{k+1|k+1} - \hat{x}_{k+1|k+1}\right)\left(\hat{x}^m_{k+1|k+1} - \hat{x}_{k+1|k+1}\right)^{\top}\right] \qquad (2.30b)$$
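The probability bookkeeping in steps 1, 3, and 4 can be sketched compactly with NumPy; the per-model EKF updates of step 2 and the likelihood evaluations of Equation (2.27) are assumed to be supplied by the caller:

```python
import numpy as np

def imm_mix_and_update(mu, Pi, likelihoods):
    """IMM probability bookkeeping (sketch of Eqs. (2.24), (2.25), (2.28),
    (2.29)). mu: model probabilities at time k, shape (M,); Pi[l, m]:
    probability of switching from model l to model m; likelihoods:
    per-model measurement likelihoods at k+1 from Eq. (2.27)."""
    mu_pred = Pi.T @ mu                              # (2.24)
    w_mix = (Pi * mu[:, None]) / mu_pred[None, :]    # (2.25), w_mix[l, m]
    mu_new = likelihoods * mu_pred                   # numerator of (2.28)
    mu_new = mu_new / mu_new.sum()                   # kappa of (2.29)
    return w_mix, mu_new

# Two models with sticky transitions; model 1's likelihood dominates:
Pi = np.array([[0.95, 0.05],
               [0.05, 0.95]])
w_mix, mu = imm_mix_and_update(np.array([0.5, 0.5]), Pi, np.array([0.8, 0.2]))
# The mixed estimates of (2.26) then weight each model's state estimate
# and covariance by the column w_mix[:, m].
```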
Source: https://www.sciencedirect.com/science/article/pii/B978012815314700010X
Mathematical foundations of social sensing
Dong Wang, ... Lance Kaplan, in Social Sensing, 2015
3.3 Basics of Bayesian Analysis
Bayesian analysis (also called Bayesian inference or Bayesian statistics) is a method of analysis in which Bayes' rule is used to update the probability estimate of a hypothesis as additional data are observed [89]. In this section, we review the basics of Bayesian analysis. Let us start with a simple example. Suppose a 50-year-old patient has a positive lab test for a cardiac disease, and assume the accuracy of this test is 95%. That is to say, given that a patient has a cardiac disease, the probability of a positive result on this test is 95%. In this scenario, the patient would like to know the probability that he/she has the cardiac disease given the positive test. However, the information available is the probability of testing positive if the patient has a cardiac disease, along with the fact that the patient had a positive test result.
Bayes' theorem offers a way for us to answer the above question. The basic theorem simply states:
$$p(B|A) = \frac{p(A|B)\,p(B)}{p(A)} \qquad (3.1)$$
where event B represents the event of interest (e.g., having a cardiac disease) and A represents an event related to B (e.g., a positive lab test). p(B|A) is the probability of event B given event A, and p(A) and p(B) are the unconditional marginal probabilities of events A and B, respectively. The proof of Bayes' theorem is straightforward: we know from probability rules that p(A,B) = p(A|B) × p(B) and p(B,A) = p(B|A) × p(A). Given the fact that p(A,B) = p(B,A), we can easily obtain:
$$p(B|A)\,p(A) = p(A|B)\,p(B) \qquad (3.2)$$
Dividing each side by p(A), we obtain Equation (3.1). Normally, p(A|B) is given in the context of the application (e.g., the lab test accuracy in our example) and p(B|A) is what we want to know (e.g., the probability of having a cardiac disease given a positive lab test result). p(B) is the unconditional probability of the event of interest, which is usually assumed to be known as prior knowledge (e.g., the probability of a 50-year-old having a cardiac disease in the population). p(A) is the marginal probability of A, which can be computed as the sum of the conditional probabilities of A under all possible events of B in its sample space Ω_B. In our example, the sample space of B is whether a patient has a cardiac disease or not. Formally, p(A) can be computed as:
$$p(A) = \sum_{B_i \in \Omega_B} p(A|B_i)\,p(B_i) \qquad (3.3)$$
Now, let us come back to our cardiac disease example to make the theorem more concrete. Suppose that, in addition to the 95% accuracy of the test, we also know that the false positive rate of the lab test is 40%, which means the test will produce a positive result with a probability of 40% given that the patient does not have the cardiac disease. Hence, we have two possible events for B: B1 represents the event that the patient has the cardiac disease, while B2 represents the negation. Given the accuracy and false positive rate of the test, we know p(A|B1) = 0.95 and p(A|B2) = 0.4. The prior knowledge p(B) is the marginal probability of a patient having a cardiac disease, knowing nothing beyond the fact that he/she is 50 years old. We call this information prior knowledge because it exists before the test. Suppose we know from previous research and statistics that the probability of a 50-year-old having a cardiac disease is 5% in the population. Using all of the above information, we can compute p(B|A) as follows:
$$p(B_1|A) = \frac{p(A|B_1)\,p(B_1)}{p(A|B_1)\,p(B_1) + p(A|B_2)\,p(B_2)} = \frac{0.95 \times 0.05}{0.95 \times 0.05 + 0.4 \times 0.95} \approx 0.111 \qquad (3.4)$$
Therefore, the probability that the patient has a cardiac disease given the positive test is only 0.111. In Bayesian terminology, we call this probability the posterior probability, because it is the probability estimated after we observe the data (i.e., the positive test result). The small posterior probability is somewhat counterintuitive for a test with so-called "95%" accuracy. However, if we look at Bayes' theorem, a few factors affect this probability. First is the relatively low prior probability of having the cardiac disease (i.e., 5%). Second is the relatively high false positive rate (i.e., 40%), whose effect is further enlarged by the high probability of not having a cardiac disease (i.e., 95%).
The next interesting question to ask is: what happens to Bayes' theorem if we have more data? Suppose that, in the first experiment, we obtain data A1. After that, we repeat the experiment and obtain new data A2. Let us assume that A1 and A2 are conditionally independent given B. We want to compute the posterior probability of B after two experiments, p(B|A1,A2). This can be done using the basic Bayes' theorem in Equation (3.1) as follows:
$$p(B|A_1,A_2) = \frac{p(A_2|B)\,p(B|A_1)}{p(A_2|A_1)} \qquad (3.5)$$
The above result tells us that we can use the posterior of the first experiment (i.e., p(B|A1)) as the prior for the second experiment. This process of repeating the experiment and recomputing the posterior probability of interest is the basic process of Bayesian analysis. From a Bayesian perspective, we start with some initial prior probability for some event of interest. Then we keep updating this prior probability with a posterior probability that is computed using the new information obtained from each experiment. In practice, we continue to collect data to examine a particular hypothesis. We do not start each time from scratch by ignoring all previously collected data. Instead, we use our previous analysis results as prior knowledge for the new experiment to examine the hypothesis.
Let us come back to our cardiac example again. Once the patient knows the limitation of the test, he/she may choose to repeat the test. Now the patient can use Equation (3.5) to compute the new p(B) after each repeated experiment. For example, if the patient repeats the test one more time, the new posterior probability of having a cardiac disease is:
$$p(B|A_1,A_2) = \frac{0.95 \times 0.111}{0.95 \times 0.111 + 0.4 \times 0.889} \approx 0.229 \qquad (3.6)$$
This result is still not very affirmative for claiming that the patient has a cardiac disease. However, if the patient keeps repeating the test and keeps getting positive results, his/her probability of having a cardiac disease increases. Subsequent positive results generate the probabilities: test 3 = 0.41352, test 4 = 0.62611, test 5 = 0.79908, test 6 = 0.90427, test 7 = 0.95733, test 8 = 0.98158, test 9 = 0.99216, test 10 = 0.99668. As we can see, after enough tests with successive positive results, the probability is high enough to make an affirmative conclusion. Note that the above is a toy example to demonstrate basic Bayesian analysis. It assumes that the results of all tests are statistically independent conditioned on B, which may not always be true in actual medical tests.
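The repeated-test sequence above is easy to reproduce in a few lines of Python (the function name is ours):

```python
def bayes_update(prior, p_pos_given_disease=0.95, p_pos_given_healthy=0.40):
    """Posterior probability of disease after one positive test (Eq. (3.1))."""
    num = p_pos_given_disease * prior
    return num / (num + p_pos_given_healthy * (1.0 - prior))

p = 0.05                        # prior: 5% of 50-year-olds have the disease
for test in range(1, 11):
    p = bayes_update(p)         # yesterday's posterior is today's prior
    print(f"test {test}: {p:.5f}")
# prints 0.11111, 0.22892, 0.41352, ..., 0.99668 as in the text
```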
Source: https://www.sciencedirect.com/science/article/pii/B9780128008676000030
Source: https://www.sciencedirect.com/topics/computer-science/probability-estimate