Paper on Computer Science: The Generation of Aggregate Data
Abstract
Aggregate data are produced through decision support systems, which managers employ in their decision-making processes and in improving their organizations' operations. The data residing in transactional databases and data warehouses are rarely ideal, which can degrade decision quality. If we know how data errors affect the data, we can approach decisions with better insight and more accurate information. I hypothesize that different data quality dimensions have different effects on relational aggregate functions. The effect of data errors on the scalar values returned by relational aggregate functions is estimated through simulation.
Introduction
This research explores the improvement of decision making in data warehouses. Technologies such as decision support systems (DSS) assist in solving several kinds of problems, particularly those that are based on quantitative data and/or are strategic in scope. Managers make use of aggregate data (summary information) retrieved from their organizations' databases and data warehouses to make operational or strategic decisions and to improve their organizations' operations. Most databases and data warehouses contain data errors introduced either by people or by system failures, and decision quality suffers as long as erroneous data remain in the database.

Finding and correcting data errors can be costly, resource-intensive, and often impractical, so the need for error-free data may be replaced by knowledge gained from an assessment of information quality. A manager may not be able to obtain ideal information from aggregate analysis, but some knowledge about that information still helps in weighing business scenarios and taking suitable action. With such support for decision making, a manager can raise profits, reduce risk, and estimate the quality of information (Dalvi and Suciu, 2004). Consider a scenario in which a manager retrieves the total count of active customers who have placed orders for a certain product in the past and uses it to plan inventory stock and product distribution. The correctness and completeness of the customer count directly influence the manager's forecasting and planning decisions, which might otherwise lead to over- or under-production and wrong in-stock inventory levels. With complete and correct information, a manager can adjust the plan straightforwardly.

The term data quality is a subjective notion and depends on the context and goals of the information consumers. Frequently, subjective qualitative measures, such as low, medium, and high, are used to specify the quality of data. However, users may not share the same view of what low- or high-quality data means. The examples above illustrate that quantitative metrics for determining information quality would lead to more objective judgments and decisions. Hence, the main goal of this work is to provide a framework in which the quality characteristics of aggregate data can be quantified precisely.
Need for Research
A number of methods have been developed for enhancing decision making, but each has drawbacks that directly affect decision quality. The first method considers a single vendor and a single buyer in a supply chain. Since the buyer determines the vendor's compliance and yield rates, this influences the vendor-buyer decisions about the production lot size and the number of shipments delivered from the vendor to the buyer. It follows that these decisions must be determined concurrently in order to control the supply chain's total cost. A number of unequal-sized product shipments are delivered from the vendor to the buyer. Furthermore, every outgoing item is inspected, and each item failing to meet a lower specification limit is reprocessed. The goal is to achieve the greatest profit and decrease the cost of products. Here we take one order at a time and ship it; following this method, the business does not expand. In the other method, the product is delivered to multiple buyers from one vendor. The vendor produces the goods at a finite rate, and customer demand occurs at each buyer at a constant rate. The goal is to decide the order quantities at the buyers and the production and shipment schedule at the vendor in order to minimize the average total cost per unit time. In both methods, however, data errors occasionally occur, caused either by people or by system failures. This research concerns the treatment of data errors to improve an organization's operations. Hence, the main goal of this work is to provide a framework in which the quality characteristics of aggregate data can be quantified precisely.
Research Objectives
The objectives of this research are as follows:
Design of well-organized techniques for the data warehouse.
Better utilization of IT infrastructure.
Driving down costs.
Flexibility in decision making in the data warehouse.
Materials and Methods
An important approach in developing a new technique is quick functionality, followed by iterations over time that allow the system to grow as more information is gathered, in less time and in a proficient way. It makes sense to have a sensible set of requirements gathered before starting the design of the technique. The new technique for determining requirements for a data warehouse system is based on business dimensions (Fu and Rajasekaran, 2000). It flows out of the users' need to base their analysis on business dimensions. The new approach features the basic measurements and the business dimensions along which the users examine these basic measurements. Using the new technique, we come up with the measurements and the appropriate dimensions that must be captured and kept in the data warehouse.
System Analysis
A data warehouse is an information delivery system for providing information for strategic decision making. It is not a system for running day-to-day business.
Who are the users that can make use of the information in the data warehouse?
Where do you go to gather the requirements?
Generally, the users of the data warehouse can be categorized as follows:
Senior executives (including the sponsors)
Key departmental managers
Business analysts
Operational systems DBA's
Others nominated by the above.
Executives will give us a sense of direction and scope for the data warehouse. They are the ones closely involved in the focus area. The key departmental managers are those who report to the executives in the area of focus. Business analysts are the ones who prepare reports and analyses for the executives and managers. The operational system DBAs and IT applications staff give us information about the data sources for the warehouse.
The requirements we need to gather are:
Data elements: fact classes, dimensions.
Recording of data in terms of time.
Data extracts from source system.
Business rules: attributes, ranges, domains, operational records.
The Star Schema
The users can easily envision the answers to these questions. When a query is made against the data warehouse, the results are produced by combining or joining one or more dimension tables with the fact table. A particular row in the fact table is associated with rows in each dimension table. A common type of analysis is drilling down from summary numbers to the details at lower levels. Users can easily perform all of this drill-down analysis by traversing the STAR schema, as the sketch below illustrates.
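As a concrete illustration, here is a minimal star-schema sketch using Python's built-in sqlite3 module. The table and column names (STORE_DIM, PRODUCT_DIM, SALES_FACT, Region, Category) are illustrative assumptions, not the actual schema of this system.

import sqlite3

# Build a tiny star schema in memory: one fact table joined to two
# dimension tables (all names here are hypothetical).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE STORE_DIM   (Store_No TEXT PRIMARY KEY, Region TEXT);
    CREATE TABLE PRODUCT_DIM (Prod_ID  TEXT PRIMARY KEY, Category TEXT);
    CREATE TABLE SALES_FACT  (Store_No TEXT, Prod_ID TEXT,
                              Sales_Date TEXT, QTY INTEGER, Sales_Amt REAL);
    INSERT INTO STORE_DIM   VALUES ('S1', 'North'), ('S2', 'South');
    INSERT INTO PRODUCT_DIM VALUES ('P1', 'Hardware'), ('P2', 'Software');
    INSERT INTO SALES_FACT  VALUES
        ('S1', 'P1', '2008-01-03', 10, 3000),
        ('S1', 'P2', '2008-01-04', 20, 6000),
        ('S2', 'P1', '2008-01-05', 30, 9000);
""")

# Summary level: total sales per region (fact table joined to one dimension).
for row in cur.execute("""
        SELECT d.Region, SUM(f.Sales_Amt)
        FROM SALES_FACT f JOIN STORE_DIM d ON f.Store_No = d.Store_No
        GROUP BY d.Region"""):
    print(row)

# Drill down: the same measure at a lower level (region -> category ->
# product), joining the fact table to both dimension tables.
for row in cur.execute("""
        SELECT d.Region, p.Category, f.Prod_ID, SUM(f.Sales_Amt)
        FROM SALES_FACT f
        JOIN STORE_DIM d   ON f.Store_No = d.Store_No
        JOIN PRODUCT_DIM p ON f.Prod_ID  = p.Prod_ID
        GROUP BY d.Region, p.Category, f.Prod_ID"""):
    print(row)

Each drill-down step simply adds a dimension attribute to the SELECT and GROUP BY lists; the star layout keeps every such query a one-join-per-dimension operation.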
Data Structure
The disparate source systems have the following relational data structures:
Sales Management System
Hardware Requirements
For the planned system, the minimum hardware requirements are:
Server machine: Intel with dual processor
Hard disk: 160 GB
RAM: 2 GB or above
Software Requirements
Operating system: Windows Server 2000 or above
Database: Oracle 10g
Application server: Oracle 10g
Developer Suite: 6i or above
SPSS
Results and Discussion
Nulls, or missing values, have been extensively studied from the perspective of information completeness within relational databases. In some studies, nulls were treated as inaccurate values at the tuple level, but in this study I treat nulls as incomplete values and use the following two main interpretations of nulls under the Closed World Assumption: "a) the value exists, but it is not known, and b) the value does not exist" (Eppler, 2003). The first null value type, referred to as an existential null, represents an attribute value that is not known and thus missing from the database. A phone extension that has not been entered in the database is an example of an existential null. The second type of null value, called a non-existential null, represents an attribute value that does not exist in the real world.
Relational Aggregate Functions
The evaluation and estimation of relational aggregate functions over incomplete data has been addressed by several studies in the past. One line of work proposed a structure for evaluating aggregate functions over imprecise data, where nulls may take a partial value from a set of possible values, exactly one of which is the true value. Theoretical frameworks included one for set-valued aggregate functions, where attributes take their values from a set of values, but it did not present a formal treatment of nulls (Sarkar and Jacob, 2004). Algorithms for statistical estimation of the COUNT function based on sampling assumed that the relations do not contain incomplete information, i.e., nulls (Montgomery, 2001). My study differs from these prior works in that I consider an atomic value for nulls in the relation and study their effect on the completeness and accuracy of the scalar values returned by aggregate functions.
Ontological foundations and basic definitions
I consider relations that have an appropriately defined identifier within the relational data model. By identifier, I mean a single attribute or a composite set of attributes that uniquely identifies a tuple within a relation. Further, I use the Open World Assumption, which states that there may be tuples belonging to a relation that are not present in that relation (Shankaranarayan et al., 2003), to introduce the definitions.
Let T be a conceptual relation that contains all instances of a real-world entity. All attribute values of T are, by definition, accurate and complete. Instances of T are captured into an actually stored relation S where, due to some possible error-generating mechanisms, some data values in S become inaccurate, become existential nulls, or are not captured into S at all. If an inaccuracy occurs in any of the identifier attributes, then the non-identifier attribute values no longer represent a valid piece of information about that particular identifier. In such cases, I have mismember values, which do not belong to S but are present in it. Those instances of T that have not been captured into S form the incomplete data set, denoted by SC. To demonstrate these relations, consider a conceptual sales transactions relation as shown in table 1.
The instances of T are captured into relation S, where some attribute values have become invalid as shown in table 2. A number of instances of T that are not captured in S form the incomplete data set SC, as shown in table 3. The set of identifier attributes is {Store_No, Prod_ID, Sales_Date} and the set of non-identifier attributes is {QTY, Sales_Amt}. In table 2, the inaccurate and existential null values are shown with a grey background, and the 'Sales_Amt Status' column is shown for explanatory purposes and is not actually stored in S.
Let K = {k1, k2, ..., km} and Q = {q1, q2, ..., qn} be the sets of identifier and non-identifier attributes of S, respectively. I denote an identifier attribute value by vk and a non-identifier attribute value by vq. Further, I use the letters A, I, N, M, and C to assign an accurate, inaccurate, existential null, mismember, or incomplete status to any attribute value, and I shall use '~' for status assignment. Let t be an arbitrary tuple in S (table 2) for which we identify the following attribute value types.
Accurate identifier: An attribute value in the identifier set is defined as accurate if all attribute values that compose the identifier are accurate. An example of an accurate identifier attribute value is {Store_No = 'S1', Prod_ID = 'P1', Sales_Date = '03-Jan-08'}. The values for all attributes composing the identifier are accurate.
Mismember identifier: An attribute value in the identifier set is defined as mismember if the attribute value itself is inaccurate or at least one of the other attribute values composing the identifier is inaccurate. An example of a mismember identifier attribute value is {Store_No = 'S5', Prod_ID = 'P2', Sales_Date = '10-Sep-08'}. The 'P2' value for Prod_ID is inaccurate, causing misidentification of the facts represented by this tuple. In other words, the sales transaction of 7000 for a quantity of 40 of this product in store 'S5' on sales date '10-Sep-08' never occurred. This tuple does not belong to the relation.
Accurate non-identifier: A non-identifier attribute value is defined as accurate if the attribute value itself, along with all the attribute values composing the identifier, is accurate. Examples of accurate non-identifier attribute values are Sales_Amt = {3000, 6000, 9000}, which have accurate values for all their identifier values.
Inaccurate non-identifier: A non-identifier attribute value is defined as inaccurate if the attribute value itself is inaccurate and all the attribute values composing the identifier are accurate. An example of an inaccurate non-identifier attribute value is Sales_Amt = 8500, which has an inaccurate value (i.e., the actual value was 10000 but was incorrectly recorded as 8500) while all its identifier values are accurate.
Mismember non-identifier: A non-identifier attribute value is defined as mismember if at least one of the attribute values composing the identifier is inaccurate. Note that this applies regardless of whether the non-identifier attribute value itself is accurate, inaccurate, or null. An example of a mismember non-identifier attribute value is Sales_Amt = 7000, for which one of its identifier values (i.e., Prod_ID = 'P2') is inaccurate. The sales amount, although accurate, belonged to another product but was incorrectly recorded for 'P2'. Since product 'P2' did not have a 7000 sale in store 'S5' on '10-Sep-08', the whole tuple does not belong to the relation, and thus all its attribute values are mismembers.
Incomplete non-identifier: A non-identifier attribute value is defined as incomplete if the attribute value is an existential null and all the attribute values composing the identifier are accurate. An example of an incomplete non-identifier attribute value is Sales_Amt = NULL, which has an existential null value (i.e., its actual value of 4000 exists but has not been recorded, making it unavailable at the time of query execution) while all its identifier values are accurate. Further, all attribute values (identifier and non-identifier) in the incomplete data set SC are by definition incomplete but accurate (e.g., Sales_Amt = 6000 in table 3).
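To make these value types concrete, here is a small sketch that assigns the status letters programmatically, assuming a stored tuple can be checked against its conceptual (true) counterpart in T. The function, the dictionary layout, and the example date are mine, not the paper's.

# Hypothetical sketch: classify attribute values in a stored tuple against
# the conceptual (true) tuple. Status letters follow the paper:
# A = accurate, I = inaccurate, N = existential null, M = mismember.
IDENTIFIER = ("Store_No", "Prod_ID", "Sales_Date")
NON_IDENTIFIER = ("QTY", "Sales_Amt")

def classify(stored: dict, true: dict) -> dict:
    """Return a status letter for each attribute of the stored tuple."""
    # A mismember tuple has at least one inaccurate identifier value;
    # in that case every attribute value in the tuple is a mismember.
    if any(stored[k] != true[k] for k in IDENTIFIER):
        return {a: "M" for a in IDENTIFIER + NON_IDENTIFIER}
    status = {k: "A" for k in IDENTIFIER}
    for q in NON_IDENTIFIER:
        if stored[q] is None:          # existential null
            status[q] = "N"
        elif stored[q] != true[q]:     # wrong value under a correct identifier
            status[q] = "I"
        else:
            status[q] = "A"
    return status

# Example mirroring the inaccurate case above (8500 recorded, 10000 actual).
stored = {"Store_No": "S1", "Prod_ID": "P1", "Sales_Date": "05-Jan-08",
          "QTY": 50, "Sales_Amt": 8500}
true = dict(stored, Sales_Amt=10000)
print(classify(stored, true))   # Sales_Amt -> 'I', all other attributes -> 'A'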
Estimations
It is easily seen that the presence of inaccurate, mismember, null, and incomplete attribute values has a direct effect on aggregate values. For instance, consider the following query on the Sales table (table 2):
SELECT SUM(Sales_Amt) FROM Sales WHERE Prod_ID='P1';
The query returns 21500 for the aggregate sum value. This, however, is not the true value because:
The inaccurate value 7500 deviates from the actual value of 9000;
The mismember value 6000 contributes to this aggregate whereas it should not;
The existential null value does not contribute to the sum whereas its true value of 4000 should;
The values of 5000 and 9000 in the incomplete data set do not contribute to the sum whereas they should.
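Collected into a single expression, these corrections amount to the following adjustment of the returned sum (a sketch of the verbal rule above; the index sets I, M, and N for the inaccurate, mismember, and existential-null values are notation introduced here):

\[
\mathrm{SUM}_T \;=\; \mathrm{SUM}_S \;+\; \sum_{i \in I}\bigl(v_i^{\mathrm{true}} - v_i^{\mathrm{rec}}\bigr) \;-\; \sum_{m \in M} v_m \;+\; \sum_{n \in N} v_n^{\mathrm{true}} \;+\; \sum_{c \in S^C} v_c
\]

where SUM_S is the value the query actually returns over S and the last term ranges over the incomplete data set.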
Accounting for all the accurate values, the true aggregate sum value for this query is 28000, a substantial deviation from the returned result (see Figure 1). It is, therefore, necessary to obtain the number of inaccurate, existential null, mismember, and incomplete values for each attribute in order to adjust the query result for the errors these values cause. Auditing every single value in a database or data warehouse table, which usually contains very large numbers of rows and attributes, is expensive and unrealistic. Instead, sampling strategies can be used to estimate these errors, as described next.
Sampling strategies
I distinguish between sampling plans for the identifier and non-identifier attributes, and explicitly assume that the sampled data can be verified against their actual values. Note that in a relational model, identifiers cannot take null values, and all sampling plans are defined for a non-empty S. Further, in this study I focus on point estimates, which simplify the analysis. At the same time, I acknowledge that interval estimates might be very valuable in decision-making processes, but I leave them for future research.
Count
The COUNT function provides the number of tuples in S or the number of non-null values in a single attribute. Without considering the mismembers and the incomplete data set, the COUNT function returns |S|. When COUNT is used to retrieve the cardinality of S, or operates on one of the identifier attributes, the true count, denoted COUNT_T, is the number of tuples with accurate identifiers plus the cardinality of the incomplete set: COUNT_T = n_A + |SC|, where n_A denotes the number of tuples with accurate identifiers. When COUNT operates on one of the non-identifier attributes, the true count is the sum of the accurate, inaccurate, and incomplete (i.e., existential null and incomplete data set) values: COUNT_T = n_A + n_I + n_N + |SC|, where n_A, n_I, and n_N now denote the numbers of accurate, inaccurate, and existential null values of that attribute.
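As a quick numeric illustration of the two COUNT_T formulas (all counts below are made-up values, not taken from the paper's tables):

# Hypothetical counts, not from the paper's tables.
acc_id = 1900                     # tuples whose identifier values are all accurate
n_A, n_I, n_N = 1800, 500, 100    # accurate / inaccurate / existential-null
                                  # values of one non-identifier attribute
sc = 250                          # |SC|: tuples of T never captured into S

count_t_identifier = acc_id + sc                   # COUNT on an identifier
count_t_non_identifier = n_A + n_I + n_N + sc      # COUNT on a non-identifier
print(count_t_identifier, count_t_non_identifier)  # 2150, 2650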
Max and Min
The MAX function operates on a single attribute or a substring. When MAX operates on one of the identifier attributes, we estimate the probability that the returned value is accurate. When MAX operates on a non-identifier attribute, we additionally estimate the probability that the true maximum has been replaced by an existential null or resides in the incomplete data set. Analogously, all the MAX function expressions apply to the MIN function.
Sum
The distributions of attribute value types within their underlying domains affect the estimation of the true SUM value. The attribute value types might have a uniform or skewed distribution depending on the error-generating processes. I make the following assumptions about the error-generating processes.
Assumption 1: The error-generating processes that cause errors in the values of each attribute are not systematic. This assumption means that I do not have a priori knowledge about the causes that produce errors in the data, because if I had such knowledge then I would simply eliminate them.
Assumption 2: On average, data loads, updates, and refresh cycles do not affect the proportions of value types in attributes. That is, the error-generating processes produce the same proportions of errors every time data are captured in the relation. In other words, I do not expect the error proportions to change unless I have specific knowledge about a change in or elimination of the error-generating processes.
Uniform attribute value distribution
In order to estimate the exact SUM value on a particular attribute (either identifier or non-identifier), I take a representative sample of attribute values and validate the average of the accurate values. This average will vary each time the sampling is repeated. Thus, I repeat the sampling enough times that the average values have a Normal distribution according to the central limit theorem. The average of all these averages is then a converged value to be used as the average of correct values. I denote the converged average of correct values for the identifier and non-identifier attributes respectively. When SUM operates on an identifier attribute, the values in the incomplete data set are not directly accessible and do not contribute to the sum. Similarly, when SUM operates on a non-identifier attribute, the estimate of the true SUM value can be obtained by substituting the converged average for the inaccurate, existential null, and incomplete values.
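A compact sketch of this estimator under Assumptions 1 and 2 follows; the population sizes and the substitution step are illustrative assumptions, not the paper's code.

import random
import statistics

random.seed(42)

# Synthetic attribute values tagged with the paper's status letters:
# A = accurate, I = inaccurate, N = existential null (sizes are made up).
population = (
    [("A", random.uniform(1000, 10000)) for _ in range(2000)] +
    [("I", random.uniform(1000, 10000)) for _ in range(400)] +
    [("N", None) for _ in range(100)]
)

# Repeat the sampling; keep the mean of the accurate values in each sample.
sample_means = []
for _ in range(100):
    sample = random.sample(population, 500)
    sample_means.append(statistics.mean(v for s, v in sample if s == "A"))

# By the central limit theorem the sample means are approximately Normal;
# their mean serves as the converged average of correct values.
converged_avg = statistics.mean(sample_means)

# Estimated true SUM: keep the accurate values and substitute the converged
# average for every inaccurate and existential-null value (incomplete-set
# values, when their number is known, are substituted the same way).
n_substituted = sum(1 for s, _ in population if s != "A")
est_sum = sum(v for s, v in population if s == "A") + n_substituted * converged_avg
print(f"converged average {converged_avg:.0f}, estimated true SUM {est_sum:.0f}")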
Average
The estimated true value returned by the AVERAGE function on an identifier (non-identifier) attribute is given by the ratio of the estimated true SUM and the estimated true COUNT: AVERAGE_T = SUM_T / COUNT_T.
Numerical example with simulation
I demonstrate the methodology using a simulated example. I consider a conceptual relation T and its stored and incomplete data sets similar to those shown in tables 1-3. I populated T with 2500 tuples and S with 2575 tuples (implying 125 mismembers), and included 250 tuples in SC. The values of Sales_Amt for 'P1' were generated using a random number generator, drawing values uniformly from the range [1000, 10000]. I ran the following query
SELECT SUM(Sales_Amt) FROM Sales WHERE Prod_ID='P1';
on T, where the actual sum returned is 14,629,139 and the average returned is 5418 (see Figure 2). Next, I randomly selected and altered 500 values of Sales_Amt in S and labeled them as inaccurate, randomly selected 250 values and marked them as mismembers, and converted 100 values to existential nulls. The size of the random sample of values to be taken from S was calculated to be 465 (Thompson, 2002), but I chose to sample a round number of 500 values and repeated the sampling 100 times. This sample returned a total sum of 2768901 over the accurate, inaccurate, mismember, and existential null values, with an average of 5127. The number of accurate identifiers in the sample was 465, with a total of 2555462 and an average of 5495. The averages of the accurate values were found to lie in the range [5000, 6000], and the converged average over this range was 5495 (see Figure 3).
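The following is a reproducibility sketch of this experiment in Python, a hypothetical re-implementation from the description above rather than the original simulation code; the seed and random draws differ, so the printed sums will not match the figures quoted exactly.

import random
import statistics

random.seed(7)

# Parameters follow the description above: 2500 tuples in T, 250 incomplete,
# 250 mismembers, 500 inaccurate, 100 existential nulls, samples of 500
# repeated 100 times.
N_T, N_SC, N_MISM, N_INACC, N_NULL = 2500, 250, 250, 500, 100

true_vals = [random.uniform(1000, 10000) for _ in range(N_T)]   # relation T
captured = true_vals[N_SC:]     # the first N_SC tuples form SC (never in S)

S = [["A", v] for v in captured]                                  # stored S
S += [["M", random.uniform(1000, 10000)] for _ in range(N_MISM)]  # mismembers
for row in random.sample([r for r in S if r[0] == "A"], N_INACC):
    row[0], row[1] = "I", random.uniform(1000, 10000)   # corrupt the value
for row in random.sample([r for r in S if r[0] == "A"], N_NULL):
    row[0], row[1] = "N", None                          # existential nulls

true_sum = sum(true_vals)                            # ground truth over T
naive_sum = sum(v for _, v in S if v is not None)    # what the query returns

# Repeated sampling: converged average of the accurate values (CLT).
means = [statistics.mean(v for s, v in random.sample(S, 500) if s == "A")
         for _ in range(100)]
converged_avg = statistics.mean(means)

# Estimate: keep the accurate values, substitute the converged average for
# the inaccurate, null, and incomplete values; exclude mismembers entirely.
est_sum = (sum(v for s, v in S if s == "A")
           + (N_INACC + N_NULL + N_SC) * converged_avg)
print(f"true {true_sum:.0f}  naive {naive_sum:.0f}  estimated {est_sum:.0f}")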
The incomplete and mismember data sets have the greatest effect on the sum value and, therefore, on the effectiveness of the estimation. The greater the number of mismembers and incomplete tuples, the more valuable the estimated true value of the sum becomes, since the sum returned by the query does not reflect the effect of the incomplete and mismember values, but the estimated value does.
Summary
In this work, I have argued that aggregated information used by managers in their decision-making processes may suffer from data errors that affect decision quality. I have provided a framework with proper definitions of attribute value types (i.e., accurate, inaccurate, mismember, and incomplete) within the relational data model. I then presented sampling strategies to find maximum likelihood estimates of the proportions of these value types in the whole data population residing in databases or data warehouses. These estimates were used in my metrics to estimate the true values of the scalars returned by the relational aggregate functions. Finally, I demonstrated the methodology with numerical simulation examples. The simulation results show that the estimations of the true values are accurate enough for most practical purposes. This study used unbiased point estimates to obtain the quality metrics, but it would also be interesting to use interval estimates, which provide the standard deviation for the sampled averages, and to explore their effects on the returned scalar values. The study can further be extended to estimate the aggregations returned by the widely used GROUP BY clause, partial sums, and OLAP functions such as Roll Up and Drill Down.
References
Dalvi N. and D. Suciu (2004) Efficient Query Evaluation on Probabilistic Databases. In Proc. VLDB, pp. 864-875.
Eppler M.J. (2003) Managing Information Quality, Springer-Verlag, Berlin.
Fu L. and S. Rajasekaran (2000) Division of Computer Science, Department of Mathematical Sciences, University of North Carolina at Greensboro, Bryan 383, Greensboro, NC 27402-6170, USA.
Montgomery D.C. (2001) Introduction to Statistical Quality Control, 4th ed., Wiley.
Sarkar S. and V.S. Jacob (2004) Assessing data quality for information products: impact of selection, projection, and Cartesian product, Management Science.
Shankaranarayan G., M. Ziad and R.Y. Wang (2003) Managing data quality in dynamic decision environments: an information product approach, Journal of Database Management 14(4), 14-32.
Thompson S.K. (2002) Sampling, 2nd ed., Wiley-Interscience, New York.