Using Bayesian belief networks and process metadata to address large scale data integration problems
dc.contributor.author
Polak, John
dc.contributor.author
Krishnan, Rajesh
dc.contributor.author
Lindveld, Charles
dc.contributor.author
Logie, Miles
dc.contributor.author
Westlake, Andrew
dc.contributor.author
Axhausen, Kay W.
dc.contributor.author
Cornelis, Eric
dc.contributor.author
Collop, Mike
dc.contributor.author
Haupt, Thomas
dc.date.accessioned
2020-10-14T05:01:42Z
dc.date.available
2017-06-11T17:01:48Z
dc.date.available
2020-10-14T04:58:43Z
dc.date.available
2020-10-14T05:01:42Z
dc.date.issued
2006
dc.identifier.isbn
1-905701-01-2
en_US
dc.identifier.isbn
978-1-905701-01-8
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/100095
dc.description.abstract
An increasingly important problem affecting many areas of transport planning, operations and management is the need to combine information from a variety of different data sources in order to provide the best possible estimate of certain parameters of interest. Problems of this type arise for a variety of reasons:
• No single data source contains sufficient information by itself.
• Multiple data sources naturally arise (e.g. through observations at different levels of spatial or temporal aggregation or by means of different survey methods), resulting in a need to reconcile potentially conflicting estimates.
• An existing set of data and parameter estimates needs to be updated or transferred when additional information becomes available.
Although methods have been developed for several specific instances of problems arising in different areas of transport studies (e.g. for O-D matrix estimation, synthetic population generation, network performance estimation), there does not yet exist a coherent set of general purpose methods for dealing with data combination problems. Moreover, owing to the lack of appropriate general purpose techniques, data integration is often undertaken in practice in an ad hoc fashion, potentially resulting in a loss of efficiency and exposing the analysis to the risk of biases of various sorts. In this paper, we argue that addressing problems of this sort requires innovation in current practice at two levels. The first is in the methods used for the consistent description and management of transport data sources and modelling processes; the second is in the methods used for the characterisation and propagation of data and modelling uncertainty during analysis. We describe the development of a general Bayesian framework for data integration problems of this sort and associated process metadata tools to support model application.
This framework is designed to enable the use of existing structural knowledge (in the form of existing transport models) and existing measurement knowledge (in the form of characterisations of sampling and non-sampling errors) to inform the data integration task. The transport system of interest is characterised by a state vector X, which might represent, for example, an O-D matrix, a set of link flows, a set of link travel times, or a combination of some or all of these quantities. Full information about the system is provided by the probability distribution P(X). The data integration problem arises because, in general, complete observations of realisations of the state vector X will not be available. Instead, what can be observed are realisations of another stochastic vector Y, which is related to X. In the simplest case, that of a direct measurement, Y is related to X through a measurement process characterised by sampling and non-sampling variation. The vector Y may contain several direct measurements of the same underlying state X, arising for example from the application of different measurement methods. Direct measurements are complemented by indirect measurements, comprising observations of realisations of quantities that are distinct from but structurally related to the state vector X. Thus the vector Y will in general be a combination of direct and indirect measurements on the state vector X and will embody both measurement and structural information. Given this setup, the data integration problem is to determine the best estimate of P(X) given the observations Y and (possibly) a prior estimate of X. We show that this formulation of the data integration problem subsumes a number of existing problems in the literature.
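As a reading aid, the setup described here can be written with Bayes' rule. This is only a sketch in notation that the abstract does not itself use: the factorisation of the likelihood over measurement components Y_1, ..., Y_K assumes that the components are conditionally independent given X, which the abstract does not state.

```latex
% Posterior for the state vector X given the observations Y
P(X \mid Y) \;=\; \frac{P(Y \mid X)\,P(X)}{\int P(Y \mid X')\,P(X')\,\mathrm{d}X'}
\;\propto\; P(Y \mid X)\,P(X)

% Assumed (not from the abstract): direct and indirect measurement
% components Y_1, \dots, Y_K are conditionally independent given X
P(Y \mid X) \;=\; \prod_{k=1}^{K} P(Y_k \mid X)
```

Here P(X) plays the role of the prior estimate of X, and each factor P(Y_k | X) stands for one measurement or structural relationship linking an observed quantity to the state.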
Our approach to addressing this problem is to encode existing domain structural and measurement knowledge (which we term our general a priori model or GAPM) in the form of a Bayesian belief network (BBN) and to use the BBN representation of the GAPM to compute the posterior distribution of X conditional on Y. Except in very specific special cases, this posterior distribution will not be available in closed form, so the properties of the posterior must be determined empirically. There are two key advantages in this context to adopting a Bayesian approach. The first is that, in principle, it allows us to treat both observational information from sample surveys (and other data sources) and information encoded in structural and measurement modelling assumptions in the GAPM in a consistent fashion. The second is that recent developments in computational Bayesian techniques, in particular the emergence of Markov Chain Monte Carlo methods, provide a rich set of tools to enable sampling from complex posterior distributions (e.g. the Hastings-Metropolis and Gibbs samplers). Notwithstanding these recent developments, however, the practical implementation of this approach still poses considerable challenges, especially in dealing with the high dimensionality of typical transport network problems. A number of strategies for dealing with this problem are discussed, including: re-parameterisation of the state space, hierarchical geographical decomposition of the BBN, functional partitioning of the BBN, parallelisation of the standard samplers, and the development of special purpose samplers that exploit the characteristics of particular problems. Alongside the modelling work, the project has developed a metadata framework for characterising data inputs and model processing and storing a complete audit trail, covering the specification and fitting of statistical models. This addresses key concerns regarding the provenance and reliability of model-based estimates.
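The sampling idea can be illustrated with a minimal random-walk Metropolis sketch in Python. It is not the authors' BBN or any of their special purpose samplers. Everything in it is an assumption made for illustration: a single scalar state (one link flow), a Gaussian prior, two direct measurements with Gaussian errors, and all the numbers. This toy case is deliberately one of the special cases with a closed-form posterior, so the sampled mean can be checked against the exact answer.

```python
import math
import random


def log_posterior(x, prior_mean, prior_sd, observations):
    """Unnormalised log posterior of a scalar state x.

    observations is a list of (y, sd) pairs: hypothetical direct
    measurements of x with independent Gaussian errors.
    """
    logp = -0.5 * ((x - prior_mean) / prior_sd) ** 2
    for y, sd in observations:
        logp += -0.5 * ((y - x) / sd) ** 2
    return logp


def gaussian_posterior(prior_mean, prior_sd, observations):
    """Exact posterior mean and sd for the conjugate Gaussian toy case."""
    precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_mean / prior_sd ** 2
    for y, sd in observations:
        precision += 1.0 / sd ** 2
        weighted_sum += y / sd ** 2
    return weighted_sum / precision, math.sqrt(1.0 / precision)


def metropolis(log_target, x0, step_sd, n_samples, burn_in, rng):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    x = x0
    lp = log_target(x)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step_sd)
        lp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        if i >= burn_in:
            samples.append(x)
    return samples


if __name__ == "__main__":
    # Illustrative numbers only: prior of 1000 veh/h (sd 200) and two
    # counts of the same link flow with different error levels.
    obs = [(1100.0, 100.0), (950.0, 50.0)]
    rng = random.Random(0)
    draws = metropolis(
        lambda x: log_posterior(x, 1000.0, 200.0, obs),
        x0=1000.0, step_sd=50.0, n_samples=20000, burn_in=2000, rng=rng,
    )
    exact_mean, exact_sd = gaussian_posterior(1000.0, 200.0, obs)
    print("sampled mean:", sum(draws) / len(draws))
    print("exact mean:", exact_mean, "exact sd:", exact_sd)
```

In a realistic network problem the state would be a high-dimensional vector and the posterior would have no closed form, which is where the dimensionality issues raised in the abstract come in.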
The structure of the paper is as follows. Following a brief introduction, the second section sets out a general description of the data integration problem and describes the Bayesian approach to addressing it. In the third section we discuss some practical considerations, particularly the problem of dimensionality, and present a number of special purpose samplers that have been developed to deal with common transport applications. The fourth section discusses the metadata issues, focusing in particular on the motivation for this approach and describing the process metadata tools developed to support the modelling. The fifth section illustrates the application of the methods in a number of case studies, including (a) the use of data from household surveys, census records and network flow counts to produce augmented O-D matrices and (b) synthetic population generation. The paper concludes with a general discussion of potential directions for future research in this area.
en_US
dc.language.iso
en
en_US
dc.publisher
Association for European Transport
en_US
dc.title
Using Bayesian belief networks and process metadata to address large scale data integration problems
en_US
dc.type
Other Conference Item
ethz.book.title
Proceedings - European Transport Conference 2006: 18-20 September, Palais de la Musique et des Congrès, Strasbourg, France
en_US
ethz.size
1 p.
en_US
ethz.event
European Transport Conference 2006 (ETC 2006)
en_US
ethz.event.location
Strasbourg, France
en_US
ethz.event.date
September 18, 2006
en_US
ethz.publication.place
London
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02115 - Dep. Bau, Umwelt und Geomatik / Dep. of Civil, Env. and Geomatic Eng.::02610 - Inst. f. Verkehrspl. u. Transportsyst. / Inst. Transport Planning and Systems::03521 - Axhausen, Kay W. (emeritus) / Axhausen, Kay W. (emeritus)
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02100 - Dep. Architektur / Dep. of Architecture::02655 - Netzwerk Stadt und Landschaft D-ARCH::02226 - NSL - Netzwerk Stadt und Landschaft / NSL - Network City and Landscape
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02100 - Dep. Architektur / Dep. of Architecture::02655 - Netzwerk Stadt u. Landschaft ARCH u BAUG / Network City and Landscape ARCH and BAUG
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02115 - Dep. Bau, Umwelt und Geomatik / Dep. of Civil, Env. and Geomatic Eng.::02610 - Inst. f. Verkehrspl. u. Transportsyst. / Inst. Transport Planning and Systems::03521 - Axhausen, Kay W. (emeritus) / Axhausen, Kay W. (emeritus)
ethz.date.deposited
2017-06-11T17:02:46Z
ethz.source
ECIT
ethz.identifier.importid
imp5936531a269b252880
ethz.ecitpid
pub:156699
ethz.eth
yes
en_US
ethz.availability
Metadata only
en_US
ethz.rosetta.installDate
2017-07-12T14:28:06Z
ethz.rosetta.lastUpdated
2024-02-02T12:18:44Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Using%20Baysian%20belief%20networks%20and%20process%20metadata%20to%20address%20large%20scale%20data%20integration%20problems&rft.date=2006&rft.au=Polak,%20John&Krishnan,%20Rajesh&Lindveld,%20Charles&Logie,%20Miles&Westlake,%20Andrew&rft.isbn=1-905701-01-2&978-1-905701-01-8&rft.genre=unknown&rft.btitle=Proceedings%20-%20European%20Transport%20Conference%202006:%2018-20%20September,%20Palais%20de%20la%20Musique%20et%20des%20Congr%C3%A8s,%20Strasbourg,%20France
Files for this entry
Files | Size | Format | Open in viewer |
---|---|---|---|
There are no files for this entry. |