Using Bayesian belief networks and process metadata to address large scale data integration problems

Polak, John; Krishnan, Rajesh; Lindveld, Charles; Logie, Miles; Westlake, Andrew; Axhausen, Kay W.; Cornelis, Eric; Collop, Mike; Haupt, Thomas

dc.contributor.author

Polak, John

dc.contributor.author

Krishnan, Rajesh

dc.contributor.author

Lindveld, Charles

dc.contributor.author

Logie, Miles

dc.contributor.author

Westlake, Andrew

dc.contributor.author

Axhausen, Kay W.

dc.contributor.author

Cornelis, Eric

dc.contributor.author

Collop, Mike

dc.contributor.author

Haupt, Thomas

dc.date.accessioned

2020-10-14T05:01:42Z

dc.date.available

2017-06-11T17:01:48Z

dc.date.available

2020-10-14T04:58:43Z

dc.date.available

2020-10-14T05:01:42Z

dc.date.issued

2006

dc.identifier.isbn

1-905701-01-2

en_US

dc.identifier.isbn

978-1-905701-01-8

en_US

dc.identifier.uri

http://hdl.handle.net/20.500.11850/100095

dc.description.abstract

An increasingly important problem affecting many areas of transport planning, operations and management is the need to combine information from a variety of different data sources in order to provide the best possible estimate of certain parameters of interest. Problems of this type arise for a variety of reasons:
• No single data source contains sufficient information by itself.
• Multiple data sources naturally arise (e.g. through observations at different levels of spatial or temporal aggregation or by means of different survey methods), resulting in a need to reconcile potentially conflicting estimates.
• An existing set of data and parameter estimates needs to be updated or transferred when additional information becomes available.
Although methods have been developed for several specific instances of such problems in different areas of transport studies (e.g., O-D matrix estimation, synthetic population generation, network performance estimation), there does not yet exist a coherent set of general purpose methods for dealing with data combination problems. Moreover, due to the lack of appropriate general purpose techniques, data integration is in practice often undertaken in an ad hoc fashion, potentially resulting in a loss of efficiency and exposing the analysis to the risk of biases of various sorts. In this paper, we argue that addressing problems of this sort requires innovation in current practice at two levels. The first is in the methods used for the consistent description and management of transport data sources and modelling processes, and the second is in the methods used for the characterisation and propagation of data and modelling uncertainty during analysis. We describe the development of a general Bayesian framework for data integration problems of this sort and associated process metadata tools to support model application.
This framework is designed to enable the use of existing structural knowledge (in the form of existing transport models) and existing measurement knowledge (in the form of characterisations of sampling and non-sampling errors) to inform the data integration task. The transport system of interest is characterised by a state vector X, which might, e.g., represent an O-D matrix, a set of link flows, a set of link travel times, or combinations of some or all of these quantities. Full information about the system is provided by the probability distribution P(X). The data integration problem arises because, in general, complete observations of realisations of the state vector X will not be available. Instead, what can be observed are realisations of another stochastic vector Y, which is related to X. In the simplest case (a direct measurement), Y is related to X through a measurement process characterised by sampling and non-sampling variation. The vector Y may contain several direct measurements of the same underlying state X, arising for example from the application of different measurement methods. Direct measurements are complemented by indirect measurements, comprising observations of realisations of quantities that are distinct from but structurally related to the state vector X. Thus the vector Y will in general be a combination of direct and indirect measurements on the state vector X and will embody both measurement and structural information. Given this setup, the data integration problem is to determine the best estimate of P(X) given the observations Y and (possibly) a prior estimate of X. We show that this formulation of the data integration problem subsumes a number of existing problems in the literature.
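As a reading aid, the estimation target described above can be written as a standard application of Bayes' rule. The split of Y into a direct part Y_d and an indirect part Y_s, and the conditional independence of the two given X, are assumptions introduced here for illustration only; the abstract does not state them.

```latex
% Posterior over the state vector X given the observations Y
P(X \mid Y) \;=\; \frac{P(Y \mid X)\, P(X)}{\int P(Y \mid X')\, P(X')\, \mathrm{d}X'}
\;\propto\; P(Y \mid X)\, P(X)

% Illustrative assumption: Y = (Y_d, Y_s), with direct and indirect
% measurements conditionally independent given X
P(Y \mid X) \;=\; P(Y_d \mid X)\, P(Y_s \mid X)
```

Under this reading, P(Y_d | X) would carry the measurement knowledge (sampling and non-sampling error), P(Y_s | X) the structural knowledge, and P(X) any prior estimate of the state.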
Our approach to addressing this problem is to encode existing domain structural and measurement knowledge (which we term our general a priori model or GAPM) in the form of a Bayesian belief network (BBN) and to use the BBN representation of the GAPM to compute the posterior distribution of X conditional on Y. Except in very specific special cases, this posterior distribution will not be available in closed form, so the properties of the posterior must be determined empirically. There are two key advantages in this context to adopting a Bayesian approach. The first is that, in principle, it allows us to treat both observational information from sample surveys (and other data sources) and information encoded in structural and measurement modelling assumptions in the GAPM in a consistent fashion. The second is that recent developments in computational Bayesian techniques, in particular the emergence of Markov Chain Monte Carlo methods, provide a rich set of tools for sampling from complex posterior distributions (e.g., the Hastings-Metropolis and Gibbs samplers). Notwithstanding these recent developments, however, the practical implementation of this approach still poses considerable challenges, especially in dealing with the high dimensionality of typical transport network problems. A number of strategies for dealing with this problem are discussed, including: re-parameterisation of the state space, hierarchical geographical decomposition of the BBN, functional partitioning of the BBN, parallelisation of the standard samplers, and the development of special purpose samplers that exploit the characteristics of particular problems. Alongside the modelling work, the project has developed a metadata framework for characterising data inputs and model processing and for storing a complete audit trail, covering the specification and fitting of statistical models. This addresses key concerns regarding the provenance and reliability of model-based estimates.
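To make the sampling idea concrete, the following is a minimal sketch of a random-walk Metropolis sampler (the symmetric-proposal special case of Hastings-Metropolis) on a deliberately tiny problem: a single link flow X with a Gaussian prior and two direct counts with different error levels. It is not the authors' implementation. The function names, the Gaussian error model and all numbers are hypothetical, chosen so that the posterior is also available in closed form and the sampler can be checked against it.

```python
import math
import random


def log_posterior(x, observations, prior_mean, prior_sd):
    """Unnormalised log P(x | y): Gaussian prior plus Gaussian measurement errors."""
    log_p = -0.5 * ((x - prior_mean) / prior_sd) ** 2
    for y, sd in observations:
        log_p += -0.5 * ((y - x) / sd) ** 2
    return log_p


def metropolis(observations, prior_mean, prior_sd,
               n_samples=20000, burn_in=2000, step_sd=30.0, seed=0):
    """Random-walk Metropolis sampler for a scalar state x (illustrative only)."""
    rng = random.Random(seed)
    x = prior_mean
    current = log_posterior(x, observations, prior_mean, prior_sd)
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, observations, prior_mean, prior_sd)
        # Accept with probability min(1, exp(candidate - current)).
        # 1.0 - rng.random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < candidate - current:
            x, current = proposal, candidate
        if i >= burn_in:
            samples.append(x)
    return samples


def analytic_posterior(observations, prior_mean, prior_sd):
    """Closed-form Gaussian posterior (mean, sd) for this conjugate toy case."""
    precision = 1.0 / prior_sd ** 2
    weighted = prior_mean / prior_sd ** 2
    for y, sd in observations:
        precision += 1.0 / sd ** 2
        weighted += y / sd ** 2
    return weighted / precision, math.sqrt(1.0 / precision)


if __name__ == "__main__":
    # Hypothetical data: two counts of the same link flow (veh/h) with
    # different measurement standard deviations, and a vague prior.
    obs = [(1100.0, 50.0), (950.0, 100.0)]
    draws = metropolis(obs, prior_mean=1000.0, prior_sd=200.0)
    print("sampled mean :", sum(draws) / len(draws))
    print("analytic mean:", analytic_posterior(obs, 1000.0, 200.0)[0])
```

In this toy case the closed form makes the sampler unnecessary; its purpose is only to show the mechanics that remain usable when, as the abstract notes, no closed form exists. A one-dimensional random walk like this does not address the dimensionality issues that the abstract identifies as the main practical challenge.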
The structure of the paper is as follows. Following a brief introduction, the second section sets out a general description of the data integration problem and describes the Bayesian approach to addressing it. In the third section we discuss some practical considerations, particularly the problem of dimensionality, and present a number of special purpose samplers that have been developed to deal with common transport applications. The fourth section discusses the metadata issues, focusing in particular on the motivation for this approach and describing the process metadata tools developed to support the modelling. The fifth section illustrates the application of the methods in a number of case studies including (a) the use of data from household surveys, census records and network flow counts to produce augmented O-D matrices and (b) synthetic population generation. The paper concludes with a general discussion of potential directions for future research in this area.

en_US

dc.language.iso

en

en_US

dc.publisher

Association for European Transport

en_US

dc.title

Using Bayesian belief networks and process metadata to address large scale data integration problems

en_US

dc.type

Other Conference Item

ethz.book.title

Proceedings - European Transport Conference 2006: 18-20 September, Palais de la Musique et des Congrès, Strasbourg, France

en_US

ethz.size

1 p.

en_US

ethz.event

European Transport Conference 2006 (ETC 2006)

en_US

ethz.event.location

Strasbourg, France

en_US

ethz.event.date

September 18, 2006

en_US

ethz.publication.place

London

en_US

ethz.publication.status

published

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02115 - Dep. Bau, Umwelt und Geomatik / Dep. of Civil, Env. and Geomatic Eng.::02610 - Inst. f. Verkehrspl. u. Transportsyst. / Inst. Transport Planning and Systems::03521 - Axhausen, Kay W. (emeritus) / Axhausen, Kay W. (emeritus)

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02100 - Dep. Architektur / Dep. of Architecture::02655 - Netzwerk Stadt und Landschaft D-ARCH::02226 - NSL - Netzwerk Stadt und Landschaft / NSL - Network City and Landscape

en_US

ethz.leitzahl

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02100 - Dep. Architektur / Dep. of Architecture::02655 - Netzwerk Stadt u. Landschaft ARCH u BAUG / Network City and Landscape ARCH and BAUG

en_US

ethz.leitzahl.certified

ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02115 - Dep. Bau, Umwelt und Geomatik / Dep. of Civil, Env. and Geomatic Eng.::02610 - Inst. f. Verkehrspl. u. Transportsyst. / Inst. Transport Planning and Systems::03521 - Axhausen, Kay W. (emeritus) / Axhausen, Kay W. (emeritus)

ethz.date.deposited

2017-06-11T17:02:46Z

ethz.source

ECIT

ethz.identifier.importid

imp5936531a269b252880

ethz.ecitpid

pub:156699

ethz.eth

yes

en_US

ethz.availability

Metadata only

en_US

ethz.rosetta.installDate

2017-07-12T14:28:06Z

ethz.rosetta.lastUpdated

2024-02-02T12:18:44Z

ethz.rosetta.exportRequired

true

ethz.rosetta.versionExported

true

ethz.COinS

ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Using%20Baysian%20belief%20networks%20and%20process%20metadata%20to%20address%20large%20scale%20data%20integration%20problems&rft.date=2006&rft.au=Polak,%20John&Krishnan,%20Rajesh&Lindveld,%20Charles&Logie,%20Miles&Westlake,%20Andrew&rft.isbn=1-905701-01-2&978-1-905701-01-8&rft.genre=unknown&rft.btitle=Proceedings%20-%20European%20Transport%20Conference%202006:%2018-20%20September,%20Palais%20de%20la%20Musique%20et%20des%20Congr%C3%A8s,%20Strasbourg,%20France
