I will try to summarize the issue as best as I can, but unfortunately I am not able to copy some data to illustrate it. I have a set of Parquet files partitioned by date, containing about 50 million records per day with about 800 columns. I have absolutely no issue querying this external table (through Hive) with Presto.

I am now trying to create a new table with Presto that takes the original Parquet files and keeps the partitioning, but adds buckets and a bloom filter. This is an internal Hive table, and here is the end of the CREATE TABLE in Hive:

[screenshot: end of the CREATE TABLE statement]

I then inserted data from my original table (Parquet files) with Presto, and this does not result in any error:

```sql
INSERT INTO xxxx_prod_log_orc SELECT * FROM xxxx_prod_log_parquet WHERE logdate=('')
```

The ORC files are created; however, if I now try to do a `SELECT *` with a predicate on that table, I get this:

```
Seek past end of stream
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:235)
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:209)
	at io.assureLoaded(LazyBlock.java:277)
	at io.getLoadedBlock(LazyBlock.java:268)
	at io.$recordMaterializedBytes$0(PageUtils.java:42)
	at io.$tupDictionaryBlockProjection(DictionaryAwarePageProjection.java:211)
	at io.$DictionaryAwarePageProjectionWork.lambda$getResult$0(DictionaryAwarePageProjection.java:197)
	at io.$ProjectSelectedPositions.processBatch(PageProcessor.java:341)
	at io.$ProjectSelectedPositions.process(PageProcessor.java:204)
	at io.$ProcessWorkProcessor.process(WorkProcessorUtils.java:373)
	at io.$flatten$6(WorkProcessorUtils.java:278)
	at io.$3.process(WorkProcessorUtils.java:320)
	at io.$3.process(WorkProcessorUtils.java:307)
	at io.(WorkProcessorUtils.java:221)
	at io.$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.$finishWhen$3(WorkProcessorUtils.java:215)
	at io.(WorkProcessorSourceOperatorAdapter.java:148)
	at io.(Driver.java:379)
	at io.$processFor$8(Driver.java:283)
	at io.(Driver.java:675)
	at io.(Driver.java:276)
	at io.$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
	at io.process(PrioritizedSplitRunner.java:163)
	at io.$n(TaskExecutor.java:484)
	at io.prestosql.$gen.Presto_318_20190829_120913_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Caused by: io.: Malformed ORC file. Seek past end of stream
	at io.seekToCheckpoint(CompressedOrcChunkLoader.java:95)
	at io.seekToCheckpoint(OrcInputStream.java:175)
	at io.seekToCheckpoint(ByteArrayInputStream.java:50)
	at io.seekToCheckpoint(ByteArrayInputStream.java:22)
	at io.openStream(CheckpointInputStreamSource.java:57)
	at io.openRowGroup(SliceDirectStreamReader.java:243)
	at io.readBlock(SliceDirectStreamReader.java:104)
	at io.readBlock(SliceStreamReader.java:77)
	at io.(OrcRecordReader.java:410)
	at io.OrcPageSource$OrcBlockLoader.load(OrcPageSource.java:231)
```

If I do a `SELECT column1, column2` with the same predicate, the query succeeds. I was thinking that maybe some of the data in the original Parquet file were creating the issue, so I created the exact same table but inserted the data with Hive/Tez (that process also completes without any error). When the data are inserted with Hive/Tez, I can query them from Presto without any error, and of course I get the same number of rows with both methods.

In the previous article I already wrote about splits generation (see Tez Internals #2 – Number of Map Tasks for Large ORC Files), and here I would like to share some more details. I have 143 GB of daily data for clicks, located in 33 files in ORC format:

```
$ aws s3 ls s3://cloudsqale/hive/clicks.db/event_dt=/ --summarize
```

Let's take a single file and examine its structure:

```
$ hive --orcfiledump s3://cloudsqale/hive/clicks.db/event_dt=/part-m-00029
```

You can see that the file has 66 stripes, ranging from 70 MB (50 MB for the last, tailing stripe) to 150 MB.
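For intuition about what "Seek past end of stream" signals: the stripe's row-group index records checkpoint offsets into each column stream, and the reader fails when a checkpoint points beyond the bytes actually present. Below is a minimal stdlib-only sketch of that bounds check; the function name and logic are illustrative, not Presto's actual implementation.

```python
import io

def seek_to_checkpoint(stream: io.BytesIO, offset: int) -> None:
    """Seek to a recorded row-group checkpoint, mimicking the bounds
    check that yields "Seek past end of stream" when the index and the
    actual stream contents disagree (illustrative only)."""
    end = stream.seek(0, io.SEEK_END)  # total bytes available in the stream
    if offset > end:
        raise ValueError("Malformed ORC file. Seek past end of stream")
    stream.seek(offset)

# A 100-byte "stream" with one valid and one corrupt checkpoint.
data = io.BytesIO(b"\x00" * 100)
seek_to_checkpoint(data, 64)      # fine: checkpoint is within the stream
try:
    seek_to_checkpoint(data, 250)  # corrupt: checkpoint is past the end
except ValueError as e:
    print(e)                       # prints: Malformed ORC file. Seek past end of stream
```

The point of the sketch is that the error is about inconsistency between the index and the stream data, which is why writing the same rows through a different writer (Hive/Tez) can make the problem disappear.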
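Since the new table adds a bloom filter, here is a hedged sketch of why that matters for the reader: with one bloom filter per row group, an equality predicate can skip any row group whose filter definitely does not contain the value. The class, sizes, and hash scheme below are illustrative only, not ORC's actual bloom filter encoding.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter, conceptually similar to the per-row-group
    filters ORC can store for selected columns (illustrative sizes)."""
    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.filter = 0  # bit set kept in a single integer

    def _positions(self, value: str):
        # Derive `hashes` bit positions from the value (illustrative scheme).
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.filter |= 1 << p

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.filter >> p & 1 for p in self._positions(value))

# One filter per row group: an equality predicate consults the filters
# and reads only the row groups that might contain the value.
row_groups = [["alice", "bob"], ["carol", "dave"]]
filters = []
for rg in row_groups:
    bf = BloomFilter()
    for v in rg:
        bf.add(v)
    filters.append(bf)

to_read = [i for i, bf in enumerate(filters) if bf.might_contain("carol")]
```

A bloom filter can return false positives but never false negatives, so skipping on a negative answer is always safe; that is what makes it useful for pruning row groups on high-cardinality columns.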